REGISTER TRANSFER LEVEL DESIGN OF COMPRESSION PROCESSOR
CORE USING VERILOG HARDWARE DESCRIPTION LANGUAGE
ROSLEE BIN MOHD SABRI
UNIVERSITI TEKNOLOGI MALAYSIA
Specially dedicated to
my beloved wife and family
ACKNOWLEDGEMENTS
First and foremost, I would like to extend my deepest gratitude to my main
project supervisor, Professor Dr. Mohamed Khalil bin Mohd Hani, for giving me the
opportunity to work on new areas of digital system design.
His constant
encouragement, criticism and guidance were key to bringing this project to a fruitful
completion. I have learnt and gained much, not only in the field of research, but also
in the lessons of life.
My sincerest appreciation goes out to all those who have contributed directly
and indirectly to the completion of this research and thesis. Of particular mention is
Ms Hau Yuan Wen for her guidance, advice and motivation.
Without her
continued support and interest, this project and thesis would not have been the same
as presented here.
I would also like to recognize the support of my fellow postgraduate students.
My sincere appreciation also extends to all my colleagues and others who have
provided assistance at various occasions. Their views and tips are useful indeed. At
the same time, the constant encouragement and camaraderie shared between all my
friends during my postgraduate studies has been an enriching experience.
Finally, I would like to express my love and appreciation to my wife who has
shown unrelenting care and support throughout this challenging endeavour.
ABSTRACT
A throughput-independent, parameterized data compression processor core was designed to address the needs of high-speed data compression applications. The design is based on a combination of the LZSS algorithm and Huffman coding, which enables it to compress a wide variety of data types. However, the design has several limitations. It uses several technology-dependent modules that restrict its hardware realization in alternative technologies. In addition, it exhibits abnormal functional behavior when decompressing data that contain sufficiently high redundancy. In view of these limitations, design enhancements are proposed. One enhancement improves the design's portability to any hardware implementation technology. This is accomplished by designing generic hardware to replace the technology-dependent modules and by using a conditional compilation approach to decide the best design realization given the available resources and constraints. With this approach, IP cores designed for the targeted technology should be used to take advantage of efficient resource utilization and proven designs, which leads to faster time-to-market and minimizes the integration risks and verification effort of large systems. If the design does not have access to IP cores, however, generic modules can be instantiated instead, at the expense of development cost. Another enhancement offers a hardware patch that fixes the decompression core hardware bug. The issue was identified to originate from writing and reading the same memory location simultaneously. As a solution, the behavior of the memory controller and its supporting logic is modified to prevent this from occurring. From the design simulation results, it is concluded that the decompression core hardware bug is solved.
ABSTRAK
Satu perkakasan pemadatan data yang berparameter serta mempunyai kadar
pemprosesan bebas telah direka untuk menangani keperluan aplikasi pemadatan data
berkadar laju. Rekaan ini dicipta berasaskan kepada kombinasi algoritma LZSS dan
kod Huffman, yang membolehkan ia digunakan untuk memproses pelbagai jenis
data. Namun, wujud beberapa kelemahan dalam rekaan tersebut. Ia menggunakan
beberapa modul yang bergantung kepada teknologi khusus yang menghadkan
pengrealisasian perkakas jika digunakan dalam sistem teknologi yang berbeza.
Selain dari itu, rekaan tersebut juga mempunyai masalah kelakuan fungsi proses
menidak-padat yang tidak normal apabila ia digunakan untuk memproses data yang
mempunyai kadar ulangan yang tinggi. Justeru, beberapa cadangan membaik pulih
rekaan dihasilkan untuk menangani kelemahan-kelemahan tersebut.
Salah satu
cadangan baik pulih yang dikemukakan membolehkan rekaan tersebut direalisasikan
dalam pelbagai jenis teknologi perkakasan. Perkara ini berjaya dihasilkan menerusi
rekaan perkakas generik untuk menggantikan modul yang bergantung kepada
teknologi khusus dan menggunakan cara kompilasi bersyarat untuk merealisasikan
rekaan berdasarkan kepada sumber dan maklumat yang ada. Menerusi cara ini,
perkakas IP yang direka untuk teknologi tertentu dapat digunakan bagi memastikan
hasil yang efektif dan berkesan, sekaligus memendekkan kadar masa yang diperlukan
untuk memasarkan produk dan mengurangkan risiko pengalihan dan pengesahan
maklumat bagi sistem-sistem yang besar.
Namun, sekiranya rekaan ini tidak
mempunyai akses kepada perkakas IP, modul generik boleh digunakan tetapi
melibatkan kos pembangunan yang tinggi. Satu lagi cadangan baik pulih ialah
menawarkan tampalan perkakasan untuk membaiki pepijat dalam proses menidak-padat.
Masalah ini dikenalpasti berpunca apabila sistem memori diakses untuk
menulis dan membaca pada masa yang sama. Untuk menyelesaikan perkara ini,
kelakuan pengawal memori dan logik sokongannya diubahsuai untuk mengelakkan
masalah tersebut dari berlaku. Dari keputusan simulasi rekaan, dapat disimpulkan
bahawa pepijat dalam proses menidak-padat data telah berjaya diselesaikan.
TABLE OF CONTENTS

CHAPTER    TITLE

           DECLARATION
           DEDICATION
           ACKNOWLEDGEMENTS
           ABSTRACT
           ABSTRAK
           TABLE OF CONTENTS
           LIST OF TABLES
           LIST OF FIGURES
           LIST OF SYMBOLS
           LIST OF APPENDICES

1          INTRODUCTION
           1.1  Background
           1.2  Problem Statement
           1.3  Objectives & Scope of Work
           1.4  Literature Review
                1.4.1  Lempel-Ziv-Storer-Szymanski (LZSS) Compression Algorithm
                       1.4.1.1  Notations and Definition
                       1.4.1.2  LZSS Encoding Process
                       1.4.1.3  LZSS Decoding Process
                1.4.2  Huffman Coding Algorithm
                1.4.3  High-Speed Data Compression Core Design
           1.5  Thesis Organization
           1.6  Summary

2          METHODOLOGY
           2.1  Research Procedure
           2.2  Design Verification Strategy
                2.2.1  Functional and Timing Simulations
                2.2.2  Real-Time Hardware Testing
           2.3  Tools and Techniques
                2.3.1  Verilog Hardware Description Language
                2.3.2  C Software Programming
                2.3.3  Altera System-On-Programmable-Chip (SOPC) Builder Tool
                2.3.4  Altera Quartus II Tool
                2.3.5  Altera Excalibur Hardware Development Kit
           2.4  Summary

3          DESIGN OF DATA COMPRESSION HARDWARE
           3.1  Overview of the Compression Hardware Design
           3.2  Design of Compression Unit
                3.2.1  LZSS Coder
                3.2.2  Fixed Huffman Coder
                3.2.3  Data Packer
           3.3  Design of Compression Interface
           3.4  Overview of the Decompression Hardware Design
           3.5  Design of Decompression Unit
                3.5.1  Data Unpacker
                3.5.2  Fixed Huffman Decoder
                3.5.3  LZSS Expander
           3.6  Design of Decompression Interface
           3.7  Summary

4          DESIGN MODIFICATIONS & HARDWARE ENHANCEMENTS
           4.1  Hardware Portability Issue
           4.2  Generic Dual Port Memory Design
                4.2.1  Design Specifications
                4.2.2  Hardware Architecture
           4.3  Generic Synchronous FIFO Design
                4.3.1  Design Specifications
                4.3.2  Hardware Architecture
           4.4  Hardware Bug of Decompression Processor Core Design
           4.5  Details of Hardware Patch
           4.6  Summary

5          DATA COMPRESSION SYSTEM HARDWARE DEVELOPMENT
           5.1  Data Compression System Hardware Architecture Overview
           5.2  System on Programmable Chip Development
           5.3  Avalon Memory-Mapped Slave Interface Design
                5.3.1  Interface Register Sub-module Design
                5.3.2  Interface Controller Sub-module Design
           5.4  Firmware C Programming
           5.5  Summary

6          DESIGN SIMULATION AND HARDWARE TESTING
           6.1  Design Simulation
                6.1.1  Test Setup
                6.1.2  Test Result of Compression Processor Core
                6.1.3  Test Result of Decompression Processor Core
           6.2  Hardware Testing
                6.2.1  Test Setup
                6.2.2  Test Result of Data Compression System
           6.3  Performance Analysis
                6.3.1  Performance Metrics
                6.3.2  Performance Comparison
           6.4  Summary

7          CONCLUSIONS
           7.1  Concluding Remarks
           7.2  Recommendations for Future Work
                7.2.1  Improvement of LZSS Codeword Generation Module
                7.2.2  Employing Adaptive Huffman Coding Technique

           REFERENCES
           Appendices A - J
LIST OF TABLES

TABLE NO.   TITLE

1.1         Design parameters of the compression and decompression processor cores
3.1         Example of the Huffman encoding table
5.1         Address mapping of the slave peripheral implemented by the interface register sub-module
6.1         Hardware parameters for compression and decompression processor core design simulation
6.2         Results of data compression system hardware test
6.3         Benchmarking results of compression processor core performance between VHDL and Verilog design
6.4         Benchmarking results of decompression processor core performance between VHDL and Verilog design
LIST OF FIGURES

FIGURE NO.   TITLE

1.1          Compression and decompression approach of the processor core design
2.1          Overview of test bench approach for design simulations
2.2          Concept of cross-checking between RTL and behavioral models used in design verification
2.3          Verification environment setup for simulating design functionality of compression and decompression processor core top level modules
3.1          Functional block diagram of the compression hardware
3.2          Block diagram of Compression_Unit
3.3          Hardware architecture of the LZSS coder module
3.4          Connection between PEs
3.5          PE hardware structure
3.6          Reduction tree hardware structure
3.7          Delay tree hardware structure
3.8          ASM flowchart of codeword generator
3.9          Operation of data packer
3.10         Block diagram of the compression interface
3.11         State transition diagram of compression interface input controller
3.12         Functional block diagram of the decompression hardware
3.13         Block diagram of Decompression_Unit
3.14         Operation of data unpacker
3.15         State transition diagram of fixed Huffman decoder
3.16         Hardware architecture of LZSS expander
3.17         State transition diagram of codeword analyzer
3.18         Hardware architecture of decompression dictionary
3.19         Hardware architecture of symbol generator
3.20         Block diagram of the decompression interface
4.1          Hardware architecture of the dual port memory
4.2          Hardware architecture of the synchronous FIFO
4.3          Behavior of dual port memory for simultaneous read and write accesses on different memory locations
4.4          Behavior of dual port memory for simultaneous read and write accesses on same memory locations
4.5          Simplified state transition diagram of the Expander_Codeword_Analyzer sub-module
4.6          Updated state transition diagram of the improved Expander_Codeword_Analyzer sub-module design
4.7          Behavior of the decompression processor core dual port memory after implementation of design improvement
5.1          Overall architecture of the data compression system on a chip design
5.2          Description of Avalon bus slave interface read transfer with one fixed wait-state mechanism
5.3          Description of Avalon bus slave interface write transfer with one fixed wait-state mechanism
5.4          Overview of the Avalon bus slave interface module
5.5          Bit mapping structure of interface register control signals
5.6          Bit mapping structure of LZSS_Status signal
5.7          State machine diagram of the LZSS_Interface_Controller sub-module
5.8          Write data transfer request by host system
5.9          Read data transfer request by host system
5.10         ASM flowchart of the compression process firmware design
5.11         ASM flowchart of the decompression process firmware design
LIST OF SYMBOLS

ASIC   -  Application Specific Integrated Circuit
CAD    -  Computer Aided Design
CPLD   -  Complex Programmable Logic Device
CPU    -  Central Processing Unit
DSP    -  Digital Signal Processing
EDA    -  Electronic Design Automation
FIFO   -  First In First Out
FPGA   -  Field Programmable Gate Array
FSM    -  Finite State Machine
GUI    -  Graphical User Interface
HDL    -  Hardware Description Language
I/O    -  Input/Output
IP     -  Intellectual Property
LCD    -  Liquid Crystal Display
LED    -  Light Emitting Diode
LPM    -  Library of Parameterized Modules
LZSS   -  Lempel-Ziv-Storer-Szymanski
PLD    -  Programmable Logic Device
ROM    -  Read Only Memory
RTL    -  Register Transfer Level
SRAM   -  Static Random Access Memory
SoC    -  System-on-Chip
SOPC   -  System-on-Programmable-Chip
UART   -  Universal Asynchronous Receiver Transmitter
VHDL   -  Very High Speed Integrated Circuit Hardware Description Language
VLSI   -  Very Large Scale Integration
LIST OF APPENDICES

APPENDIX   TITLE

A          Example of LZSS Compression Algorithm
B          Example of Huffman Coding
C          Compression Processor Core Verilog Codes
D          Decompression Processor Core Verilog Codes
E          Generic Memory Module Verilog Codes
F          NIOS System Compression Processor Core Verilog Codes
G          NIOS System Decompression Processor Core Verilog Codes
H          Hardware Test System Firmware C Code
I          Design Simulation Output Waveform
J          Hardware Test Detailed Results
CHAPTER 1
INTRODUCTION
This project implements the register-transfer-level design of proprietary high-speed data compression and decompression processor cores using the Verilog hardware description language. In addition, this project offers enhancements aimed at improving the design's portability to any hardware implementation technology, as well as solving a hardware bug in the decompression processor core design. In this first chapter, an overview of the project background is presented, followed by
discussions on the problem statement, project objectives as well as the scope of
work. An overview of the theory and knowledge involved is also presented. The
organization of this thesis is presented at the end of the chapter.
1.1
Background
In many computing applications, getting the maximum throughput from
limited resources is always desirable. For example, modern communication systems
normally have limitations on their transmission medium's bandwidth utilization for data transfers. To fully utilize the available bandwidth, traffic sent through the medium should not contain any redundant information. This is not necessarily the case, however, because all source information contains inherent redundancies (Shannon, 1948). This means a considerable amount of valuable resources would be
wasted if these redundancies are not removed when transmitting data over the limited
bandwidth medium.
In addition to efficient resource utilization, certain computing applications
require fast information processing and data manipulations in order for the system to
properly operate in real-time. For example, many wireless communication systems
operate in time division duplex mode, where windows of finite time duration are
allocated for data transfers between two communicating terminals. This means all
processing must be completed and the required data must be valid within this time
window to ensure proper communication takes place and to enable other data transfer
windows to be allocated. Otherwise, the communication channel will break down and all
processed data will be rendered useless. Therefore, any improvements in ensuring
optimal physical resource utilization must also take into account the required
processing time of such improvements, so that the real-time performance of the
overall system is not degraded.
A cost-effective way to efficiently utilize limited physical resources in high-speed computing applications is to compress the information processed by such
applications.
Essentially, data compression techniques remove the inherent
redundant information in source data such that the information can be represented by
fewer bits.
Applying this technique in high-speed communication systems, for example, allows more information to be transferred over a limited-bandwidth medium than if the original source data were transmitted. This would
increase the effective bandwidth utilization of the transmission medium and the
overall system performance.
However, data compression techniques are normally computationally
intensive, which means a considerable amount of processing time is required to achieve sufficient compression savings. Therefore, to effectively apply data compression techniques in high-speed computing applications, the complex processing of the data compression algorithms must be performed fast enough to enable the system to operate in real time with the required performance. This means a
high-speed data compression solution is needed in order to effectively utilize limited
resources in any high-speed computing applications.
To fulfill this need, proprietary high-speed data compression and decompression processor cores were designed and developed by Universiti Teknologi Malaysia (Yeem, 2002). The hardware uses data compression techniques based on a combination of the Lempel-Ziv-Storer-Szymanski (LZSS) algorithm and Huffman coding. It was designed as a parameterized module for easy configurability that can provide a suitable compromise between the constraints of hardware resources, processing speed and compression saving. In addition, both processor cores were designed to be easily integrated with any memory-mapped bus system, which lends itself attractively to operation within virtually all modern systems utilizing some kind of processor architecture. Initially, both the compression and decompression processor cores were ported to an ALTERA programmable logic device, the FLEX10KE field programmable gate array (FPGA). As such, the design uses several ALTERA intellectual-property (IP) cores to ease design and development work, and to take advantage of optimized resource utilization of the target device. The IP cores used are the Library of Parameterized Modules (LPM) first-in-first-out buffers (FIFO) and dual port memories.
1.2
Problem Statement
The use of several IP modules targeted for a specific technology means that hardware implementation of the compression and decompression processor cores is limited to ALTERA programmable logic devices that support the FIFO and dual port memories used. At best, hardware implementation of the design in other logic devices or technologies requires all IP modules to be replaced by equivalent hardware in the device or technology of interest. However, if equivalent modules are not available in the target technology, or the design does not have access to IP modules because of licensing requirements, hardware implementation becomes considerably more difficult because the only solution is hardware redesign. Moreover, being technology-dependent means the design has less competitive advantage for commercial applications, simply because potential users require a highly flexible solution for procurement considerations and ease of system maintainability.
In addition to the limitation of the design’s hardware portability, it is found
that the functionality of the decompression processor core is not reliable. When
decompressing data with sufficiently high redundancy, the restored data do not match the original source. When decompressing other sets of source data, however, the decompression processor core outputs match the original source bit-by-bit. This inconsistent behavior means the decompression processor core design does
not meet its required functional specifications, which renders it useless in real
applications.
1.3
Objectives & Scope of Work
In view of the design limitations discussed in the previous section, the
objectives of this project are:
1) To improve the compression and decompression processor core hardware
portability to any programmable logic devices and/or process technologies.
2) To solve the abnormal behavior of the decompression processor core when
processing highly redundant data.
3) To develop a data compression system targeted to a prototyping hardware
platform for real-time design verification and performance analysis.
The scope of work of this project can be divided into three phases. The first
phase involves hardware redesign of the compression and decompression processor
cores using the Verilog hardware description language. The parameterized nature of the original design will be kept to preserve its advantage in terms of hardware reconfigurability. Included in this project phase is the design of generic FIFO and dual-port memory modules to replace the ALTERA IP cores used in the original design.
The generic modules will enable both processor cores to be implemented in any
programmable logic devices and ASIC technologies without requiring any design
modifications. In addition, a hardware patch will be designed to solve the abnormal
functional behavior of the decompression processor core. With this fix, the design is
expected to meet its functional specifications for any level of source data redundancy.
The second phase involves developing a stand-alone data compression system
by embedding the compression and decompression processor cores in a processor-based system. Both cores will function as secondary processors that implement the
required compression and decompression processing tasks to off-load the computational burden from the system processor. This system architecture allows faster processing
of the required data compression computations, so that the whole system can operate
in real-time especially for high-speed applications. In this phase, the work consists
of designing memory-mapped bus slave interfaces for both processor cores to enable
data transfers with the system processor, building the overall system by integrating
necessary components using a CAD tool, and implementing the system onto a
prototyping hardware platform.
The final phase of the project involves developing an embedded firmware
program that provides a mechanism for controlling and utilizing the compression and
decompression processor cores inside the system. The firmware will be written in the C programming language and will be an important tool for testing the data compression system running on actual hardware in real time. This will enable an
easier and more efficient evaluation of the design to verify its compliance to the
functional specifications.
1.4
Literature Review
This section discusses the theory and background knowledge involved in this
project. It provides a general overview of the compression algorithms used, which
are the LZSS compression algorithm and Huffman coding, followed by discussions
on the hardware architecture and design approach of the proprietary high-speed data
compression and decompression processor cores.
1.4.1
Lempel-Ziv-Storer-Szymanski (LZSS) Compression Algorithm
The main data compression technique chosen to be implemented in the design
is the Lempel-Ziv-Storer-Szymanski (LZSS) algorithm. It is a lossless compression
technique, which means no information is lost during the compression and
decompression process. Compared to lossy compression techniques, where some information in the source data is permanently lost in favor of higher compression savings, the LZSS algorithm generally has lower compression performance. However, for applications that cannot tolerate the loss of even a single bit of information, this technique provides such a guarantee. Therefore, the LZSS compression algorithm covers a larger application scope where data compression techniques are concerned.
The LZSS algorithm is also known as a universal data compression
technique. This means the algorithm can be applied to any discrete source, and its
performance is comparable to certain optimal fixed code schemes designed for
completely specified sources (Lempel, 1977).
Using this technique, a priori knowledge of the source data characteristics is not required since the algorithm adaptively constructs an optimal codeword representation of the source data. Coupled with the larger variety of data compression applications it can handle, the proprietary data compression and decompression processor core design certainly has a good competitive advantage for commercial high-speed computing applications.
Using this compression technique, the source data are encoded as LZSS codewords, each represented as a position-length pointer pair that points to parsed strings in a dictionary, or encoding table. The strings are basically repeating symbols of the source data that are stored inside the dictionary. The idea is to replace the representation of the repeated strings with a form that requires fewer bits than the original data, thus representing the source with a smaller number of bits. On the
decompression side, the LZSS algorithm adaptively regenerates the dictionary or
encoding table based on the compressed data characteristics. Therefore, transmission
of the dictionary is not required, which improves processing speed and reduces the bandwidth requirement for communication systems.
1.4.1.1 Notations and Definition
Before the exact mechanics of the coding procedures are described, we need to define the terminology used in the LZSS algorithm.

Definition: The source data strings are over a finite alphabet A of α symbols, say A = {0, 1, …, α−1}. A string S of length l(S) = k over A is an ordered k-tuple S = s_1 s_2 … s_k of symbols from A. To indicate a substring of S which starts at position i and ends at position j, we write S(i, j). When i ≤ j, S(i, j) = s_i s_{i+1} … s_j, but when i > j, we take S(i, j) = Λ, the null string of length zero.

Definition: The concatenation of strings Q and R forms a new string S = QR; if l(Q) = k and l(R) = m, then l(S) = k + m, Q = S(1, k), and R = S(k+1, k+m). For each j, 0 ≤ j ≤ l(S), S(1, j) is called a prefix of S; S(1, j) is a proper prefix of S if j < l(S).

Definition: Given a proper prefix S(1, j) of a string S and a positive integer i such that i ≤ j, let L(i) denote the largest nonnegative integer l ≤ l(S) − j such that S(i, i+l−1) = S(j+1, j+l), and let p be a position of S(1, j) for which L(p) = max_{1 ≤ i ≤ j} L(i). The substring S(j+1, j+L(p)) of S is called the reproducible extension of S(1, j) into S, and the integer p is called the pointer of the reproduction. For example, if S = 00101011 and j = 3, then L(1) = 1 since S(j+1, j+1) = S(1, 1) but S(j+1, j+2) ≠ S(1, 2). Similarly, L(2) = 4 and L(3) = 0. Hence, S(3+1, 3+4) = 0101 is the reproducible extension of S(1, 3) = 001 into S with pointer p = 2.
1.4.1.2 LZSS Encoding Process
1) Set i = 1, and initialize an integer h: h_0 = 0.
2) Initialize buffer B with predefined symbols and the first Ls symbols of the incoming source stream S: B_0 = X^(n−Ls) S(1, Ls), where X is the predefined symbol.
3) For each i:
   a. Determine the reproducible extension of B_{i−1}(1, n−Ls) into B_{i−1}.
   b. Compute the codeword C_i and the integer h_i, and update the contents of the buffer B_i:
      i.  If L(p) of the reproducible extension > 0, then
          C_i = 1 C_{i1} C_{i2}, where C_{i1} = p − 1 and C_{i2} = L(p)
          h_i = h_{i−1} + L(p)
          B_i = B_{i−1}(1 + L(p), n) S(h_i + 1, h_i + Ls)
      ii. If L(p) of the reproducible extension = 0, then
          C_i = 0 C_{i3}, where C_{i3} = B_{i−1}(n−Ls+1, n−Ls+1)
          h_i = h_{i−1} + 1
          B_i = B_{i−1}(2, n) S(h_i + 1, h_i + Ls)
4) If h_i < l(S), then set i = i + 1 and go to Step 3. Else, STOP.
1.4.1.3 LZSS Decoding Process
1) Let D_i denote the content of the buffer before the i-th iteration of the algorithm, where l(D_i) = n − Ls.
2) Set i = 1, and initialize the buffer D with (n − Ls) predefined symbols: D_1 = X^(n−Ls), where X is the predefined symbol.
3) For each i:
   a. Shift the contents of the buffer D_i:
      i.  D'_i = D_i(2, n−Ls) D_i(p_i, p_i)    ; if Flag = 1 (*Note 1), or
      ii. D'_i = D_i(2, n−Ls) H_i              ; if Flag = 0 (*Note 2)
   b. Compute the restored string S_i:
      i.  S_i = D'_i(n−Ls−l_i−1+1, n−Ls)       ; if Flag = 1
      ii. S_i = D'_i(n−Ls, n−Ls)               ; if Flag = 0
   c. Update the contents of the buffer: D_{i+1} = D'_i
4) If C_i is the last codeword, then STOP. Else, set i = i + 1 and go to Step 3.

* Note 1: Determine p_i − 1 and l_i − 1 from the next [log2(n−Ls)] and the next [log2(Ls)] bits of C_i. Apply l_i − 1 shifts, while copying the contents of position p_i in the buffer into position n − Ls.

* Note 2: Determine the explicit symbol (say H_i) from the next l(S_i(1,1)) bits of C_i. Shift the buffer once, while copying H_i into position n − Ls of the buffer.
An example of the encoding and decoding process of the LZSS algorithm is given in Appendix A.
1.4.2
Huffman Coding Algorithm
Huffman coding is an entropy or statistical-based coding technique, which requires a priori knowledge of the source data distribution characteristics in order to construct an optimal encoding table for good performance. It uses variable-length codewords, where fewer bits are assigned to frequently occurring symbols, and more
bits are assigned to symbols that seldom occur. Effectively, the encoded data will
take fewer bits to be represented since most of the frequently used symbols of the
source have been replaced by shorter codes.
The general procedure to construct Huffman codes is as follows:
1) Rank all symbols in order of probability of occurrence.
2) Successively combine the two symbols of the lowest probability to form a new
composite symbol; eventually this builds a binary tree where each node represents the combined probability of all the nodes beneath it.
3) Trace a path from the root to each leaf, noting the direction taken at each node; the recorded directions form the codeword for that symbol.
Appendix B describes the Huffman coding technique in detail.
1.4.3
High-Speed Data Compression Core Design
The technique used in the design of high-speed data compression and
decompression processor cores is based on a combination of the LZSS compression
algorithm and Huffman coding. The source data to be compressed is first processed
by the LZSS compression technique since the algorithm is not restricted in what type
of data it can process, coupled with the fact that it requires no a priori knowledge of
the source. An LZSS codeword is then generated whenever a match between the source data and the dictionary elements is detected, with the encoded data represented as a position-length codeword pair. Generally, the length portion of the LZSS codewords yielded by the algorithm is non-uniformly distributed, with smaller lengths occurring more frequently than longer ones (Yeem, 2002). This suggests that Huffman coding be employed to further encode the length portion of the LZSS codeword in order to achieve a higher compression saving. On the decompression side,
the whole process is performed in the reverse order. Figure 1.1 illustrates this
approach.
Compression: Source data → LZSS Encoding → LZSS Codeword → Huffman Encoding → Compressed data
Decompression: Compressed data → Huffman Decoding → LZSS Codeword → LZSS Decoding → Restored data

Figure 1.1: Compression and decompression approach of the processor core design
The LZSS algorithm, however, involves a computationally intensive matching process during the compression stage because each input phrase has to be compared with every possible phrase in the dictionary. Furthermore, the dictionary updating process involves variable-length shifting of the input source into the dictionary, since the length of the longest matched phrase changes with time. If this operation is done using a variable-length shifter, a considerable amount of hardware resources will be consumed, which can lead to a higher implementation cost because a bigger (and correspondingly more expensive) programmable logic device or ASIC silicon is needed. The design tackles these problems through a systolic array architecture of the LZSS compression dictionary, where each input symbol is compared with every dictionary element simultaneously, while the input data is shifted into the dictionary one symbol at a time through the use of a fixed-length shifter.
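To make the systolic matching idea more concrete, the Verilog sketch below shows one possible processing element (PE): each PE holds a single dictionary symbol, compares it with the broadcast input symbol every cycle, and passes its stored symbol to the next PE through a fixed one-symbol shift. The port names and behavior here are illustrative assumptions only; the actual PE structure and inter-PE connections of the design are given in Figures 3.4 and 3.5.

// Illustrative PE for a systolic LZSS dictionary (a sketch, not the actual design):
// the dictionary is a chain of such PEs, each raising a match flag that a
// reduction tree can combine into the longest-match result.
module lzss_pe #(parameter SymbolWIDTH = 8) (
  input                        clk,
  input                        shift_en,   // advance the dictionary by one symbol
  input  [SymbolWIDTH-1:0]     symbol_in,  // source symbol broadcast to all PEs
  input  [SymbolWIDTH-1:0]     dict_in,    // dictionary symbol from the previous PE
  output reg [SymbolWIDTH-1:0] dict_out,   // dictionary symbol passed to the next PE
  output                       match       // high when the stored symbol equals symbol_in
);
  always @(posedge clk)
    if (shift_en)
      dict_out <= dict_in;                 // fixed-length, one-symbol-per-cycle shift

  assign match = (dict_out == symbol_in);  // all PEs compare simultaneously
endmodule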
The Huffman coding technique also presents certain design challenges.
Conventional Huffman coding requires a priori knowledge of the source data
distribution characteristics in order to construct an optimal encoding table for better
performance. However, in many real-life applications, it is difficult to determine the
characteristics of source data because its probability distribution normally changes
with time.
Even when the source distribution statistics are available, different
sources have different distribution characteristics. The encoding table must then be
generated for each type of source data. Furthermore, the generated table must be
transmitted along with the encoded data so that decompression can be performed
correctly.
This would both reduce the compression saving and increase the
processing time of the hardware. The design tackles these problems by employing a
predefined Huffman encoding table for both compression and decompression cores.
The reason for this is twofold: the first is to simplify generation of the encoding
table since adaptively building the table for different source data is no longer
required. The second reason is to eliminate the need to transmit the encoding table to
the decompression side, so that inefficient resource utilization and degradation of
compression saving issues due to this encoding table transmission can be overcome.
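As an illustration of what a predefined table looks like in hardware, the Verilog sketch below maps the length field of an LZSS codeword to a fixed variable-length code with a simple combinational case statement. The code assignments shown here are hypothetical examples only, not the assignments used by the design (those are given in Table 3.1).

// Sketch of a fixed (predefined) Huffman table for LZSS match lengths.
// The codes below are illustrative; they form a prefix code but are not the
// actual assignments of the fixed Huffman coder (see Table 3.1).
module fixed_huffman_table #(parameter MAXWIDTH = 4) (
  input      [MAXWIDTH-1:0] length,    // length portion of an LZSS codeword
  output reg [4:0]          code,      // Huffman code bits, right-aligned
  output reg [2:0]          code_len   // number of valid bits in 'code'
);
  always @* begin
    case (length)
      4'd1:    begin code = 5'b0;     code_len = 3'd1; end // most frequent length, shortest code
      4'd2:    begin code = 5'b10;    code_len = 3'd2; end
      4'd3:    begin code = 5'b110;   code_len = 3'd3; end
      4'd4:    begin code = 5'b1110;  code_len = 3'd4; end
      default: begin code = 5'b11110; code_len = 3'd5; end // a full table gives each longer length its own code
    endcase
  end
endmodule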
The data compression core design also employs the reconfigurable and
reusable hardware concept. This design concept promotes the use of the existing
hardware in other application domains with different processing requirements
without significant modifications to the original design. As a result, it helps in
speeding up the development cycle of large systems and lowering the cost of implementation. The hardware design achieves this modularity through the use of a scalable hardware architecture and a parameterized design approach, which results in
configurable data compression and decompression processor cores based on suitable
compromises between the constraints of resources, speed and compression saving. In addition, the design allows both processor cores to be integrated with any external memory-mapped system through the use of reconfigurable bus interfaces. Table 1.1 describes the required design parameters and their effects on the generated
hardware in terms of resources, speed and compression saving trade-off, as well as
the suitable interfacing mechanism to the external system:
Table 1.1: Design parameters of the compression and decompression processor cores

Design Parameter   Description
SymbolWIDTH        Width of each input source symbol
DicLEVEL           The number of elements used to build the LZSS dictionary, i.e. 2^DicLEVEL
MAXWIDTH           The predefined maximum match size in parsing symbols into one string, i.e. 2^MAXWIDTH - 1
IniDicValue        The predefined symbol stored in each dictionary element when the dictionary is initialized
InterfaceWIDTH     Width of the interfacing bus which is used to connect the compression/decompression hardware with an external interfacing system, i.e. 2^InterfaceWIDTH bits
PollAmount         Maximum number of data words, each 2^InterfaceWIDTH bits wide, transferred between the compression/decompression hardware and the external interfacing system within an interfacing phase
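To illustrate how these parameters might appear in the RTL, the sketch below declares them on a Verilog module and derives the quantities they control. The parameter names follow Table 1.1, but the default values and the derived localparam names are arbitrary examples, not the defaults of the actual design.

// Sketch of the Table 1.1 design parameters expressed as Verilog parameters.
// Default values are examples only.
module core_parameters;
  parameter SymbolWIDTH    = 8;      // width of each input source symbol
  parameter DicLEVEL       = 9;      // dictionary is built from 2**DicLEVEL elements
  parameter MAXWIDTH       = 4;      // maximum match size is 2**MAXWIDTH - 1 symbols
  parameter IniDicValue    = 8'h00;  // symbol loaded into every dictionary element at initialization
  parameter InterfaceWIDTH = 5;      // interfacing bus is 2**InterfaceWIDTH bits wide
  parameter PollAmount     = 16;     // maximum words transferred per interfacing phase

  // Quantities derived from the parameters above
  localparam DIC_SIZE  = 2**DicLEVEL;         // 512 dictionary elements
  localparam MAX_MATCH = (2**MAXWIDTH) - 1;   // longest match of 15 symbols
  localparam BUS_WIDTH = 2**InterfaceWIDTH;   // 32-bit host interface
endmodule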
1.5
Thesis Organization
The work in this thesis is organized into seven chapters. This first chapter
presents the research background and motivation, followed by its objectives and
scope of work. An overview of the theory and knowledge involved is presented,
before concluding with thesis organization.
Chapter two discusses the research methodology. It starts with discussions
on the design and verification approaches, followed by descriptions of the tools and
techniques used to complete the research work.
Chapter three describes the design of the data compression hardware. This
includes design details of both the compression and decompression processor cores,
as well as their respective interfaces to external host systems.
Chapter four explains the design modifications and hardware enhancements
proposed. It starts by discussing design details of the parameterized and generic
memory modules, followed by discussion on the conditional compilation approach
for the best compromise of hardware implementation. In addition, the root cause of the decompression processor core hardware bug is described, followed by a detailed explanation of the hardware modifications required to solve the issue.
Chapter five discusses the development of a processor-based, stand-alone
data compression system. An overview of the system development is presented,
focusing on the CAD tool used and the approach of embedding custom design with
predefined IP modules. In addition, this chapter describes the details of a memory-mapped bus slave interface design to enable data transfers between the compression
and decompression processor cores and the system processor. This chapter also
describes the system firmware development, which will be used to test the overall
system running on actual prototyping hardware platform.
Chapter six describes the design simulation and hardware tests that are performed on both processor cores, as well as on the complete system, for functional verification and for validating the system performance in real-time operation. In addition, a comparison with the original design is discussed to evaluate the performance of the proposed design enhancements.
Chapter seven summarizes the research work and states all deliverables of the
project. Recommendations for potential future works are also given.
1.6
Summary
In this chapter, an introduction to the background, theories and motivation of this project was presented. Based on these discussions, the objectives of the project were identified, which lead to the scope of work necessary to achieve the desired goals.
In the next chapter, research methodologies used in this project are described.
CHAPTER 2
METHODOLOGY
This chapter presents an overview of the research methodology used in this
project. It starts with discussions on the research procedure, followed by the design
verification strategies, and finally the tools and techniques used.
2.1
Research Procedure
The research project begins by studying the data compression algorithms
used and analyzing the hardware architecture of the proprietary compression and
decompression processor core designs. Once the hardware architecture and design
intent are understood, the project continues with redesigning the RTL models of both processor cores from VHDL into the Verilog hardware description language. The
redesigned processor cores will maintain the same hardware architecture, as well as
still be reconfigurable through a parameterized hardware design approach.
From the design analysis, two limitations were identified. One of these
affects the hardware implementation of the design in alternative logic devices and/or
ASIC technologies due to the instantiation of technology-dependent IP modules.
Another limitation is the abnormal output data behavior of the decompression
processor core when the hardware is processing highly redundant data, but functions
correctly when processing other data sets. Effectively, this means the design does
not meet its required specifications since its functional behavior cannot be assured.
Based on these design limitations, two enhancements are identified and
proposed. The first enhancement is designing generic hardware to replace the IP
cores used in the original design. With these generic modules, it is expected that the
design can be ported to alternative technologies, thus increasing its competitive
advantage for usage in commercial applications. The second enhancement involves
modifying the control module of the dual port memory used for storing the restored
data in the decompression processor core. The objective of this modification is to
prevent simultaneous reading and writing of the same memory location, which is
observed to be the reason for the abnormal decompression output behavior
mentioned previously. By preventing this from happening, output of the dual port
memory, which forms the restored data, can be made predictable at all times, thus
resolving the uncertain output behavior of the decompression process.
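The sketch below illustrates the first enhancement: a conditional compilation switch selects between a vendor IP core and a portable, register-based dual port memory. The macro name USE_VENDOR_IP, the wrapper module with its ports, and the vendor instance name are all hypothetical names chosen for illustration; the actual generic memory design and the hardware patch for the read/write conflict are detailed in Chapter 4.

// Sketch only: conditional selection between a vendor IP core and a generic
// dual port memory. USE_VENDOR_IP and vendor_dual_port_ram are hypothetical names.
module dictionary_memory #(parameter DATA_WIDTH = 8, parameter ADDR_WIDTH = 9) (
  input                       clk,
  input                       we,
  input      [ADDR_WIDTH-1:0] waddr,
  input      [ADDR_WIDTH-1:0] raddr,
  input      [DATA_WIDTH-1:0] wdata,
  output     [DATA_WIDTH-1:0] rdata
);
`ifdef USE_VENDOR_IP
  // Technology-dependent path: wrap the target technology's dual port memory IP
  // core here to benefit from its optimized, proven implementation.
  vendor_dual_port_ram u_mem (.clock(clk), .wren(we), .wraddress(waddr),
                              .rdaddress(raddr), .data(wdata), .q(rdata));
`else
  // Portable path: a generic dual port memory that synthesizes on any FPGA or
  // ASIC technology, at the expense of extra development and resource cost.
  reg [DATA_WIDTH-1:0] mem [0:(2**ADDR_WIDTH)-1];
  reg [ADDR_WIDTH-1:0] raddr_q;
  always @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    raddr_q <= raddr;              // synchronous read: data appears one cycle later
  end
  assign rdata = mem[raddr_q];
`endif
endmodule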
The next step of the project work involves development of an embedded data
compression system, ported to an actual hardware prototyping platform, which will
be used for testing and verifying the compression and decompression processor core
designs in real-time operating conditions. The data compression system will be
based on an ALTERA NIOS processor system, employing ALTERA's proprietary Avalon memory-mapped interconnect fabric.
In this system architecture, the NIOS
processor acts as the system master, while both processor cores are embedded as
slave peripherals on the system bus. In addition, several other components required
for proper operation of the overall system are also integrated to the bus.
This data compression system development consists of two parts. The first
part involves designing a slave interface module that conforms to the Avalon bus
interconnect requirements. This is needed to properly integrate both compression
and decompression processor cores to the system bus because the Avalon
interconnect fabric employs specific interface protocols that ensure proper
communications and data transfers between all components attached to the system
bus. Without the slave interface module, the system controller would not be able to
communicate and control both processor cores to perform the required compression
and decompression tasks. The second part is to integrate all necessary modules to
build a functionally working system when ported to the hardware prototyping
platform of interest. This work is accomplished through the use of a software design
tool. Apart from the compression and decompression processor cores, the overall
system is made up of a processor that functions as the system controller, several
standard peripherals and device controllers, input/output ports, memory modules for
storing data and firmware codes, as well as the interconnecting bus. Once the design
is ready, the whole system is targeted to a programmable logic device populated on
the hardware prototyping platform of choice. In this project, the chosen hardware
platform is ALTERA Excalibur Development Kit, featuring NIOS soft core
embedded processor and APEXEP20K200E programmable logic device (Altera,
2000).
On the software development side, the embedded system firmware will be
developed using the C programming language. The firmware is needed for the system controller to execute both the compression and decompression processing based on the specified handshaking protocols.
Once completed, the firmware will be used
primarily for testing the data compression system, running on the prototyping
hardware platform, to verify its functionality in real-time operating conditions.
2.2
Design Verification Strategy
Two approaches are proposed to validate the functionality of the
compression and decompression processor core Verilog designs. The first approach
involves verifying each RTL module of both the compression and decompression processor cores to ensure no hardware functionality is lost during the VHDL-to-Verilog redesign process. This is done through design simulations using an EDA
software tool, both in functional and timing modes.
The second verification
approach makes use of the hardware evaluation platform, where the embedded data
compression system is ported.
This approach allows faster and easier design
verification because it can be performed in real-time.
2.2.1
Functional and Timing Simulations
In design simulation, every RTL model of the design is treated as a “black box”, which means all verification must be accomplished through the available interfaces, without direct access to the internal states of the design, and without knowledge of its hardware structure and actual implementation. This verification approach forms a true conformance verification that shows a particular design implements the intent of a specification regardless of its implementation (Bergeron, 2001). The idea is to apply known input stimuli and compare the outputs of the black box against known reference outputs to check for consistency. This is achieved by developing a verification environment (also known as a test bench). Figure 2.1 shows the general concept of the test bench approach used in design
simulation.
Figure 2.1: Overview of test bench approach for design simulations
In this verification environment, the driver module provides input test stimuli
to the unit under test. Responses by the unit under test to the test stimuli are then
evaluated by the monitor module, which compares the results with the known reference outputs to determine whether the unit under test is functioning correctly.
Both the input and output test stimuli can be stored in external files and read into the
verification environment during design simulations.
In addition, the whole
verification environment can be developed to automatically perform the required
verification via the use of scripting languages. This is to ease verification efforts of
large and complex designs where a lot of test cases and input stimuli must be
validated.
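A minimal Verilog sketch of this test bench structure is shown below. It assumes a hypothetical 8-bit unit under test named compression_core with a simple valid-flag interface and stimuli stored in hex files; the port names, file names, widths and depths are all illustrative, not those of the actual design.

// Sketch of the driver/monitor test bench concept of Figure 2.1 (names are assumptions).
`timescale 1ns/1ps
module tb_compression_core;
  reg        clk = 1'b0;
  reg        reset = 1'b1;
  reg  [7:0] stimulus [0:1023];   // input test stimuli read from an external file
  reg  [7:0] expected [0:1023];   // known-good outputs read from an external file
  reg  [7:0] din;
  reg        din_valid = 1'b0;
  wire [7:0] dout;
  wire       dout_valid;
  integer    i;
  integer    j = 0;
  integer    errors = 0;

  // Unit under test, treated as a black box
  compression_core uut (.clk(clk), .reset(reset), .din(din), .din_valid(din_valid),
                        .dout(dout), .dout_valid(dout_valid));

  always #10 clk = ~clk;          // free-running clock

  // Driver: apply one stimulus word per clock cycle
  initial begin
    $readmemh("stimulus.hex", stimulus);
    $readmemh("expected.hex", expected);
    repeat (4) @(posedge clk);
    reset = 1'b0;
    for (i = 0; i < 1024; i = i + 1) begin
      @(posedge clk);
      din       <= stimulus[i];
      din_valid <= 1'b1;
    end
    @(posedge clk) din_valid <= 1'b0;
    #2000;                        // allow the core to flush its remaining output
    $display("Simulation finished with %0d mismatches", errors);
    $finish;
  end

  // Monitor: compare every valid output word against the expected stream
  always @(posedge clk)
    if (dout_valid) begin
      if (dout !== expected[j]) begin
        errors = errors + 1;
        $display("Mismatch at word %0d: got %h, expected %h", j, dout, expected[j]);
      end
      j = j + 1;
    end
endmodule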
Normally, this verification approach requires another equivalent model of the
design under test to be co-developed.
This model is known as the simulation or behavioral model of the RTL design. Both models are then subjected to the same input test stimuli and their output responses are compared against each other. Since both designs are functionally equivalent, their outputs should match. Otherwise, the RTL model is assumed to have failed, although in theory the behavioral model against which the RTL model is compared could be the failing design. Figure 2.2 shows this
verification approach.
In standard industry practice, separate design teams are
assigned to work on RTL and behavioral models of the design specifications to
minimize this risk. Essentially, this enables verification through cross-checking the
same design across different implementations, hoping that no similar design bug
occurs (statistically speaking) in both models.
Figure 2.2: Concept of cross-checking between RTL and behavioral models used in design verification (the test bench environment comprises the driver module, the RTL model, the behavioral model and the monitor module).
In this project, the development of equivalent behavioral models of the processor core designs is not part of the scope of work. Therefore, design verification using the conventional test bench approach discussed above does not directly apply in this case. What is available, though, are the RTL models of the original design expressed in VHDL. Naturally, a workaround of the test bench verification approach is proposed, which is to treat the original VHDL design as the behavioral model for comparing and validating the functionality of the compression and decompression processor core design expressed in Verilog. In this case, the main
idea of the test bench verification approach, which is cross-verification of the same design through different implementations, is still partially valid. The reason is that both designs were developed by different designers and expressed in different hardware description languages. It must be mentioned, however, that a flaw still exists in the proposed verification workaround. This is because the RTL models designed in Verilog maintain the same hardware architecture as the original VHDL design in order to preserve the design intent of reconfigurable and reusable hardware.
As such, both design implementations are not actually different,
architecturally-speaking. The only solution to minimize the risks associated with
this flaw is to assume that the original design has been extensively verified, so that it
can be considered as a golden reference model. Fortunately, the VHDL design has
been functionally verified by comparing its results with a software implementation
of the compression and decompression processor cores (Yeem, 2002). Thus, the
assumption of using the original design as a golden reference model is valid, and it
will be the basis for design verification of Verilog RTL models.
Based on the above assumption, the design verification approach will be to
compare results of both VHDL and Verilog models using functional and timing
simulations. If the simulation results of both designs are the same, then it can be
concluded they are functionally equivalent.
A disadvantage of this approach,
however, is that a huge number of test scenarios is needed to cover every possible aspect of the design, thus leading to an increase in development time. Due to time limitations, a comprehensive design simulation cannot be done and a compromise must be made. Therefore, design verification will be done by simulating the Verilog RTL models using a finite set of test stimuli and comparing the results to those of the original VHDL design. It is assumed that the Verilog RTL design of the compression and decompression processor cores meets its required functional specifications if the results of both designs match for the entire finite set of test stimuli used.
In practice, whenever hardware is redesigned from one language into another, as is the case in this project, the design equivalence checking approach used is formal verification, where the functional equivalence of both hardware designs is proven through sophisticated mathematical algorithms. However, this project does not have access to such formal verification tools; therefore, equivalence checking using this approach cannot be done.
After all the RTL modules of the compression and decompression processor core design have been simulated and functionally verified to be equivalent with the original design, another design simulation approach is proposed for verifying the top level module design. An advantage of this project is that both the compression and decompression processor cores are designed, where each core implements the complementary function of the other. This means design verification can be
performed through a cascaded compression-decompression system approach. The
idea is to apply test stimuli to the inputs of the compression processor core, and then
to use the resulting compressed data outputs as inputs to the decompression
processor core. Finally, the restored data outputs of the decompression processor
core are compared with the original input test stimuli.
By performing data
compression and decompression in a cascade system like this, there should not be
any difference between the input test stimuli and the outputs produced by the
compression system if both processor cores are functionally working. Therefore,
this criterion should be sufficient to validate the top level module of both processor
core designs. Figure 2.3 summarizes this approach.
Figure 2.3: Verification environment setup for simulating design functionality of
compression and decompression processor core top level modules.
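The cascaded arrangement can be expressed directly in a structural test bench, as sketched below: the compressed stream of the compression core is wired straight into the decompression core, and a monitor (not shown) compares every restored word with the corresponding original source word. As in the previous sketch, the module and port names are assumptions made for illustration.

// Sketch of the cascaded compression-decompression verification setup (Figure 2.3).
module tb_cascade;
  reg        clk = 1'b0;
  reg        reset = 1'b1;
  reg  [7:0] src;                       // original source symbol driven by the test bench
  reg        src_valid = 1'b0;
  wire [7:0] comp_data, restored;
  wire       comp_valid, restored_valid;

  // Compression core output feeds the decompression core input directly
  compression_core   c0 (.clk(clk), .reset(reset),
                         .din(src),       .din_valid(src_valid),
                         .dout(comp_data), .dout_valid(comp_valid));
  decompression_core d0 (.clk(clk), .reset(reset),
                         .din(comp_data), .din_valid(comp_valid),
                         .dout(restored),  .dout_valid(restored_valid));

  always #10 clk = ~clk;

  // A driver applies the source stream to 'src', and a monitor queues each driven
  // symbol and checks it against 'restored' whenever restored_valid is asserted;
  // any mismatch indicates a functional failure in one of the cores.
endmodule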
2.2.2
Real-time Hardware Testing
The design verification discussed in the previous section is done in a software simulation environment. Although modern simulation tools do provide cycle-accurate and correct hardware modeling, their use for “at-speed” testing is limited by the capability and performance of the machines on which these simulation tools run. In the case of modern ASIC design, a software simulation running on a really
high-end, and correspondingly expensive, computer platform will be lucky to
achieve equivalent simulation speeds of more than a few Hz, that is, a few cycles of
the main system clock for each second in real time (Synplicity Inc, 2005).
Practically, this means that detailed software simulations can be performed on only
small portions of the design. Thus, in order to achieve high simulation speed, it is
necessary to use some form of hardware-assisted verification approach.
In view of this, a data compression system is developed which is then
targeted to a hardware prototyping platform. Details of this hardware evaluation
platform development will be presented in Chapter 5.
Using this evaluation
platform, the compression and decompression processor core designs can be verified in
real time, thus speeding up the design verification process.
Furthermore, the
hardware evaluation platform allows easy verification using larger sets of test
stimuli since the whole verification process can be automated. In order to utilize this
hardware evaluation platform, its test firmware needs to be developed, which will be
used by the system processor to execute the compression and decompression
processor core functions. In this project, the test firmware is developed based on the
cascaded compression-decompression system verification approach described in
previous section. Essentially, the system controller sends a set of source data to be
compressed by the compression processor core. The compressed data outputs are
then stored in the system internal memory.
Once all source data have been
processed, the stored compressed data are then sent to the decompression processor
core to be restored. The restored data outputs of the decompression processor core
are then compared with the original source data by the system controller to verify
whether both processor cores are functionally working. Results of this test firmware can easily be viewed externally via a debugging port or channel. The hardware evaluation platform uses its on-board UART communication port to provide this debugging facility.
2.3
Tools and Techniques
To successfully accomplish the objectives of this research project,
knowledge of several design tools and techniques is required. The following
sections present an overview of the required tools and techniques.
2.3.1
Verilog Hardware Description Language
Digital hardware design complexities have grown exponentially within the
last decade, where multi-million transistor designs are quite common.
This
exponential growth is fueled by new advances in design, as well as in fabrication
technology.
The usage of hardware description language to model, simulate,
synthesize, analyze, and test digital designs has been the basis of this rapid
development.
Verilog is one of the hardware description languages that are commonly in
use. It is designed to enable descriptions of complex and large designs in a precise
and concise manner. This ability spans the range from descriptions that express
conceptual and architectural design to detailed descriptions of implementations in
gates and transistors. It was designed to unify the design process, covering behavioral, RTL, gate and switch level modeling, stimulus, user interfaces, test benches, and unified interactive and batch modes (Sagdeo, 1998). Verilog is a complete language that is used as a
formal method for describing digital designs, as well as performing hardware
simulation and synthesis. Verilog HDL is now used extensively for digital designs
in ASIC, FPGA, microprocessor and DSP implementations. In this project, Verilog
is chosen as the hardware design language of choice.
2.3.2
C Software Programming
C has been the de-facto programming language used by software developers
to develop software applications running on personal computers and commercial
servers, as well as embedded firmware drivers in many computing gadgets such as
cellular phones, CD players, household appliances, etc. Due to its wide popularity,
almost all the available commercial and educational software development tools
support software written in C. As such, C is a natural choice for developing the embedded system test firmware due to the enormous resources available.
2.3.3
ALTERA System-On-Programmable-Chip (SOPC) Builder Tool
SOPC Builder is a tool for composing bus-based systems out of library
components such as CPUs, memory interfaces, and I/O peripherals. SOPC Builder
can either import directly, or provide an interface to, user-defined blocks of logic.
SOPC Builder generates a single system module that instantiates a list of user-specified components and interfaces, as well as the bus interconnection logic
(Altera, 2003b).
SOPC Builder library components can be very simple blocks or they can be
complex, parameterized, dynamically-generated subsystems.
Many ALTERA
SOPC Builder library components include wizard-based graphical interfaces for
configuration, and HDL generator programs that can deliver the component as either
synthesizable Verilog or VHDL. SOPC Builder simplifies system-level generation
by generating files for both synthesis and simulation.
2.3.4
ALTERA Quartus II Software
The ALTERA Quartus II software is a comprehensive tool for system-on-a-programmable-chip (SOPC) design.
It provides a complete, multiplatform design
environment that offers hardware description language (HDL) and schematic design
entry, compilation and logic synthesis, full simulation and worst-case timing
analysis, logic analysis and device configuration (Altera, 2003a). The Quartus II design software also offers a unified design flow for the development of Field Programmable Gate Array (FPGA), Complex Programmable Logic Device (CPLD), and structured Application Specific Integrated Circuit (ASIC) devices. Its benefits include accelerated design time, increased performance and productivity, enhanced functionality, and easy handling of potential design delays such as post place-and-route design changes.
2.3.5 ALTERA Excalibur Hardware Development Kit
The Excalibur hardware development kit is designed as a prototyping
platform that provides hardware designers with an immediate and economical
solution to hardware and software co-design and co-verification (Altera, 2000). The
platform uses a high-density APEX EP20K200EFC484 programmable logic device
(PLD) on-board and it supports the development and debugging of ALTERA
microprocessor-based SoC designs by incorporating programmable logic with a
variety of memory devices, interface resources, and debugging capability. The
platform includes physical interfaces for widely used standard interconnects and a
full-featured, configurable memory subsystem ideal for prototyping embedded
systems and SOPC designs. The platform supports multiple clocks, and a wide
variety of high performance I/O ports for developing complex communication
systems.
2.4 Summary
The research approach and verification strategy, as well as tools and
techniques used in this project are presented. In the following two chapters, details
of the proposed design enhancements are described.
CHAPTER 3
DESIGN OF DATA COMPRESSION HARDWARE
This chapter discusses the details of data compression hardware design. Both
the compression and decompression processor cores, as well as their respective
interfacing modules are presented.
The designs are described in parameterized
Verilog hardware description language so that the hardware can be easily configured
to meet the constraints of resources, speed and compression saving, as well as to
interface with any memory-mapped host system.
3.1 Overview of the Compression Hardware Design
The hardware architecture of the compression core consists of an interface
block, called Compression_Interface, and the main compression hardware, called
Compression_Unit.
Compression_Interface is the controller responsible for
interfacing Compression_Unit to the host system. The general operation of the
compression hardware is as follows:
1. When the compression core is ready to receive data, it collects source data from the host system.
2. When the host system is ready to receive data, the compression core sends the compressed data to the host system.
3. The compression core status is monitored to determine its availability for compressing source data.
Figure 3.1 shows the functional block diagram of the compression hardware.
Figure 3.1: Functional block diagram of the compression hardware
In the following sections, design details of the compression unit and its sub-modules, as well as the compression interface, are presented.
3.2 Design of Compression Unit
The main hardware module of the compression unit consists of three
hierarchical blocks, which are the LZSS coder, fixed Huffman coder and data packer.
All modules are synchronously clocked.
The LZSS coder performs the LZSS
encoding of the source data symbol, while the fixed Huffman coder re-encodes the
length of LZSS codeword to achieve better compression ratio. Finally, the data
packer packs the unary codes from the fixed Huffman coder into a fixed-length
output packet and sends it to the interfacing block. Design details of these blocks are
further elaborated in the following sub-sections. Figure 3.2 shows the block diagram
of Compression_Unit.
Figure 3.2: Block diagram of Compression_Unit
3.2.1 LZSS Coder
In order to achieve sufficiently high processing speed with data-independent throughput, and to use a fixed-length shifter to reduce hardware resource utilization, the LZSS coder design employs a systolic array architecture. The hardware architecture is shown in Figure 3.3 and consists of four main components, namely the dictionary, reduction tree, delay tree and codeword generator sub-modules.
Figure 3.3: Hardware architecture of the LZSS coder module
The functions of the LZSS dictionary are to store the previously seen symbols of the source data and to compare every new input symbol with all symbols currently stored in the dictionary. It is organized as a systolic array made up of several processing elements (PE). Each PE stores a copy of a source symbol being serially shifted into the dictionary. Effectively, as the dictionary is updated every clock cycle, the content of each PE shifts one position leftward to the next PE and a new source symbol is shifted into the first PE in the dictionary. This allows a fixed-length shifter to be used to send streams of source symbols into the dictionary one at a time. Figure 3.4 shows the connections between each PE in the dictionary. For N = DicLEVEL, the systolic array dictionary consists of 2^N processing elements.
Figure 3.4: Connection between PEs
Each PE consists of a register, a comparator and a counter. The register stores a previously seen symbol, while the comparator compares a new source symbol with the symbol that is currently stored in the register. The counter computes the maximum length of the matched phrase ending at that PE. It does this by incrementing whenever a new source symbol matches the symbol currently stored in the register; otherwise, the counter resets to zero. Therefore, the longest matched phrase length can be determined from the counter value just before it resets; for example, if the counter resets to zero from a value L, then L is the longest matched length. Figure 3.5 shows the architecture of a typical PE.
Figure 3.5: PE hardware structure
General operation of the dictionary is described below:
1) For PE_N, S_N is the symbol stored.
2) In every cycle, S_N is compared to a new source symbol, S_0.
3) If there is a match, the counter increments to L_N, where L_N represents the longest length of matched phrase ending at PE_N. Otherwise, the counter resets to zero.
4) At the end of each cycle, L_N is sent to the reduction tree. S_N is shifted to the next PE_(N+1) while S_(N-1) is shifted into PE_N.
5) For the next input source symbol, repeat Steps 1 to 4.
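As an illustration only (module, port and width names here are hypothetical; the actual source is in Appendix C), a single PE combining the register, comparator and counter described above might be sketched in Verilog as:

// Illustrative sketch of one dictionary processing element (PE).
// Names and widths are assumptions; the actual design is in Appendix C.
module lzss_pe (clk, restart, enable, symbol_in, prev_symbol, symbol_out, length_out);
    parameter SymbolWIDTH = 8;   // width of a source symbol
    parameter MAXWIDTH    = 4;   // width of the match-length counter

    input                     clk;
    input                     restart;      // end of the input stream
    input                     enable;       // shift/compare enable
    input  [SymbolWIDTH-1:0]  symbol_in;    // current input symbol, S_0
    input  [SymbolWIDTH-1:0]  prev_symbol;  // symbol arriving from the previous PE
    output [SymbolWIDTH-1:0]  symbol_out;   // symbol passed on to the next PE
    output [MAXWIDTH-1:0]     length_out;   // L_N, matched length ending here

    reg [SymbolWIDTH-1:0] stored_symbol;    // register holding S_N
    reg [MAXWIDTH-1:0]    match_length;     // counter holding L_N

    always @(posedge clk) begin
        if (restart) begin
            stored_symbol <= {SymbolWIDTH{1'b0}};
            match_length  <= {MAXWIDTH{1'b0}};
        end else if (enable) begin
            // counter increments on a match, resets on a mismatch
            if (symbol_in == stored_symbol)
                match_length <= match_length + 1'b1;
            else
                match_length <= {MAXWIDTH{1'b0}};
            // dictionary update: shift the previous PE's symbol in
            stored_symbol <= prev_symbol;
        end
    end

    assign symbol_out = stored_symbol;
    assign length_out = match_length;
endmodule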
Based on the counter values from each PE, the reduction tree extracts the longest matched phrase length and computes the position where it is detected. The reduction tree consists of 2^N - 1 two-to-one multiplexers, where N = DicLEVEL, and each multiplexer selects the larger of its two inputs. The multiplexers are arranged in N layers, with each layer consisting of 2^n multiplexers, n = 0, 1, ..., N-1. Because of this, there is a latency of N cycles before the output of the reduction tree for the current matched phrase comparison is available to the next sub-module. The output of the reduction tree module, called Match Result, consists of a pair of (length, position) information. The length is the size of the current prefix that matches a sub-string in the dictionary, while the position represents the ending dictionary position of that sub-string. Figure 3.6 shows the structure of the reduction tree module.
Figure 3.6: Reduction tree hardware structure
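As an illustration only (hypothetical names, not the Appendix C source), each reduction-tree node can be viewed as a compare-select element that forwards the larger length together with its position; in the actual design each layer output is registered, which accounts for the N-cycle latency noted above:

// Illustrative sketch of one reduction-tree node (hypothetical names).
// N layers of such nodes extract the overall longest match and its position.
module reduce_node (len_a, pos_a, len_b, pos_b, len_out, pos_out);
    parameter MAXWIDTH = 4;    // width of the length field
    parameter POSWIDTH = 8;    // width of the position field

    input  [MAXWIDTH-1:0] len_a, len_b;
    input  [POSWIDTH-1:0] pos_a, pos_b;
    output [MAXWIDTH-1:0] len_out;
    output [POSWIDTH-1:0] pos_out;

    // select the larger length and carry its position along with it
    assign {len_out, pos_out} = (len_a >= len_b) ? {len_a, pos_a}
                                                 : {len_b, pos_b};
endmodule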
The delay tree is used to compensate for the latency through both the systolic
array dictionary and reduction tree for each source symbol. It consists of N + 1 shift
registers, arranged in a linear fashion. Each shift register adds a delay of one clock
cycle to the source symbol, so that each input source symbol and its corresponding
Match Result output from the reduction tree enter the codeword generator module at
the same time. Figure 3.7 shows the structure of the delay tree module.
Figure 3.7: Delay tree hardware structure
The final module in the LZSS coder chain is the codeword generator. Its function is to generate the appropriate LZSS codeword from the Match Result information received. There are three possible codewords that could be generated. The first is a codeword that functions as a pointer, generated when a proper matched phrase prefix is found inside the dictionary. This codeword is represented as 1 C_L C_P, where C_L (> 0) is the length of the found prefix, and C_P is the ending position of a phrase in the dictionary that matches the found prefix. The second possible codeword is generated when there is no match between the source symbol and any symbol in the dictionary. In this case, a codeword that functions as an explicit symbol, i.e. 0 C_E, is generated, where C_E is the corresponding input source symbol. The third possible codeword is generated when the end of an input source data stream is detected, i.e. when Restart = '1' is issued by the Compression_Interface. In this case, a codeword that functions as an end marker is generated as a notification for the decoding process to end the data stream that is currently being decoded. The end marker codeword is represented as 1 C_L C_R, where C_L is zero and C_R represents the total number of bits to be omitted during the decoding process. The codeword generator is implemented as a finite state machine controller, whose behavior is described by the ASM flowchart in Figure 3.8 (Yeem, 2002).
[Figure 3.8 flowchart: the controller collects the input symbol, Restart, Enable and Match Result information from the delay tree and reduction tree, tracks the running match length Mlen against the reported Length, and accordingly generates codewords that function as pointers, explicit symbols or end markers, using a pending register to hold a codeword until it can be output.]
Figure 3.8: ASM flowchart of codeword generator
3.2.2 Fixed Huffman Coder
As discussed earlier, the Huffman coding technique is used to encode the length segment of an LZSS codeword. However, for Huffman coding to function correctly, both the coder and decoder modules must agree on the same encoding table. Normally, this is done using a two-pass approach. The first pass collects frequency counts of the symbols in the data stream, followed by the construction of a Huffman tree or encoding table, and transmission of the tree to the decoder. The second pass encodes and transmits the data symbols themselves, based on the static tree structure. A disadvantage of this approach is that it slows down the compression processing due to the extra computations needed to build the encoding table, and reduces the effective compression saving due to transmission of the encoding table to the decoder. In this design, a predefined encoding table is used to avoid both slowing down the processing and transmitting the encoding table to the decoder. The predefined code table represents the same encoding structure regardless of the source data being compressed.
The Huffman encoding is applied only to the C_L portion of the LZSS codeword, bypassing any other data, such as C_P, C_E, and C_R. The encoding table is constructed by assigning fewer bits to shorter C_L values, and vice-versa. Effectively, in the compression stage, the length field of the LZSS codeword is compared against the predefined code table to determine the encoded data to be output. In the decompression stage, the same table-lookup technique is used, except that the input is the encoded data and the output is the restored LZSS codeword length. Table 3.1 shows an example of the encoding table defined for a C_L length of 6 bits.
Table 3.1: Example of Huffman encoding table

Length of LZSS Codeword | Huffman code  | Range        | Number of Bits Difference
000000                  | N/A           | Won't happen | N/A
000001                  | 0             | 1            | -5
00001X                  | 10X           | 2-3          | -3
0001XX                  | 110XX         | 4-7          | -1
001XXX                  | 1110XXX       | 8-15         | +1
01XXXX                  | 11110XXXX     | 16-31        | +3
1XXXXX                  | 111110XXXXX   | 32-63        | +5
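As an illustration only (module and signal names are hypothetical; the actual coder in Appendix C also produces the Valid_Count information described below), one way such a fixed table lookup could be expressed in Verilog is:

// Illustrative sketch of the fixed encoding of Table 3.1 as a
// combinational lookup (hypothetical names, left-aligned output code).
module fixed_huffman_lut (length_in, code_out, code_bits);
    input  [5:0]  length_in;   // 6-bit C_L field of the LZSS codeword
    output [10:0] code_out;    // Huffman code, left-aligned
    output [3:0]  code_bits;   // number of valid bits in code_out
    reg    [10:0] code_out;
    reg    [3:0]  code_bits;

    always @(length_in) begin
        casez (length_in)
            6'b000001: begin code_out = 11'b00000000000;                  code_bits = 4'd1;  end
            6'b00001?: begin code_out = {2'b10, length_in[0], 8'b0};      code_bits = 4'd3;  end
            6'b0001??: begin code_out = {3'b110, length_in[1:0], 6'b0};   code_bits = 4'd5;  end
            6'b001???: begin code_out = {4'b1110, length_in[2:0], 4'b0};  code_bits = 4'd7;  end
            6'b01????: begin code_out = {5'b11110, length_in[3:0], 2'b0}; code_bits = 4'd9;  end
            6'b1?????: begin code_out = {6'b111110, length_in[4:0]};      code_bits = 4'd11; end
            default:   begin code_out = 11'b0; code_bits = 4'd0; end      // length 000000 cannot occur
        endcase
    end
endmodule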
Besides performing the Huffman encoding, the fixed Huffman coder also computes the length of each generated codeword. This information, called Valid_Count, is used by the data packer to pack the variable-length Huffman codewords into fixed-length output packets. In addition, the fixed Huffman coder asserts a control signal, Last_Data = '1', when it detects the LZSS end marker codeword, to notify the data packer to send all valid data to the Compression_Interface module.
3.2.3 Data Packer
The data packer consists of two registers and a sorting buffer. The output
data to be sent is stored in one of the two registers, whereas its total bit amount is
stored in the second register. During the packing process, the total bit count is used
by the sorting buffer to sort and to merge the output data with the unary codeword
from the fixed Huffman coder. The procedure to be done in this module is described
in Figure 3.9.
[Figure 3.9 flowchart: a unary codeword received from the fixed Huffman coder is sorted and merged with the output data in the sorting buffer, and the total bit count is updated (count = count + valid_count); when count >= 2^InterfaceWIDTH, a fixed-length output is sent to the interfacing block and the count is updated (count = count - 2^InterfaceWIDTH).]
Figure 3.9: Operation of data packer
The decision to send data to the interfacing block depends on a trigger issued by the fixed Huffman coder, as discussed earlier. When triggered, a fixed-length output is generated if the total bit count is greater than or equal to 2^InterfaceWIDTH; in this condition, two sending cycles are needed. Otherwise, the data packer sends all data in the output register even though the total bit count is less than 2^InterfaceWIDTH.
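As an illustrative sketch of the bit-count bookkeeping just described (names are hypothetical and the sorting buffer itself is omitted; the actual data packer is in Appendix C):

// Illustrative sketch of the data packer's bit-count bookkeeping only.
// OutWIDTH plays the role of the 2^InterfaceWIDTH-bit output packet size.
module packer_count (clk, reset, valid_in, valid_count, send_packet, count);
    parameter InterfaceWIDTH = 4;
    parameter OutWIDTH       = 1 << InterfaceWIDTH;   // 2^InterfaceWIDTH
    parameter CountWIDTH     = InterfaceWIDTH + 2;

    input                    clk, reset;
    input                    valid_in;      // a new Huffman codeword has been merged
    input  [CountWIDTH-1:0]  valid_count;   // its bit length (Valid_Count)
    output                   send_packet;   // a full fixed-length packet can be sent
    output [CountWIDTH-1:0]  count;         // valid bits currently held
    reg    [CountWIDTH-1:0]  count;

    // count after merging the incoming codeword
    wire [CountWIDTH:0] new_count = count + valid_count;

    // a fixed-length output is generated once at least 2^InterfaceWIDTH
    // bits have been accumulated
    assign send_packet = valid_in && (new_count >= OutWIDTH);

    always @(posedge clk) begin
        if (reset)
            count <= {CountWIDTH{1'b0}};
        else if (valid_in)
            count <= send_packet ? (new_count - OutWIDTH) : new_count;
    end
endmodule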
There is a limitation reported for the data packer design. Normal operation only allows one fixed-length data packet to be sent to the interfacing block in every cycle. However, more than one fixed-length packet could be available when the total bit count of the output data is greater than or equal to twice 2^InterfaceWIDTH. Therefore, suitable design parameters must be chosen to avoid this design limitation. For correct operation, the value of 2^InterfaceWIDTH should not be less than any of (1 + SymbolWIDTH), (2 * MAXWIDTH + DicLEVEL), and (1 + MAXWIDTH + [log2(SymbolWIDTH)]) (Yeem, 2002). Please refer to Table 1.1 of Chapter 1 for detailed descriptions of the SymbolWIDTH, MAXWIDTH, DicLEVEL and InterfaceWIDTH parameters.
3.3 Design of Compression Interface
The hardware architecture of the interfacing block is shown in Figure 3.10. It
consists of FIFO buffers to provide the infrastructure for data transfers between the
compression unit and external host systems, as well as control modules that ensure
proper operations of the interfacing protocols. The input controller extracts data
from input buffer to be sent to the compression unit, while output controller collects
and sends compressed data to output buffers.
Figure 3.10: Block diagram of the compression interface
During the input interfacing phase, the input controller issues a control signal called In_Readyn to inform the host system whether the compression unit can receive new source data or not. When In_Readyn = '0', the controller is ready to receive a series of new inputs from the host system. Two types of input information are required: the actual source data, which are fed into the Interface_Data_In FIFO, and the size of the source data to be compressed, expressed in bits, which is fed into the Poll_Amount_In FIFO. The sending of input data from the host system should follow this procedure:
1) Wait until In_Readyn = '0'.
2) Send a series of fixed-length source data to the Interface_Data_In FIFO.
3) Send the total number of bits in the series of source data to the Poll_Amount_In FIFO. The total bit amount should not exceed PollAmount * 2^InterfaceWIDTH. If the amount is less than 2^InterfaceWIDTH, the input controller takes it as a notification to end the current source data stream. Go back to Step 1.
For sending the source data to the compression unit for processing, the input
controller first checks the size of source data to be compressed. Based on the size,
source data are fed into the compression unit one symbol at a time. Once all source
data have been sent, the input controller issues a control signal called Restart to mark
the end of the input data stream. The input controller is implemented as a finite state
machine and its simplified state transition diagram is shown in Figure 3.11.
[Figure 3.11 states: NO_DATA, CHECK_AMOUNT, NORMAL_PROCESS, FINAL_PROCESS and RESTART; transitions depend on Poll_Amount_In_Empty, Poll_Amount_In compared against 2^InterfaceWIDTH, Poll_Count, Valid_Count and SymbolWIDTH.]
Figure 3.11: State transition diagram of compression interface input controller
The output controller computes the total size in bits, called Poll_Amount_Out, of the data it receives from the compression unit, while the actual compressed data are fed into the Interface_Data_Out FIFO as they become ready. Normally, the Poll_Amount_Out FIFO is updated with the necessary values once the total bit count of the compressed data equals PollAmount * 2^InterfaceWIDTH. As such, external host systems can monitor the Poll_Amount_Out_Empty status signal to know when compressed data are ready to be polled. The compressed data polling process by the host system should follow this procedure:
1) Wait until Poll_Amount_Out_Empty = ‘0’.
2) Check Poll_Amount_Out value to know the size of compressed data to be
polled.
3) Collect the output data from Interface_Data_Out FIFO based on the amount
of Poll_Amount_Out. If Poll_Amount_Out is zero, the host system should
take it as a notification to end the current polling of compressed data stream.
Design details and Verilog source codes of the compression hardware and all
its sub-modules are available in Appendix C.
3.4 Overview of the Decompression Hardware Design
The hardware architecture of the decompression processor core consists of an
interface block, called Decompression_Interface, and the main decompression
hardware, called Decompression_Unit. Decompression_Interface is the controller
responsible for interfacing Decompression_Unit to the host system. The general
operation of the decompression hardware is as follows:
1) When the decompression core is ready to receive data, it collects compressed data from the host system.
2) When the host system is ready to receive data, the decompression core sends the restored data to the host system.
3) The decompression core status is monitored to determine its availability for the decompression process.
Figure 3.12 shows the functional block diagram of the decompression hardware.
Figure 3.12: Functional block diagram of the decompression hardware
In the following sections, design details of the decompression unit and its
sub-modules, as well as the decompression interface are presented.
3.5 Design of Decompression Unit
Figure 3.13 shows the block diagram of the decompression unit. It consists of a data unpacker, fixed Huffman decoder, LZSS codeword FIFO buffer and LZSS expander. All modules are synchronously clocked. The data unpacker unpacks fixed-length compressed inputs from the interface block, and the unpacked data are decoded by the fixed Huffman decoder to restore the LZSS codewords. The FIFO buffer stores the LZSS codewords and feeds them to the LZSS expander, which restores the source symbols corresponding to each received LZSS codeword and sends them to the interface block.
Figure 3.13: Block Diagram of Decompression_Unit
3.5.1 Data Unpacker
The data unpacker consists of a register and a sorting buffer. The register stores the previous compressed input received from the interface block, while the sorting buffer sorts and merges the previous input with the current compressed input. The sorting and merging processes depend on the total amount of valid bits in the sorting buffer. This valid bit amount is computed by the fixed Huffman decoder. Figure 3.14 illustrates the data unpacking process.
Figure 3.14: Operation of data unpacker
3.5.2 Fixed Huffman Decoder
The fixed Huffman decoder performs Huffman decoding based on a predefined encoding table identical to the one used in the compression phase. During the decoding process, a Huffman codeword is extracted from the unpacked data, and the validity of the extracted codeword is evaluated. If a valid Huffman codeword is detected, the decoder restores the corresponding LZSS codeword based on the predefined encoding table, and the codeword is written into the external LZSS codeword FIFO. In addition, the Huffman decoder determines the length of the extracted Huffman codeword, which is used to compute the valid bit amount used by the sorting buffer of the data unpacker module.
The Huffman decoder module is implemented as a finite state machine controller. Figure 3.15 presents an overview of its state transition diagram. Upon reset, the controller is in the NO_DATA state, waiting for new input data to be decoded. Whenever data is available, it moves to the UPDATING state to collect unpacked data from the data unpacker module. Next, it moves to the DECODING state, where the actual Huffman decoding process takes place (Yeem, 2002).
Figure 3.15: State transition diagram of fixed Huffman decoder
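As an illustration only, the three-state controller of Figure 3.15 could be skeletonised in Verilog as follows; the input signal names and the return transitions out of DECODING are assumptions, not the exact code in Appendix D:

// Illustrative skeleton of the decoder's three-state controller
// (state names follow Figure 3.15; everything else is hypothetical).
module huffman_decoder_fsm (clk, reset, data_available, decode_done, state);
    input        clk, reset;
    input        data_available;   // unpacked data ready from the data unpacker
    input        decode_done;      // current codeword fully decoded
    output [1:0] state;
    reg    [1:0] state;

    parameter NO_DATA  = 2'd0,
              UPDATING = 2'd1,
              DECODING = 2'd2;

    always @(posedge clk) begin
        if (reset)
            state <= NO_DATA;
        else begin
            case (state)
                NO_DATA:  if (data_available) state <= UPDATING;
                UPDATING: state <= DECODING;   // collect unpacked data, then decode
                // return path below is an assumption for illustration
                DECODING: if (decode_done)
                              state <= data_available ? UPDATING : NO_DATA;
                default:  state <= NO_DATA;
            endcase
        end
    end
endmodule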
3.5.3 LZSS Expander
The LZSS expander module performs the LZSS decompression process. Its operation basically involves a table-lookup procedure, which is very fast and simple. The received LZSS codeword is first analyzed to determine how the codeword should be decompressed. There are basically two types of LZSS codeword, distinguished by a flag bit. A flag bit of 0 indicates that the codeword is an original source symbol; the decompression process therefore simply copies this data into the dictionary and outputs it as decompressed data. The second type of LZSS codeword is the actual encoded data, represented by a flag bit of 1. In this case, the decompression process determines the position and length of the codeword, extracts the appropriate phrase from the dictionary and outputs it as the decompressed data. At the same time, the outputs are again copied into the dictionary. Figure 3.16 shows the hardware architecture of the LZSS expander, which consists of codeword analyzer, dictionary, delay tree and symbol generator modules.
Figure 3.16: Hardware architecture of LZSS expander
The codeword analyzer function is two-fold: to determine when to receive a new LZSS codeword from the LZSS codeword FIFO, and to provide the necessary parameters and control information for the subsequent modules in order to restore the correct source symbols. The required signals are evaluated from the received LZSS codeword. These consist of a flag bit, an explicit source symbol if the flag bit is 0, as well as the LZSS codeword dictionary offset value if the flag bit is 1.
The codeword analyzer is implemented as a finite state machine controller. Figure 3.17 presents an overview of its state transition diagram. The codeword analyzer waits for a new LZSS codeword during the NO_DATA state and analyzes the received codeword during ANALYZING. A codeword that functions as a pointer (1 C_L C_P) is processed during the POINTER_PROCESSING state (Yeem, 2002).
Figure 3.17: State transition diagram of codeword analyzer
The dictionary stores the previously restored symbols, and its output is addressed by the offset parameter of the LZSS codeword, called Offset_Req, issued by the codeword analyzer module. The dictionary design consists of a dual-port memory to store the restored symbols, and a control module that determines how restored symbols are written into, and the correct data read from, the memory. Figure 3.18 shows the architecture of the dictionary design.
Figure 3.18: Hardware architecture of decompression dictionary
Due to the latency in accessing dual-port memory, the dictionary output
would only be available one clock cycle after the offset parameter is issued by the
codeword analyzer. This delay needs to be compensated, so that the flag and explicit
symbol issued by the codeword analyzer enter the symbol generator at the same time
with the dictionary output. This is done by shifting the flag and the explicit symbol
into a delay tree, which effectively pipelines the signals by one clock cycle.
The restored data are generated by the symbol generator module. Its inputs are the explicit symbol and the symbol from the dictionary output, as well as the flag bit that selects which of the two is chosen as the restored data. Effectively, this module implements a 2-to-1 multiplexer. In addition, it also has a control module, because the restored data need to be output at the correct times based on the Enable and Restart control signals. Figure 3.19 shows the hardware architecture of the symbol generator module.
Figure 3.19: Hardware architecture of symbol generator
3.6 Design of Decompression Interface
The hardware architecture of the decompression interface block is shown in
Figure 3.20.
It consists of FIFO buffers to provide the infrastructure for data
transfers between the decompression unit and external host systems, as well as
control modules that ensure proper operations of the interfacing protocols.
Compared to the compression interface block, only three FIFO buffers are needed in
the decompression side. The Poll_Amount_In information is no longer required
since the decompression hardware is designed to determine the correct amount of
source symbols to be restored based on the compressed data.
Figure 3.20: Block diagram of the decompression interface
The input controller determines the availability of decompression hardware to
process new series of compressed input packets from external host systems. This is
done by checking the amount of data in the input data interface FIFO. If the data
amount is greater than the maximum amount decompression hardware can handle,
the input controller issues In_Readyn = ‘1’, which means the decompression
hardware is currently busy and the external host should stop sending new data to be
decompressed, and vice-versa.
On the other hand, the output controller performs a similar function to its counterpart in the compression interface block, where both modules compute the size of their corresponding output data, i.e. Poll_Amount_Out. However, the decompression interface output controller performs an additional operation of packing the restored data from the decompression hardware into fixed-length output data packets and writing them into the Interface_Data_Out FIFO buffer. The data packing mechanism is similar to the one used by the data packer of the compression processor, with the difference that some of the output bits are omitted when the end marker for the current restored data stream is detected.
Design details and Verilog source codes of the decompression processor core
and all its sub-modules are available in Appendix D.
3.7 Summary
Hardware designs of both the compression and decompression cores, as well as their respective interfacing modules, have been described. The cores are designed based on a reconfigurable hardware approach, where the specified parameters determine the hardware tradeoffs in terms of resources, speed and compression saving. In the next chapter, details of the proposed design enhancements are explained.
CHAPTER 4
DESIGN MODIFICATIONS & HARDWARE ENHANCEMENTS
Details of the proposed modifications and enhancements of the compression
and decompression processor core design are presented in this chapter. It starts by
discussing the issue of hardware portability, followed by detailed descriptions of the
generic dual port memory and first-in-first-out buffer design. The decompression
core design bug and its associated hardware patch are described as well.
4.1 Hardware Portability Issue
Initially, the proprietary high-speed data compression and decompression
processor core designs were targeted to an ALTERA FLEX10KE programmable
logic device, which forms part of a hardware evaluation system to verify and
evaluate the design functionality and performance in real-time. Naturally, the design
uses several technology-dependent modules of the target programmable logic device
to efficiently utilize its hardware resources.
In particular, the compression and decompression processor core designs use two types of ALTERA Library of Parameterized Modules (LPM) embedded IP memories: the first-in-first-out (FIFO) buffer and the dual port memory. The FIFOs are used as storage elements for transferring data between both processor cores and the external host system, while the dual port memory is used to build the dictionary structure for the LZSS decompression core.
Designs that make use of technology-dependent memory macros or cells are incompatible with alternative target technologies. As a result, these designs cannot be reused in a different technology, even if the logic surrounding the memories can be easily retargeted. At a minimum, retargeting the design to alternative technologies requires some design modifications. Normally, these modifications involve replacing the IP memories with equivalent modules of the device or technology of interest. In cases where equivalent hardware is not available, a core redesign is necessary, which might be very expensive in terms of cost and time-to-market, depending on the complexity of the design. However, avoiding these technology-dependent cores altogether, especially in a reusable design, is not always possible either. In the special case of embedded memories, a technology-independent structure based on common register cells is too expensive to use in terms of chip area and production cost.
To avoid hardware redesign when porting to alternative implementation technologies, without abandoning the idea of reusing technology-dependent embedded memories, this project proposes a conditional compilation approach for instantiating embedded memories in the compression and decompression processor core designs. The idea is to provide an implementation alternative for the design, where the designers can choose between generic or technology-dependent memory macros based on the available resources and constraints, for the best compromise between cost and time-to-market. Generally, the proposed design approach is illustrated by the pseudo-code below:
`if (vendor IP memories are available)
Instantiate technology-dependent memory models
`else
Instantiate generic memory models
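In Verilog, this selection is naturally expressed with compiler directives. The fragment below is illustrative only: the macro name USE_VENDOR_IP, the generic_dpram wrapper and its port names are assumptions, and the vendor branch merely indicates where an ALTERA LPM instantiation (with the vendor's own parameters and ports) would go.

// Illustrative only: conditional compilation between a vendor IP memory
// and the generic RTL model of Section 4.2.
`ifdef USE_VENDOR_IP
    // Technology-dependent branch: instantiate the ALTERA LPM dual port
    // memory here, following the vendor's parameter and port documentation.
    // lpm_ram_dp dict_mem ( ... vendor-specific connections ... );
`else
    // Technology-independent branch: instantiate the generic RTL model,
    // which synthesis tools map onto whatever resources are available.
    generic_dpram dict_mem (
        .wr_clk  (clk),
        .rd_clk  (clk),
        .wr_enb  (wr_enb),
        .rd_enb  (rd_enb),
        .wr_addr (wr_addr),
        .rd_addr (rd_addr),
        .datain  (datain),
        .dataout (dataout)
    );
`endif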
The generic memories are Verilog RTL models based on a common register structure that can be inferred and optimized by synthesis tools, providing another level of flexibility to the design. Using this approach, designers can specify their choice of using embedded IP memories to take advantage of the efficient resource utilization, as well as lower development costs, because the design is proven on the target device. In cases where designers do not have the luxury of using IP cores, the generic RTL memory models can be instantiated instead, but at the expense of area and production costs.
4.2 Generic Dual Port Memory Design
Dual-port memory is a family of memory devices with two sets of data and address busses. By connecting two processors to a dual port memory device, they can communicate through the shared memory. This method is well known for interconnecting inhomogeneous processors, e.g. a graphics controller and a main processor. From a hardware point of view, dual-port memory has many attractive advantages. Firstly, it does not require any special VLSI design, and secondly, it is sufficient to use a single chip for each side of the connection, making the design very compact and simple. Major application areas for dual port memories are cache memories, first-in-first-out buffers, interface registers, register files and video memories.
4.2.1 Design Specifications
The dual port memory used in the original design is a synchronous two-clock memory, with separate clocks for read and write, as well as separate read and write addresses and enable signals that allow simultaneous reading and writing. In addition, the design utilizes one side of the memory as a write-only port, while the other side is a read-only port. This means the dual port memory has only one data input to the write port and one data output from the read port. Another design requirement is that the data size and address depth of the dual port memory must be customizable, because the design intent is to have configurable and reusable hardware.
4.2.2 Hardware Architecture
Based on the above specifications, the dual port memory design is based on parameterized Verilog HDL. This means the widths of the address and data lines can be easily modified to generate memory models with the required address space and data size, thus ensuring the design configurability requirement is met. The functional behavior of the dual port memory is modeled as a two-dimensional array using a register file structure, where any word can be accessed using an index into the array. Using this structure, data are written into the memory by using the write address to select the appropriate index of the array, and a similar mechanism is used for reading data from the memory. Furthermore, writing and reading of the dual port memory are synchronous to the write and read clocks, respectively.
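As an illustrative sketch of such a behavioral model (the port names follow Figure 4.1, but the parameter names and coding details are assumptions; the actual source is in Appendix E.1):

// Illustrative parameterized dual port memory model (register-file style).
// Port names follow Figure 4.1; the exact Appendix E.1 code may differ.
module generic_dpram (wr_clk, wr_enb, wr_addr, datain,
                      rd_clk, rd_enb, rd_addr, dataout);
    parameter DataWIDTH = 8;                  // word size
    parameter AddrWIDTH = 4;                  // address width (depth = 2^AddrWIDTH)

    input                  wr_clk, rd_clk;
    input                  wr_enb, rd_enb;
    input  [AddrWIDTH-1:0] wr_addr, rd_addr;
    input  [DataWIDTH-1:0] datain;
    output [DataWIDTH-1:0] dataout;
    reg    [DataWIDTH-1:0] dataout;

    // two-dimensional register array inferred as memory by synthesis tools
    reg [DataWIDTH-1:0] mem [0:(1<<AddrWIDTH)-1];

    // write port: synchronous to the write clock
    always @(posedge wr_clk)
        if (wr_enb)
            mem[wr_addr] <= datain;

    // read port: synchronous to the read clock, one cycle of latency
    always @(posedge rd_clk)
        if (rd_enb)
            dataout <= mem[rd_addr];
endmodule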
Figure 4.1 shows the block diagram of the dual port memory hardware
architecture, which captures the essence of the design specifications. Detailed design
and Verilog source code of the dual port memory can be found in the Appendix
section E.1.
Figure 4.1: Hardware architecture of the dual port memory
4.3 Generic Synchronous FIFO Design
A synchronous FIFO has a single clock for both read and write operations. Data is written into the next available empty memory location on a rising clock edge when the write enable is high, while data is read out in the order in which it was written by asserting the read enable prior to a rising clock edge. The full control signal indicates that no more empty locations remain, and the empty control signal indicates that no more data resides in the FIFO memory.
4.3.1 Design Specifications
The FIFO used in the original design is a synchronous one-clock FIFO, where data are written and read based on the same clock. This is the simplest FIFO implementation because it avoids the more challenging issues of asynchronous clocking. In addition to the full and empty control signals normally associated with a FIFO, the design requires another control signal that indicates how many data words the FIFO is currently storing. This information is needed to inform external host systems about the processor core's readiness to receive new data inputs. Furthermore, the FIFO design must be made configurable as well, to fulfill the overall design intent of reconfigurable hardware.
4.3.2 Hardware Architecture
Based on the above specifications, the FIFO design is also based on parameterized Verilog HDL, to ensure the design configurability requirement is met. Both writing and reading are synchronous, which means the respective enable signals and data must meet the setup time with respect to the clock. The FIFO uses the generic dual port memory described in the previous section as its storage element.
Other hardware logic in the FIFO design includes two counters for generating the write and read addresses each time data is written to or read from the FIFO. Effectively, these addresses point to the next memory location where data will be written if the write enable is asserted, or read from if the read enable is asserted. In addition, another counter is used to keep track of the data count inside the dual port memory; this information, called USEDW, is used to indicate the status of the FIFO to external logic. The USEDW counter increments whenever a write occurs while the FIFO is not full and decrements whenever a read occurs while the FIFO is not empty. If both a write and a read occur at the same time, the counter does not change its value, because the data being written is offset by the data being read, so there is no change in the data count inside the memory. If a write occurs while the FIFO is full, or a read occurs while the FIFO is empty, an overflow or underflow condition occurs. In the overflow case, the data count cannot exceed the memory storage space, therefore the USEDW counter must not increment when a write happens while the FIFO is full. Likewise, a negative data count does not represent a valid state of the FIFO, therefore the USEDW counter must not decrement when a read happens while the FIFO is empty in the underflow case.
Besides being used to indicate the FIFO status to external logic, USEDW is also used to generate the corresponding full and empty control signals of the FIFO. Whenever USEDW equals the maximum address count of the memory, full is asserted. Similarly, empty is asserted whenever USEDW is equal to zero. The full and empty control signals must be interpreted by external logic to prevent a write operation when the FIFO is full, or a read operation when the FIFO is empty; otherwise, an overflow or underflow condition occurs and the data transferred are lost.
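As an illustrative sketch of this control logic (signal names follow Figure 4.2 where possible; widths and coding details are assumptions, and the actual source is in Appendix E.2):

// Illustrative sketch of the synchronous FIFO built around the generic
// dual port memory. Names follow Figure 4.2; details are assumptions.
module generic_sync_fifo (clock, reset, wr_enb, wr_data,
                          rd_enb, rd_data, full, empty, usedw);
    parameter DataWIDTH = 8;
    parameter AddrWIDTH = 4;

    input                  clock, reset;
    input                  wr_enb, rd_enb;
    input  [DataWIDTH-1:0] wr_data;
    output [DataWIDTH-1:0] rd_data;
    output                 full, empty;
    output [AddrWIDTH:0]   usedw;

    reg [AddrWIDTH-1:0] wr_addr, rd_addr;   // next write/read locations
    reg [AddrWIDTH:0]   usedw;              // words currently stored

    assign full  = (usedw == (1 << AddrWIDTH));
    assign empty = (usedw == 0);

    // guard against overflow and underflow
    wire do_write = wr_enb && !full;
    wire do_read  = rd_enb && !empty;

    always @(posedge clock) begin
        if (reset) begin
            wr_addr <= 0;  rd_addr <= 0;  usedw <= 0;
        end else begin
            if (do_write) wr_addr <= wr_addr + 1'b1;
            if (do_read)  rd_addr <= rd_addr + 1'b1;
            // a simultaneous read and write leaves the count unchanged
            case ({do_write, do_read})
                2'b10:   usedw <= usedw + 1'b1;
                2'b01:   usedw <= usedw - 1'b1;
                default: usedw <= usedw;
            endcase
        end
    end

    // storage: the generic dual port memory of Section 4.2, single clock
    generic_dpram #(DataWIDTH, AddrWIDTH) mem (
        .wr_clk(clock), .wr_enb(do_write), .wr_addr(wr_addr), .datain(wr_data),
        .rd_clk(clock), .rd_enb(do_read),  .rd_addr(rd_addr), .dataout(rd_data)
    );
endmodule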
Figure 4.2 shows the block diagram of the synchronous FIFO hardware
architecture, which captures the essence of the design specifications. Detailed design
and Verilog source code of the synchronous FIFO can be found in the Appendix
section E.2.
Figure 4.2: Hardware architecture of the synchronous FIFO
4.4 Hardware Bug of Decompression Processor Core Design
The LZSS decompression process is achieved through a table-lookup procedure on a dictionary structure. During each iteration of the decompression process, the LZSS codeword is first analyzed, by checking its flag bit, to determine how it should be restored. If the codeword represents an explicit symbol, the source symbol is restored directly from the codeword. Otherwise, the source symbol is restored from the dictionary based on the offset and length of the analyzed LZSS codeword. At the end of each iteration, the decompression dictionary is updated by shifting in the latest restored symbols. This whole process is performed in one clock cycle, which means that reading the dictionary to output the restored source symbol and writing the restored symbol to update the dictionary happen simultaneously. This requires the dictionary to be constructed using a dual port memory, with separate read and write clocks, addresses and enable signals.
In the original design, reading of the decompression dictionary is done all the time, with the read address calculated by subtracting the offset of the LZSS codeword from the write address, i.e. read_address = write_address – LZSS_offset. This means that while the restored symbol for the current data is being written, a memory read for outputting the next restored symbol is occurring at the same time.
There are two situations in which the dual port memory is utilized in the decompression processor core design. The first situation involves simultaneous reading and writing of different memory locations. This mode of operation occurs when the LZSS codeword offset is non-zero, which is normally the case when the source data do not contain sufficiently high redundancy. In this mode, the behaviour of the dual port memory is always predictable.
Figure 4.3 shows the waveform of the dual port memory behaviour in the first mode of operation. In the first cycle, denoted by (a) in Figure 4.3, both read and write operations of the dual-port memory are taking place, as indicated by the rd_enable = '1' and wr_enable = '1' control signals. However, different memory locations are being accessed by these read and write operations, as indicated by rd_address = 2 while wr_address = 3. In the next clock cycle, denoted by (b) in Figure 4.3, the output of the dual-port memory due to the earlier read operation is predictable and valid, as shown in the figure, i.e. data_out = 16'h6233.
Figure 4.3: Behavior of dual port memory for simultaneous read and write accesses
on different memory locations.
When compressing highly redundant source data, the generated LZSS codeword normally has an offset value of zero and a length value equal to the maximum length specified. In this case, the read address of the decompression dictionary equals the write address. Therefore, the second operating mode of the dual-port memory involves simultaneous reading and writing of the same memory location. In general, when a read occurs on the same address while a write operation is in progress (write not completed), there is a possibility of an unknown output from the memory. To prevent this potential contention, the read operation should not start until the write operation is completed. For this to happen, the read operation should not be activated for a minimum amount of time specified by the maximum write-cycle time of the memory.
Figure 4.4 shows the waveform of the dual port memory behaviour in this second operating mode of the original decompression processor core design. In the first cycle, denoted by (a) in Figure 4.4, a simultaneous read occurs while a memory write is in operation, as indicated by the rd_enable = '1' and wr_enable = '1' control signals. Since the offset of the LZSS codeword is zero, as indicated by Offset_Req = 0 in Figure 4.4, both the read and write addresses of the dual-port memory are the same, as indicated by rd_address = 2 and wr_address = 2. When this happens, the output of the dual-port memory in the next clock cycle, denoted by (b) in Figure 4.4, due to the read operation is undefined, i.e. data_out = 'x'. In actual hardware, the undefined memory output can take on any value, which almost always results in an incorrect restored symbol output from the decompression processor core.
Figure 4.4: Behavior of dual port memory for simultaneous read and write accesses
on same memory locations.
The requirement of preventing the read operation for a minimum amount of time specified by the maximum write-cycle time presents a challenge when retargeting the dual-port memory design to different hardware implementation technologies. This is because dual port memories in different technologies handle simultaneous read and write operations slightly differently, and the maximum write-cycle time differs from memory to memory. For example, the decompression processor core may give correct output using a dual port memory targeted at technology A, but fail to function correctly when using the dual port memory of technology B.
4.5 Details of Hardware Patch
In view of the issue described in the previous section, this research project proposes a design enhancement that disallows simultaneous read and write on the same memory location, so that contention cannot occur and the output of the dual port memory is always predictable across different technologies. This enhancement requires modifications to the Expander_Codeword_Analyzer and Expander_Dictionary sub-modules of the decompression processor core design.
The function of the Expander_Codeword_Analyzer sub-module is two-fold: it determines when to receive a new LZSS codeword from the interface FIFO buffer, and it determines how to retrieve the correct source symbol, either explicitly from the codeword or from the LZSS decompression dictionary. It is the latter part that is modified to provide separate read and write access times to the dictionary memory. This sub-module is implemented as a finite state machine controller and its simplified state diagram is shown in Figure 4.5 (Yeem, 2002).
Figure 4.5: Simplified state transition diagram of the Expander_Codeword_Analyzer
sub-module.
In the original design, a memory write occurs when the state machine is in the Analyzing and Pointer_Processing states. Based on the above state diagram, this means the controller issues a write request most of the time while an LZSS codeword is available, as indicated by LZSS_Codeword_Empty = '0'. Because reading is done all the time by the Expander_Dictionary sub-module, simultaneous reading and writing of the dual port memory occurs. The first design modification prevents this from happening by introducing another state, called Read_Memory, into the state machine of the Expander_Codeword_Analyzer sub-module. Essentially, this adds one clock period between two successive write access requests, and the Expander_Dictionary sub-module uses this additional free period to perform a memory read. Effectively, reading and writing of the dual-port memory now occur at different times. Figure 4.6 shows the updated state diagram of the improved design.
Figure 4.6: Updated state transition diagram of the improved
Expander_Codeword_Analyzer sub-module design.
Another design modification involves changing the way the Expander_Dictionary sub-module requests memory read operations. In this respect, two modifications are made. The first is to change the memory read request from reading all the time to reading only after a memory write occurs; this is done simply by delaying the write request by one clock cycle. Because the read is now done in the clock cycle after the write operation, while the write address has meanwhile been incremented by one, the dual port memory would produce an incorrect restored data output, since the read memory location is now offset by one from the intended address. Therefore, the second modification to this sub-module is to change the read address generation logic to take this offset into account, i.e. read_address = write_address – LZSS_offset – 1.
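As an illustrative sketch of the patch described above (signal names are assumptions; the actual modified code is in Appendix D), the delayed read request and the corrected read address generation could look like this:

// Illustrative sketch only of the patched read-request logic in the
// Expander_Dictionary sub-module (hypothetical names).
module expander_dict_patch (clk, reset, wr_enb, wr_address, Offset_Req,
                            rd_enb, rd_address);
    parameter AddrWIDTH = 8;

    input                  clk, reset;
    input                  wr_enb;
    input  [AddrWIDTH-1:0] wr_address, Offset_Req;
    output                 rd_enb;
    output [AddrWIDTH-1:0] rd_address;
    reg                    rd_enb;

    // the read now occurs one cycle after a write: the delayed write
    // request acts as the read request, keeping the write cycle read-free
    always @(posedge clk) begin
        if (reset) rd_enb <= 1'b0;
        else       rd_enb <= wr_enb;
    end

    // original: rd_address = wr_address - Offset_Req
    // patched:  subtract one more to compensate for the write address
    //           having already advanced when the delayed read occurs
    assign rd_address = wr_address - Offset_Req - 1'b1;
endmodule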
Figure 4.7 shows the waveform of the dual port memory behavior after the design modifications have been implemented. In the first clock cycle, denoted by (a) in Figure 4.7, only the memory write operation is taking place, as indicated by rd_enable = '0' and wr_enable = '1', respectively. The read operation now occurs only one clock cycle after the write operation takes place, denoted by (b) in Figure 4.7. Furthermore, both conditions show that the read and write addresses (denoted by rd_address and wr_address, respectively) differ from one another even though the value of the LZSS offset (denoted by Offset_Req) is zero due to high redundancy in the source data. This confirms that simultaneous reading and writing of the same memory location, when the decompressed data has sufficiently high redundant content, has been eliminated. It will be shown later, in the design simulation results of Chapter 6, that the decompression processor core hardware bug has been resolved by these design modifications; the proposed design improvement is therefore successfully achieved. Design details and Verilog source codes for the improved Expander_Codeword_Analyzer and Expander_Dictionary sub-modules are available in Appendix D.
Figure 4.7: Behavior of the decompression processor core dual port memory after
implementation of design improvement.
4.6 Summary
This chapter discusses the issue of design portability to alternative
technologies, as well as details of technology-independent FIFO and dual port
memory designs. In addition, details of the decompression core hardware bug and
the design modifications done to solve it are also explained. In the next chapter,
development of the compression system hardware evaluation platform is described.
CHAPTER 5
DATA COMPRESSION SYSTEM HARDWARE DEVELOPMENT
In this chapter, development work of the embedded data compression system
hardware evaluation platform is described.
An overview of the overall system
architecture is presented, followed by the procedures needed to build and integrate
the system using a CAD tool, as well as the necessary steps to port the design onto the target programmable logic device of choice. Next, design details of the memory-mapped bus slave interface module are explained. Finally, development of the test
firmware to be used with the hardware evaluation platform is described.
5.1 Data Compression System Hardware Architecture Overview
Figure 5.1 shows the overall architecture of the data compression system
design. It is based on ALTERA NIOS processor system designs, and it utilizes the
proprietary Avalon memory-mapped system bus architecture. The whole system is
implemented onto a single programmable chip; thus this approach is also known as
System-on-a-Chip (SoC) design. The target hardware prototyping platform is the
Excalibur hardware development kit.
Figure 5.1: Overall architecture of the data compression system on a chip design
The hardware development board comes with several external chips and peripheral devices as standard components. Therefore, in order for the system processor to have full access to these components and to build a fully functional system, their respective control modules must be integrated into the system design. These components include the boot ROM for storing the necessary boot-up instructions, flash and SRAM memory controllers for communicating with the respective on-board memory chips, UART device controllers for enabling communication with external host systems and for providing the debugging infrastructure, an LCD controller for interfacing with the external LCD monitor, as well as timer units for system scheduling purposes.
Furthermore, the hardware development kit also has several switches, LEDs and push-buttons. These are simple I/O devices that can be used to provide external control information to the system, such as reset. The LEDs can also be used to provide visual indication of certain system statuses or operations, so they are quite useful for hardware engineers and system developers. Therefore, general-purpose parallel I/O controllers are also needed as part of the overall system to enable access to these I/O devices.
To complete the data compression system development, both the compression and decompression processor cores are integrated as slave peripherals onto the system bus. Since these are custom hardware, the system processor needs to "understand" the communication protocols of both processor cores so that data transfers can be properly exchanged between them. This is achieved through the Avalon bus slave interfaces, which perform protocol translation between the system processor and both slave peripherals.
5.2 System on Programmable Chip Development
As mentioned in the previous section, the data compression system consists
of several hardware components such as the processor, both compression and
decompression processor cores, memories, timers, communication ports as well as
input/output interfaces.
These hardware components are interconnected into a
complete system by means of an interconnection network provided by the Avalon
memory-mapped bus architecture.
Since all components of the system can be
implemented on the programmable logic device by using hardware description
language programming, a knowledgeable user could write such codes to implement
the whole system manually (Synplicity Inc, 2005). However, this would be a very
tedious and time-consuming task. An alternative is to use a computer-aided design (CAD) software tool to implement the desired system, by simply choosing the required components and specifying the parameters needed to make each component fit the overall system requirements.
In this project, the latter approach is used to develop the data compression
system using a CAD software tool called ALTERA SOPC Builder. The tool allows
defining hardware structure and generating system design easily through the use of a
graphical user interface (GUI). The GUI allows adding hardware components to the
system, configuring them and specifying how they are connected together. Once all
components have been added and the required parameters have been specified, the
SOPC Builder tool generates the system interconnect fabric and outputs the HDL file
of the top-level system module. Then, the system module can be compiled and
synthesized into the target programmable logic device using another CAD tool called
ALTERA Quartus II.
The majority of the hardware components used to build the system are ALTERA intellectual property cores, which are supported as standard library modules by the SOPC Builder tool. Integrating these components is therefore straightforward, since the tool recognizes them. However, custom hardware components, such as the compression and decompression processor cores, must be designed such that the SOPC Builder tool recognizes them. This is achieved through the following design flow:
1) Define and write the HDL file describing the Avalon interface for physical
connection of the custom components.
2) Use SOPC Builder tool to specify the interface and package the custom
hardware as an SOPC Builder component.
3) Instantiate the custom component in the same manner as other SOPC Builder
components. Since the custom hardware is now packaged as an SOPC
Builder component, it can also be reused in any SOPC Builder systems.
In addition to the generated HDL file for the top-level system module, SOPC Builder also generates a header file that defines the address mapping of each peripheral component connected to the embedded NIOS processor. This header file is useful for writing device drivers for the peripherals to be executed by the system processor. Please refer to the "NIOS Hardware Development Tutorial" document (Altera, 2007) for a detailed step-by-step procedure of using the SOPC Builder tool to build a complete NIOS-based system.
In the following two sections, design details of the Avalon memory-mapped
slave interface modules for connecting the compression and decompression
processor cores to the system, as well as development of the system test firmware are
described.
5.3 Avalon Memory-Mapped Slave Interface Design
The data compression system employs the ALTERA proprietary Avalon bus slave mechanism to integrate custom hardware as peripherals onto the system bus. This mechanism uses a typical memory-mapped bus structure, which means the peripherals are treated similarly to standard memory modules. Therefore, to fully access the resources on the peripherals, proper control enables and a unique address must be defined for each peripheral, and the appropriate write or read data must be captured during the correct time intervals.
The Avalon system bus is a very flexible memory-mapped structure, where
different types of peripherals with different protocol requirements can be
accommodated by the interconnect fabric.
In this project, the chosen Avalon specification is the slave interface mechanism using read and write transfers with one fixed wait-state. This specification employs a single pipelined access, which means data are valid one clock cycle after a read or write operation is requested. Based on this mechanism, an interface module is designed to translate the bus protocols into instructions that are understood by the slave peripherals.
Figure 5.2 shows an overview of the chosen Avalon bus slave mechanism for
a read transfer. Based on the figure, the first cycle of the read transfer starts on the
rising edge of the system clock, as denoted by (a). Next, the read enable and address signals asserted by the host system to the slave peripheral are valid, as denoted by (b). The Avalon interconnect fabric then decodes the received address and asserts the chip select signal to the appropriate peripheral, as denoted by (c). The next rising edge of clock marks the end of the first and only wait-state cycle. At this point, the selected slave peripheral captures the required control signal and address information, as denoted by (d). The slave peripheral then presents valid data on the read port during the second cycle, as denoted by (e). The host system then captures
the read data during the next rising edge of clock and the read transfer ends, as
denoted by (f). The next cycle begins here and could be the start of another transfer.
Figure 5.2: Description of Avalon bus slave interface read transfer with one fixed
wait-state mechanism
Figure 5.3 shows an overview of the chosen Avalon bus slave mechanism for
a write transfer. Based on the figure, the first cycle of the write transfer starts on the
rising edge of the system clock, as denoted by (a). Next, write enable, address and
data signals asserted by the host system to the slave peripheral are valid, as denoted
by (b). The Avalon interconnect fabric then decodes the received address and asserts the chip select signal to the appropriate peripheral, as denoted by (c). The next rising edge of clock marks the end of the first and only wait-state cycle. All signals asserted by the host system remain constant, as denoted by (d). The selected slave peripheral captures the control signal and address information asserted by the host on or before the next rising edge of clock, as denoted by (e). The next cycle begins here
and could be the start of another transfer.
Figure 5.3: Description of Avalon bus slave interface write transfer with one fixed
wait-state mechanism
Figure 5.4 shows the overview of the Avalon bus slave interface module
architecture i.e. LZSS_Avalon_Interface. It consists of two sub-modules, which are
LZSS_Interface_Register and LZSS_Interface_Controller.
The function of the
register sub-module is to map the addresses sent by the host system to the necessary
functions to be carried out by the slave peripherals.
These functions include
preparing the appropriate data or control signals to be sent to the slave peripherals, or
obtaining the appropriate data or status information from the slave peripherals.
Meanwhile, the controller sub-module ensures data transfers between the host
processor and both peripherals are properly executed.
Figure 5.4: Overview of the Avalon bus slave interface module
5.3.1
Interface Register Sub-module Design
Table 5.1 describes the address mapping of the LZSS_Interface_Register sub-module.
Using the specified address map, the host can fully control the slave
peripherals to perform the required tasks.
For example, to acquire the slave
peripheral status, the host processor configures the required address based on the
address map, which in this case is Address[2:0] = 3’b011. The interface register
sub-module then samples and decodes the specified address. Based on the address
map, it knows the host processor is acquiring the slave peripheral status. As such,
the interface register sub-module captures the slave peripheral status information and presents it on the read data port of the system bus. The host processor then simply samples the read data port during the following clock cycle. The status
information of the slave peripheral is represented by the four least significant bits of
the read data signal; therefore the host processor needs to perform the proper bit
masking to decode the required status bits. Details of the slave peripheral status bit
mapping will be described later in this section.
Table 5.1: Address mapping of the slave peripheral implemented by the interface
register sub-module
Address    Operation
3'b000     LZSS_Handshaking_Control <= Writedata[3:0]; Reset_LZSS <= Writedata[4]
3'b001     LZSS_Interface_Data_In <= Writedata[31:0]
3'b010     LZSS_Poll_Amount_In <= Writedata[12:0]
3'b011     Readdata[3:0] <= LZSS_Status
3'b100     Readdata[31:0] <= LZSS_Interface_Data_Out
3'b101     Readdata[12:0] <= LZSS_Poll_Amount_Out
Similarly, to instruct the slave peripheral to perform its tasks, the host
prepares the required control instruction on the write data port and then configures the specified address. The interface register sub-module then samples the write data and decodes the required control. For example, assume the host processor wants to perform a peripheral reset. The host first prepares the corresponding control instruction for this task and places it on the write data port of the bus. Next,
the host processor configures the required address to access the control register of the
slave peripheral, which in this case is Address[2:0] = 3’b000, based on the address
map. The interface register sub-module then samples this address to determine the
peripheral task requested by the host processor.
Since the host processor is
requesting a peripheral reset, the interface register sub-module captures the required
control bits of the write data and decodes the appropriate control signals to be sent to
the slave peripheral to perform reset i.e. Reset_LZSS = ‘1’. The control information
for the slave peripheral is represented by the five least significant bits of the write
data signal; therefore the host processor needs to perform the proper bit masking to
issue the required control bits. Details of the slave peripheral control bit mapping
will be described later in this section.
Based on the description in the above examples for acquiring status and
controlling the slave peripheral, similar read and write mechanisms are also used for
transferring data between the host and the slave. The next section describes the detailed operation of data transfer requests between the host and the slave.
There are five control signals generated by the interface register sub-module.
One of them is Reset_LZSS which is sent directly to the slave peripheral for
performing device reset as mentioned previously. The other four signals are sent to
the interface controller sub-module, which will generate the required slave peripheral
control signals based on the specified system bus handshaking protocols. Figure 5.5 describes the bit mapping structure of the interface register control signals, which need to be properly masked by the host processor when issued on the write data port of the system bus.
Figure 5.5: Bit mapping structure of interface register control signals
The status signal of the slave peripheral, i.e. LZSS_Status, is 4 bits wide. The least significant bit of the status signal is used to convey information about the peripheral availability, while the following two bits are used to convey the availability of certain information processed by the slave peripheral. The most significant bit of LZSS_Status conveys the status of the data transfer task
requested by the host processor. Figure 5.6 describes the bit mapping structure of the
slave peripheral status signals.
Figure 5.6: Bit mapping structure of LZSS_Status signal
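To illustrate the bit masking described above, a minimal C register-map sketch for the slave peripheral is given below. It assumes the address map of Table 5.1 and the general layouts of Figures 5.5 and 5.6; the base address placeholder, the word-indexed addressing, and the exact bit positions are assumptions made for illustration and are not taken from the original firmware.

#include <stdint.h>

/* Placeholder base address; in practice this would come from the header file
   that SOPC Builder generates for the NIOS processor (assumption). */
#define LZSS_BASE            ((volatile uint32_t *)0x00000000u)

/* Word offsets, following the address map of Table 5.1 (Address[2:0]). */
#define LZSS_REG_CONTROL     0u   /* 3'b000: handshaking control + reset     */
#define LZSS_REG_DATA_IN     1u   /* 3'b001: LZSS_Interface_Data_In          */
#define LZSS_REG_POLL_IN     2u   /* 3'b010: LZSS_Poll_Amount_In             */
#define LZSS_REG_STATUS      3u   /* 3'b011: LZSS_Status                     */
#define LZSS_REG_DATA_OUT    4u   /* 3'b100: LZSS_Interface_Data_Out         */
#define LZSS_REG_POLL_OUT    5u   /* 3'b101: LZSS_Poll_Amount_Out            */

/* Control bits in Writedata (Figure 5.5); individual positions are assumed. */
#define LZSS_CTRL_RESET      (1u << 4)   /* Reset_LZSS                       */
#define LZSS_CTRL_WR_DATA    (1u << 3)   /* write-data handshake             */
#define LZSS_CTRL_WR_POLL    (1u << 2)   /* write-poll-amount handshake      */
#define LZSS_CTRL_RD_DATA    (1u << 1)   /* read-data handshake              */
#define LZSS_CTRL_RD_POLL    (1u << 0)   /* read-poll-amount handshake       */

/* Status bits in LZSS_Status (Figure 5.6); individual positions are assumed. */
#define LZSS_STAT_IN_READYN  (1u << 0)   /* peripheral availability          */
#define LZSS_STAT_DATA_AVAIL (1u << 1)   /* processed data availability      */
#define LZSS_STAT_POLL_AVAIL (1u << 2)   /* processed size availability      */
#define LZSS_STAT_XFER_DONE  (1u << 3)   /* requested transfer completed     */

/* Read the 4-bit status with the bit masking described in the text. */
static inline uint32_t lzss_status(void)
{
    return LZSS_BASE[LZSS_REG_STATUS] & 0xFu;
}

/* Reset the slave peripheral by pulsing Reset_LZSS through the control register. */
static inline void lzss_reset(void)
{
    LZSS_BASE[LZSS_REG_CONTROL] = LZSS_CTRL_RESET;
    LZSS_BASE[LZSS_REG_CONTROL] = 0u;
}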
5.3.2
Interface Controller Sub-module Design
The function of the LZSS_Interface_Controller sub-module is to ensure all
data transfer requests from the host processor to the slave peripheral are properly
executed. By doing this, the data sent to or received from the slave peripheral is
guaranteed to be valid before new transfer requests are issued. The design of this
sub-module is modelled after a finite state machine controller and its overall state
diagram is shown in Figure 5.7.
[State diagram omitted: reset places the controller in the idle PowerUpRst state; the data_fetch, poll_amount_in, poll_amount_out and result_fetch handshaking signals move it to the Ld_Data, Ld_Poll_Amount_In, Ld_Poll_Amount_Out and Ld_Result states respectively, each of which is followed by the Done state, returning to PowerUpRst once the handshaking signals are cleared.]
Figure 5.7: State machine diagram of the LZSS_Interface_Controller sub-module
There are basically four types of data transfers that can take place between
the host processor and the slave peripheral. Two of these transfers involve writing
information to be processed by the slave peripheral, while the remaining two types
involve reading information processed by the slave peripheral. All transfer requests
follow similar processing steps implemented by the LZSS_Interface_Controller sub-module. The differences lie only in the type of information transferred and in the control signal that must be asserted to execute the required transfer request.
For a write data transfer request, the host processor first prepares the data to
be written and places it on the write data port of the system bus, along with the
required address. It then issues the corresponding write enable signal accompanied
by the correct address, to the interface module in order to trigger the slave peripheral
to capture the data. Finally, the host processor clears the write enable signal as
indication that the write transfer request is completed. This is accomplished by
issuing a write clear signal accompanied by the correct address to the interface
module. Similarly, for a read data transfer request, the host processor issues the read
enable signal accompanied by the correct address to trigger the slave peripheral to
send some data. Once the data has been read, the host processor clears the read
transfer request by issuing a read clear signal accompanied by the correct address. In
both types, new transfers should not be initiated until the slave peripheral has
finished processing the current transfer; therefore the host processor needs to poll the status of the peripheral after each transfer has been requested. Figures 5.8 and 5.9
describe the steps taken by the host processor to initiate a write and read data
transfer, respectively.
[Flowchart omitted: to start a new write data transfer request, prepare the data on the write data port at Address[2:0] = '001' or '010'; issue the write enable (Writedata[3] = '1' or Writedata[2] = '1' with Address[2:0] = '000'); poll until the write data transfer is done; then issue the write clear (Writedata[3] = '0' or Writedata[2] = '0' with Address[2:0] = '000').]
Figure 5.8: Write data transfer request by host system
[Flowchart omitted: to start a new read data transfer request, issue the read enable (Writedata[1] = '1' or Writedata[0] = '1' with Address[2:0] = '000'); poll until the slave has prepared the read data; read the data; then issue the read clear (Writedata[1] = '0' or Writedata[0] = '0' with Address[2:0] = '000') and poll until the read data transfer is done.]
Figure 5.9: Read data transfer request by host system
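Building on the register-map sketch in Section 5.3.1, the write and read transfer sequences of Figures 5.8 and 5.9 can be expressed as the following hedged C helpers. The offsets and bits are repeated from that sketch, and the use of the transfer-status bit for the "done" polling is an assumption; the selection between the data and poll-amount transfers is made through the enable bit passed in.

#include <stdint.h>

/* Repeated from the register-map sketch of Section 5.3.1 (assumed values). */
#define LZSS_BASE            ((volatile uint32_t *)0x00000000u)
#define LZSS_REG_CONTROL     0u
#define LZSS_REG_STATUS      3u
#define LZSS_STAT_XFER_DONE  (1u << 3)

/* Write transfer (Figure 5.8): place the data, assert the write enable bit,
   poll until the slave reports the transfer done, then issue the write clear. */
static void lzss_write(unsigned data_reg, uint32_t value, uint32_t enable_bit)
{
    LZSS_BASE[data_reg] = value;
    LZSS_BASE[LZSS_REG_CONTROL] = enable_bit;
    while (!(LZSS_BASE[LZSS_REG_STATUS] & LZSS_STAT_XFER_DONE))
        ;                                   /* wait for the slave to capture */
    LZSS_BASE[LZSS_REG_CONTROL] = 0u;       /* write clear                   */
}

/* Read transfer (Figure 5.9): assert the read enable bit, poll until the
   slave has prepared the data, capture it, then issue the read clear. */
static uint32_t lzss_read(unsigned data_reg, uint32_t enable_bit)
{
    uint32_t value;
    LZSS_BASE[LZSS_REG_CONTROL] = enable_bit;
    while (!(LZSS_BASE[LZSS_REG_STATUS] & LZSS_STAT_XFER_DONE))
        ;                                   /* wait for valid read data      */
    value = LZSS_BASE[data_reg];
    LZSS_BASE[LZSS_REG_CONTROL] = 0u;       /* read clear                    */
    return value;
}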
As discussed in the previous section, the interface register decodes the
address issued by the host processor and asserts the corresponding handshaking
control signal to the interface controller sub-module. Depending on the received
handshaking control signal, the controller state machine moves to the appropriate
state. For example, if the received handshaking signal is Poll_Amount_In, the state
machine moves from PowerUpRst to Ld_Poll_Amount_In state, and so on. While in
the new state, the interface controller asserts the required control signal to the slave
peripheral and immediately moves to the Done state one clock cycle later, effectively
providing a one-pulse enable signal that triggers the slave peripheral to perform the
required tasks. In the Done state, the interface controller waits until the host processor
clears the current data transfer, upon which it moves to the original idle state. Based
on this mechanism, data transfers between the host and slave are properly handled.
Design details and Verilog source codes for the compression and decompression processor core Avalon slave interfaces are available in Appendices F and G, respectively.
5.4
Firmware C Programming
Both compression and decompression processor cores require specific
handshaking protocols to be followed by the host system to ensure correct device
operation. To perform compression, the required procedure is as follows (Yeem, 2002); a C sketch of this loop is given after the list:
1) Check the status of compression processor core i.e. In_Readyn. If In_Readyn
= 1, go to Step 4, else go to Step 2.
2) Send a series of source data to be compressed i.e. Interface_Data_In.
3) Send the size of input source data to be compressed in bits i.e.
Poll_Amount_In. The Poll_Amount_In value should not exceed PollAmount × 2^InterfaceWIDTH.
4) Check the availability of compressed data from the core i.e. Poll_Amount_Out_Empty. If Poll_Amount_Out_Empty = 1, go to Step 1, else go to Step 5.
5) Check the value of the compressed data size in bits i.e. Poll_Amount_Out.
6) Collect the compressed data from the core i.e. Interface_Data_Out based on
the amount specified by the Poll_Amount_Out value. Go to Step 3.
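A minimal C sketch of this compression loop is given below. It assumes 32-bit interface words (2^InterfaceWIDTH = 32) and PollAmount = 64 as in the simulation parameters of Chapter 6, and hypothetical core_* wrapper functions around the register accesses of Section 5.3; the end-marker handling of Figure 5.10 is omitted for brevity, so this is an illustration rather than the original firmware.

#include <stdint.h>

/* Hypothetical wrappers around the register accesses of Section 5.3. */
extern int      core_in_readyn(void);              /* 1 = core not ready for input */
extern void     core_write_data(uint32_t word);    /* write Interface_Data_In      */
extern void     core_write_poll_in(unsigned bits); /* write Poll_Amount_In         */
extern int      core_poll_out_empty(void);         /* 1 = no compressed output yet */
extern unsigned core_read_poll_out(void);          /* read Poll_Amount_Out (bits)  */
extern uint32_t core_read_data(void);              /* read Interface_Data_Out      */

void compress_buffer(const uint32_t *src, unsigned src_words,
                     uint32_t *dst, unsigned *dst_bits)
{
    unsigned sent = 0, out_words = 0;
    *dst_bits = 0;

    while (sent < src_words || !core_poll_out_empty()) {

        /* Steps 1-3: when the core is ready, feed one block of source words
           and then its size in bits (at most PollAmount * 2^InterfaceWIDTH
           = 2048 bits = 64 words per block).                                */
        if (!core_in_readyn() && sent < src_words) {
            unsigned block = src_words - sent;
            if (block > 64)
                block = 64;
            for (unsigned i = 0; i < block; i++)
                core_write_data(src[sent++]);
            core_write_poll_in(block * 32);
        }

        /* Steps 4-6: whenever compressed output is available, read its size
           and then the corresponding number of interface words.             */
        if (!core_poll_out_empty()) {
            unsigned bits  = core_read_poll_out();
            unsigned words = (bits + 31) / 32;
            while (words--)
                dst[out_words++] = core_read_data();
            *dst_bits += bits;
        }
    }
}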
Similarly, the following procedure must be followed to perform data decompression (Yeem, 2002); a corresponding C sketch is given after the list:
1) Check the status of the decompression processor core i.e. In_Readyn. If In_Readyn = 1, go to Step 3, else go to Step 2.
2) Send a series of data to be decompressed i.e. Interface_Data_In, where its
size in bits should not be more than PollAmount.
3) Check the availability of decompressed data from the core i.e.
Poll_Amount_Out_Empty. If Poll_Amount_Out_Empty = 1, go to Step 1, else
go to Step 4.
4) Check the value of the decompressed data size in bits i.e. Poll_Amount_Out.
5) Collect the decompressed data from the core i.e. Interface_Data_Out based
on the amount specified by the Poll_Amount_Out value. Go to Step 3.
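A corresponding minimal C sketch of the decompression loop follows, under the same assumptions and reusing the hypothetical core_* wrapper functions declared in the compression sketch above; it is an illustration only, not the original firmware.

#include <stdint.h>

void decompress_buffer(const uint32_t *comp, unsigned comp_words,
                       uint32_t *dst, unsigned *dst_bits)
{
    unsigned sent = 0, out_words = 0;
    *dst_bits = 0;

    while (sent < comp_words || !core_poll_out_empty()) {

        /* Steps 1-2: feed at most PollAmount (= 64) bits of compressed data,
           i.e. up to two 32-bit interface words, when the core is ready.   */
        if (!core_in_readyn() && sent < comp_words) {
            core_write_data(comp[sent++]);
            if (sent < comp_words)
                core_write_data(comp[sent++]);
        }

        /* Steps 3-5: whenever restored data are available, read the restored
           size in bits and then the corresponding interface words.         */
        if (!core_poll_out_empty()) {
            unsigned bits  = core_read_poll_out();
            unsigned words = (bits + 31) / 32;
            while (words--)
                dst[out_words++] = core_read_data();
            *dst_bits += bits;
        }
    }
}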
Based on the specified interfacing procedures, firmware for testing the
compression system hardware is developed using the C programming language. The
firmware will be run on the embedded NIOS processor to control both compression
and decompression processor cores. It is written to first compress a fixed set of
either random or highly redundant data. The resulting compressed data are then used
as input stimuli to the decompression processor core to simulate the decompression
process. Finally, the original source and decompressed data are compared to verify
the functionality of the compression system. Figures 5.10 and 5.11 show the processing steps of the firmware for the compression and decompression processes, respectively. The complete C source code is available in Appendix H.
[Flowchart omitted: the firmware polls the compression core status, writes Interface_Data_In words and the corresponding Poll_Amount_In block sizes while source data remain (at most PollAmount source data per block), reads Poll_Amount_Out and the corresponding Interface_Data_Out words whenever compressed data become available, and writes an end marker (Poll_Amount_In = 0) once all source data have been written.]
Figure 5.10: ASM flowchart of the compression process firmware design
[Flowchart omitted: the firmware polls the decompression core status, writes compressed Interface_Data_In words while input data remain, reads Poll_Amount_Out and the corresponding Interface_Data_Out words whenever decompressed data become available, and ends once all data have been fully decompressed.]
Figure 5.11: ASM flowchart of the decompression process firmware design.
5.5
Summary
Development of the data compression system hardware evaluation platform is
presented in this chapter. The procedure for building the overall system using the CAD tool, as well as the design details of the slave interface module, are explained. In addition, development of the test firmware is also described. The next chapter presents the results of design verification and system hardware testing, and analyzes the design performance compared to the original implementation.
CHAPTER 6
DESIGN SIMULATION AND HARDWARE TESTING
This chapter reports the results and findings from simulations and hardware
testing of the compression and decompression processor core design.
6.1
Design Simulation
The first part of the verification phase is to perform design simulation in a software environment to verify the hardware functionality of the data compression core design.
The simulation is done both in functional and timing modes and is
performed using a finite set of input test stimuli.
6.1.1
Test Setup
Three types of test scenarios are carried out in this verification phase. The
first test is to verify the functionality of the Verilog design by comparing its outputs to those of the original VHDL design. Since the VHDL design is assumed to be the functional
reference model, results from this test scenario can be used to verify whether the
Verilog design meets its functional requirements. As discussed in Chapter 2, we can conclude that the Verilog design is functionally correct if it produces the same outputs as the VHDL design for all the test stimuli used.
Once the functionality of the Verilog design has been validated, we can now
use it as the reference model to validate functionality of the generic dual port
memory and FIFO designs. Therefore, in the second test scenario, all test stimuli used in the first scenario are rerun on two separate hardware setups: hardware using IP memories and hardware using generic memories. If results from both hardware setups are the same, we can then conclude that the generic memory designs meet their functional requirements.
The final test scenario is to verify whether the proposed hardware patch to
solve the decompression core design bug is acceptable. The test stimulus used in this setup is highly redundant source data. It is first compressed using the compression core, and the compressed outputs are then used as inputs to the decompression core. The restored data are then compared with the original source
data to validate the functionality of the decompression behavior.
Since both HDL designs are configurable, the same hardware parameters are
used so that the generated hardware of both HDL designs is architecturally
equivalent. The chosen hardware parameters used in the simulation are listed in
Table 6.1. Please refer to Table 1.1 of Chapter 1 for a detailed description of each parameter and its effect on the generated hardware in terms of resource, speed and compression-saving trade-offs.
Table 6.1: Hardware parameters for compression and decompression processor core
design simulation
Hardware Parameter    Value
SymbolWIDTH           16
DicLEVEL              7
MAXWIDTH              5
InterfaceWIDTH        5
PollAmount            64
6.1.2
Test Result of Compression Processor Core
The compression core of both HDL designs is simulated using the same test
stimuli, and their output data are then compared to each other. It is observed that the compressed data and the corresponding size in bits produced by the Verilog design are the same as those produced by the VHDL design. This confirms that the functionality of the compression processor core expressed in Verilog is correct and that the design complies with its required specifications.
Next, the same test stimuli are rerun on the Verilog compression core, once for the design using the IP memories from ALTERA and once for the design using the generic memory modules. The simulation results show that the outputs of the compression processor core employing either generic or IP memory modules are the same for all test stimuli used. This verifies the design functionality of the generic dual-port memory and FIFO.
The third test scenario described in the previous section is not carried out for the compression processor core, because the purpose of that test is to prove that the hardware bug of the decompression processor core has been solved. No test result is presented for this case.
Details of the input test stimuli used in the compression processor core
verification and the corresponding simulation output waveforms can be found in
Appendix I.
6.1.3
Test Result of Decompression Processor Core
The decompression core of both HDL designs is simulated using the same
test stimuli, and the outputs of the core are compared between the two HDL implementations. The simulation results show that the outputs of both HDL designs are the same for all test stimuli used. Again, this confirms that the Verilog design of the decompression processor core is functionally correct and complies with the
required specifications.
Next, the same test stimuli are rerun on the decompression core, both for the design using IP memory modules and for the design using the generic memories. Again, the simulation results show that the outputs of the decompression processor core employing either generic or IP memory modules are the same for all test stimuli used. This verifies the design functionality of the generic dual-port memory and FIFO for the decompression processor core.
For the third test scenario, the decompression core of the VHDL design is first simulated using a highly redundant test stimulus that has been compressed earlier. The simulation result shows that the restored data do not match the original source data, which confirms that there is a hardware bug in the VHDL design. Next, the Verilog design of the decompression core, using either IP or generic memories, is simulated using the same test stimulus. Based on the simulation results, the restored data of the
decompression processor core now match the original source data bit-by-bit, which
proves the proposed hardware patch has solved the decompression processor core
design bug.
Details of the input test stimuli used in the decompression processor core
verification and the corresponding simulation output waveforms can be found in
Appendix I.
6.2
Hardware Testing
In the second verification approach, functionality of the compression and
decompression processor core is verified using the hardware evaluation platform.
Using this hardware platform, verification of the design can be carried out in real time. This means larger test data sets can be used, since the response of the hardware test is very fast.
6.2.1
Test Setup
Hardware testing is done using two sets of fixed data of variable sizes. The
first data set consists of a random stream of symbols, generated using a pseudo-random bit sequence generator with an arbitrary polynomial. The idea is to have source data with randomly distributed redundancies and to use this data set to evaluate the functionality and performance of the data compression system. The second data set consists of a highly redundant stream of symbols. Its use is two-fold: to evaluate the functionality and performance of the data compression system and to ensure the bug fix of the decompression processor core is not broken. It is expected that the data compression system performs better when processing the redundant data set than the random data set, because highly optimized LZSS codewords are generated when the source data contain a lot of redundancy.
The use of fixed data sets is dictated by the unavailability of a data compression software application that could serve as an interface to the embedded data compression system, for example from a host PC. With such application software, it would be much easier to switch between different sets of test data via a standard graphical user interface and browsing of the computer file system. However, the scope of this project does not include developing such a data compression software application. It is recommended that this application be developed in the future.
Ten test stimuli with variable data sizes from each random and redundant
data set are used to evaluate the design. Design functionality is validated if the
outputs of the data compression system match the test stimuli bit-by-bit, since hardware testing is done as a cascaded compression-decompression system. In addition, the compression ratio for each hardware test is determined to evaluate its performance. The compression ratio is calculated using the following formula:
Compression Ratio = Size of compressed data / Size of source data        (1)
Based on the above formula, a smaller compression ratio means better compression performance of the system. Also, by definition, compression
is only achieved if the compression ratio is less than 1, which means compression
performance is reduced as the ratio approaches 1. For ratio greater than 1, the
compression system is no longer practically useful because the size of the
compressed data is larger than the source data it is supposed to compress.
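As a worked illustration of formula (1), using figures that also appear in Table 6.2 below:

\[
\text{Compression Ratio} = \frac{340}{320} = 1.0625 \quad (\text{expansion}), \qquad
\frac{42}{320} \approx 0.13 \quad (\text{strong compression}).
\]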
6.2.2
Test Result of Data Compression System
Results of the data compression system hardware testing are summarized in
Table 6.2. The system passes all the performed tests, which means the data compression system design complies with its required functional specifications. In terms of compression performance, it is observed that the compression ratio achieved by the system when processing highly redundant data is very small, which indicates high performance. However, when processing the random data sets, its compression performance degrades significantly, as shown by compression ratios greater than 1. This means the compressed data are actually larger than the original source.
Table 6.2: Results of data compression system hardware test
Data Set     Size of Input    Size of Compressed    Compression    Test
             Data (Bits)      Data (Bits)           Ratio          Result
Random       320              340                   1.0625         Pass
Random       480              510                   1.0625         Pass
Random       640              670                   1.0469         Pass
Random       800              840                   1.0500         Pass
Random       960              1010                  1.0521         Pass
Random       1120             1180                  1.0536         Pass
Random       1280             1350                  1.0547         Pass
Random       1440             1520                  1.0556         Pass
Random       1600             1690                  1.0563         Pass
Random       1760             1860                  1.0568         Pass
Redundant    320              42                    0.1313         Pass
Redundant    480              42                    0.0875         Pass
Redundant    640              55                    0.0859         Pass
Redundant    800              57                    0.0713         Pass
Redundant    960              57                    0.0594         Pass
Redundant    1120             68                    0.0607         Pass
Redundant    1280             72                    0.0563         Pass
Redundant    1440             72                    0.0500         Pass
Redundant    1600             83                    0.0519         Pass
Redundant    1760             87                    0.0494         Pass
Upon further analysis, it is observed that the random test data used in the
hardware test do not contain any redundancy at all. This means no match is found in the dictionary and hence no LZSS codeword is generated. Recall that the LZSS compression algorithm adds an extra flag bit to each explicit symbol when no match is found, which effectively means every source symbol is represented by one extra bit after the compression process. Because of this, the size of the compressed data produced by the system is slightly larger than that of the source data. This behavior is normal for the LZSS compression technique.
Details of the test stimuli used and results of the hardware test described in
this section can be found in Appendix J.
6.3
Performance Analysis
In this section, hardware performance of the design is evaluated and
benchmarked against the original implementation. It starts with a discussion of the applied performance metrics, followed by an analysis of the performance benchmarking results.
6.3.1
Performance Metrics
Performance of the Verilog HDL design of the compression and
decompression processor core is analyzed based on two main hardware performance
metrics. The first metric is the area evaluation, where the number of logic elements and memory bits required on the targeted programmable logic device to implement the design is determined. It provides a measure of the required hardware cost when fabricating the design.
The second performance metric is the evaluation of the hardware data throughput, which is the rate at which source data can be compressed or compressed data restored. The formula used to calculate the throughput is:

Throughput, TP = f × SymbolWIDTH        (2)
where f is the maximum operating frequency of the design implemented on a
particular device, and SymbolWIDTH is the size of each source symbol (Yeem,
2002).
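For instance, with the 16-bit SymbolWIDTH used here and the maximum operating frequency reported for the VHDL compression core in Table 6.3, formula (2) gives:

\[
TP = 61.39\ \text{MHz} \times 16 \approx 982\ \text{Mbps}.
\]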
Benchmarking the compression ratio is not done. This is because the scope
of this project does not include any enhancements in terms of improving its
compression performance. Therefore, the compression ratio of the Verilog design is expected to remain the same as originally reported, where the average
compression ratio is 2.3:1 for text data, 7.9:1 for binary data, and 8.6:1 for image
data (Yeem, 2002).
6.3.2
Performance Comparison
Tables 6.3 and 6.4 summarize the benchmarking results of the performance analysis of both HDL designs for the compression and decompression cores, respectively. The performance metric figures, as described in the previous section, are obtained by porting both HDL designs onto the same programmable logic device, which in this case is the APEX20KE FPGA. In addition, the same hardware parameters are used in both designs in order to facilitate a fair comparison. Refer to Table 6.1 above for details of the hardware parameters used.
Table 6.3: Benchmarking results of compression processor core performance
between VHDL and Verilog design
                      VHDL      Verilog
Logic Element         7426      7429
Memory Bit            23040     23040
Frequency (MHz)       61.39     60.88
Throughput (Mbps)     982       974
Table 6.4: Benchmarking results of decompression processor core performance
between VHDL and Verilog design
                      VHDL      Verilog
Logic Element         1248      1232
Memory Bit            26112     26112
Frequency (MHz)       47.64     53.38
Throughput (Mbps)     762       854
Based on the performance metric results, there is no appreciable difference between the two HDL designs for the compression processor core. Both the area evaluation and the data throughput metrics are almost identical. This is expected since the proposed enhancement for the compression processor core only involves replacing the IP memories with generic ones of the same size. No other design modifications are made.
For the decompression processor core, no appreciable difference in the area evaluation of the two HDL designs is observed either, for the same reason as above. However, there is about a 12% improvement in the maximum operating frequency, and hence in the data throughput, of the Verilog RTL design. This is probably due to the hardware patch designed to solve the abnormal decompression behavior, where simultaneous reading and writing of the same memory location of the decompression dictionary is eliminated. Thus, the tighter functional constraint on the decompression processor core is relaxed, which enables the synthesis tool to further optimize the design and results in a considerable increase in the maximum operating frequency of the hardware.
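The quoted figure can be checked directly against Table 6.4:

\[
\frac{53.38 - 47.64}{47.64} \approx 0.12, \qquad \frac{854 - 762}{762} \approx 0.12 .
\]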
6.4
Summary
Results of design verification and hardware performance metrics are
presented. Based on the simulation and hardware testing done, it is concluded that
the Verilog design of the compression and decompression processor core complies
with its required functional specifications. In addition, the hardware bug of the decompression processor core has been solved, and the core can now decompress data with any level of redundancy. Furthermore, there is about a 12% improvement in the maximum operating frequency of the decompression core, which translates to a higher data throughput. Meanwhile, no appreciable difference is observed in the area evaluation when implementing the design on the target device. This result is expected since the Verilog design maintains the original architecture to preserve the design intent of configurable hardware. In the next chapter, the conclusion of this project is presented.
CHAPTER 7
CONCLUSIONS
This chapter concludes the findings of this project and discusses the potential
future work to further enhance the compression and decompression processor core
design.
7.1
Concluding Remarks
A proprietary high-speed data compression core design is analyzed. It is
found that the design has two limitations. The first limitation is the use of several
technology-dependent IP modules, which prevents hardware implementation in alternative technologies.
The second limitation is the abnormal decompression
processor core behavior while processing highly redundant data, which effectively
renders the design unusable because it does not meet the required functional
specifications.
This research project proposes design enhancements to overcome the
identified limitations.
The first enhancement is designing generic modules as alternative replacements for the technology-dependent IP cores and employing a conditional compilation approach to select the best compromise between the available design resources and constraints. The second enhancement is redesigning the dual-port memory controller logic of the decompression processor core dictionary to prevent simultaneous writing and reading of the same memory location, thereby eliminating the unpredictable behavior of the decompression dictionary output. With these enhancements, the design can now be ported to different logic devices and technologies, and both its compression and decompression results are functionally correct for any level of data redundancy.
A complete data compression system and its associated test firmware are also developed, forming the hardware evaluation platform. Using this evaluation platform, the functionality of the design running on real hardware is proven.
7.2
Recommendation for Future Work
There are several potential future enhancements that can be carried out to further improve the compression processor design. The following sections detail the background and proposed solutions for this potential future work.
7.2.1
Improvement of LZSS Codeword Generation Module
There is a limitation in the compression processor core design which
effectively reduces its compression saving performance. This limitation is due to the
way the LZSS codeword generation logic of the compression processor core generates the LZSS codeword. In the design, when a match with length greater than zero is detected, an LZSS codeword that functions as a pointer to the dictionary elements is generated, i.e. (C_L, C_P), where C_L is the length of the match and C_P is the position of the matched phrase in the dictionary. The module only verifies whether a match is detected or not (C_L > 0), without confirming whether the detected matched phrase would actually benefit from encoding. Under certain system requirements, encoding only pays off for matched phrases whose length exceeds a certain value, which is not necessarily zero, i.e. C_L > x, where x ≠ 0.
This limitation is best described through an example. Assume the following system requirements: each source data symbol is 8 bits wide, and the specified LZSS compression hardware uses 13 bits to address the dictionary and 5 bits to limit the maximum match length. Since an LZSS codeword is represented as a position-length pair, this requirement results in each LZSS codeword being represented by 18 bits (13 + 5 bits), while each source symbol is represented by only 8 bits. In this example, encoding a matched phrase of length one or two symbols uses 10 and 2 extra bits, respectively. As a result, the overall compression performance might degrade since more bits are used to represent the same information. In extreme cases, the encoded data might actually take up more bandwidth than the source data, which defeats the purpose of compression in the first place.
The proposed improvement to the LZSS codeword generation module is to include decision-making logic that determines whether a matched phrase detected by the LZSS coder would benefit from encoding or not. The decision is based on the specified hardware requirements, namely the width of the input symbol, the size of the dictionary used and the maximum matched-phrase length allowed. Given any dictionary size (D) and maximum match length (M), an LZSS codeword requires 1 + log2(D) + log2(M) bits for encoding. The first bit is a flag used to distinguish between encoded and original data. Therefore, given the width of each source symbol (S), a matched phrase of length (L) detected in the dictionary would benefit from encoding if and only if 1 + log2(D) + log2(M) ≤ S × L. Otherwise, the encoded data take up more space than the source data, which would reduce compression performance.
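This decision rule can be captured in a small piece of C logic, shown below as a hedged sketch; the function name and argument set are illustrative only, with dic_bits and len_bits denoting the codeword field widths log2(D) and log2(M).

#include <stdbool.h>

/* Encoding pays off only when the (flag + position + length) codeword is no
   longer than the raw symbols it replaces: 1 + log2(D) + log2(M) <= S * L.  */
static bool match_benefits_from_encoding(unsigned dic_bits, unsigned len_bits,
                                         unsigned symbol_width, unsigned match_len)
{
    unsigned codeword_bits = 1u + dic_bits + len_bits;
    return codeword_bits <= symbol_width * match_len;
}

/* For the example above (13-bit dictionary pointer, 5-bit length field and
   8-bit symbols): a two-symbol match does not qualify (19 > 16 bits), while
   a three-symbol match does (19 <= 24 bits).                                */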
7.2.2
Employing Adaptive Huffman Coding Technique
Another design limitation, which restricts the compression saving performance of the design, is the use of a predefined encoding table by the Huffman coder and decoder modules. The encoding table was defined based on the assumption that the length of an LZSS codeword is not uniformly distributed, with shorter codes occurring more frequently than longer ones (Yeem, 2002). Therefore, fewer bits are assigned to short codes, and vice versa. For example, one bit is used to represent an LZSS codeword with length 1, three bits are used to represent codewords with lengths 2 or 3, five bits are used to represent codewords with lengths 4 to 7, and so on. This approach essentially ignores the entropy information of the source data, which results in suboptimal performance of the Huffman coding technique.
To overcome this limitation, it is recommended that an adaptive Huffman coding module be developed to replace the predefined code table of the original design. One such technique was proposed by Faller, Gallagher and Knuth, known as Algorithm FGK, and another by Vitter, known as Algorithm V (Lelewer and Hirschberg, 1987). Adaptive Huffman coding is a one-pass algorithm for constructing Huffman codes, which means symbols are encoded “on the fly”. An initial pass to determine the symbol frequencies necessary for computing an optimal tree, as in conventional Huffman coding, is not required. Instead, coding is based on a dynamically varying Huffman tree.
Using this technique, both the coder and decoder maintain equivalent
dynamically varying Huffman trees, and the coding is done in real-time. The tree
that the coder uses to encode the (t+1)-th symbol in the message (and that the decoder uses to reconstruct the (t+1)-th symbol) is a Huffman tree for the first t symbols of the
message. Both the coder and decoder start with the same tree and thereafter stay
synchronized by using the same algorithm to modify the tree after each symbol is
processed, thus always having equivalent copies of the tree. Due to this, neither the
tree nor its modification needs to be transmitted, unlike the case for conventional
Huffman coding. The processing time is proportional to the length of the symbol’s
encoding, so that the processing can be done in real time.
REFERENCES
Abke J., Barke, E., Heeke, M., Kannemacher, D. RIG: Targeting Designs With
Embedded Memories to FPGA and ASIC Technologies. Institute of
Microelectronics System, University of Hanover and Sican GmbH, Hanover,
Germany.
Actel Corporation. (1997). RTL Register-Based Memory Implementations. Actel
Corporation.
Altera Corporation. (2000). Excalibur Development Kit With Nios Embedded
Processor. Altera Corporation.
Altera Corporation. (2002). Designing With ESB in APEX II Device. Altera
Corporation.
Altera Corporation. (2003a). Introduction to Quartus II. Altera Corporation.
Altera Corporation. (2003b). SOPC Builder Data Sheet. Altera Corporation.
Altera Corporation. (2003c). Nios Embedded Processor 32-Bit Programmer’s
Reference Manual. Altera Corporation.
Altera Corporation. (2006). Avalon Memory-Mapped Interface Specification. Altera
Corporation.
Altera Corporation. (2007). NIOS Hardware Development Tutorial. Altera
Corporation.
Bergeron, J. (2000). Writing Testbenches: Functional Verification of HDL Models.
(4th Ed.). Norwell, Massachusetts : Kluwer Academic Publishers
Chen, D., Chiang, Y. J., Memon, N., Wu, X. (2006). Alphabet Partitioning
Technique for Semi-Adaptive Huffman Coding of Large Alphabets. Technical
Report TR-CIS February 2006. Department of Computer and Information
Science, Polytechnic University, NY, USA.
Cummings, C. E. (2002). Simulation and Synthesis Techniques for Asynchronous
FIFO Design. Synopsys User Group Conference (SNUG). March 2002, San
Jose, CA, USA, 1-20.
Cummings, C. E., Alfke, P. (2002). Simulation and Synthesis Techniques for
Asynchronous FIFO Design with Asynchronous Pointer Comparisons.
Synopsys User Group Conference (SNUG). March 2002, San Jose, CA, USA,
1-16.
Feldman, D. (1998). A Brief Introduction to Information Theory, Excess Entropy and
Computational Mechanics. College of the Atlantic, ME, USA.
Lelewer, D. A., Hirschberg, D. S. (1987). Data Compression. ACM Computing
Surveys. 19(3), 261-296.
Lempel, A., Ziv, J. (1977). A Universal Algorithm for Sequential Data Compression.
IEEE Transaction on Information Theory. 23, 337-342.
Liu, L. Y., Wang, J. F., Wang, R. J., Lee, J. Y. (1995). Design and Hardware
Architectures for Dynamic Huffman Coding. IEEE Proceedings on Computer
and Digital Technology. 142(6), 411-418.
Pylak P. (1996). Efficient Modifications of LZSS Compression Algorithm. Faculty
of Mathematics and Science, Catholic University of Lublin, Lublin, Poland.
Sagdeo, V. (1998). The Complete Verilog Handbook. (1st ed.). Hingham, MA, USA: Kluwer Academic Publishers.
Shannon, C. E. (1948). A Mathematical Theory of Communication. The Bell System
Technical Journal. 27, 379-656.
Vitter, J. S. (1987). Design and Analysis of Dynamic Huffman Codes. Journal of the
Association of Computing Machinery. 34(4), 825-845.
Vitter, J. S. (1989). Algorithm 673 Dynamic Huffman Coding. ACM Transaction on
Mathematical Software. 15(2), 158-167.
Yeem, K. M. (2002). Design of a Data Compression Embedded Core for High-Speed
Computing Applications. Master of Engineering (Electrical). Universiti
Teknologi Malaysia, Skudai.
APPENDIX A
EXAMPLE OF LZSS COMPRESSION ALGORITHM
Examples of the encoding and decoding process based on the LZSS compression
algorithm are presented in this appendix.
A.1
LZSS Encoding Example
1) Given a source stream of symbols, S = {A, A, B, A, B, B, C}, where the size of
the source stream, l(S) = 7
2) Let the content of the dictionary, D = {K, K, K, K, K, K, K, K}, where the size
of the dictionary, l(D) = 8 and the predefined symbol is K.
3) Let LS = 7; therefore n = l(D) + LS = 8 + 7 = 15.
4) Initially, the integer h0 = 0, and the content of B0 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  K  K  K  K  K  K  A  A  B  A  B  B  C
5) LZSS encoding begins:
a. For i = 1,
i. Determine L(p), where j = n – LS = l(D) = 8
L(p) = 0, p = don’t care
ii. Determine the codeword,
B0 (9, 9) = {A}, therefore C1 = 0A
h1 = 1
iii. The content of B1 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  K  K  K  K  K  A  A  B  A  B  B  C  -
b. For i = 2,
i. Determine L(p),
B1(8, 8) = B1(9, 9) = {A}, therefore L(p) = 1, p = 8
ii. Determine the codeword,
C2 = 171
h2 = 2
iii. The content of B2 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  K  K  K  K  A  A  B  A  B  B  C  -  -
c. For i = 3,
i. Determine L(p),
L(p) = 0, p = don’t care
ii. Determine the codeword,
B2 (9, 9) = {B}, therefore C3 = 0B
h3 = 3
iii. The content of B3 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  K  K  K  A  A  B  A  B  B  C  -  -  -
d. For i = 4,
i. Determine L(p),
B3(7, 8) = B3(9, 10) = {A, B}, therefore L(p) = 2, p = 7
ii. Determine the codeword,
C4 = 162
h4 = 5
iii. The content of B4 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  K  A  A  B  A  B  B  C  -  -  -  -  -
e. For i = 5,
i. Determine L(p),
B4(8, 8) = B4(9, 9) = {B}, therefore L(p) = 1, p = 8
ii. Determine the codeword,
C5 = 171
h5 = 6
iii. The content of B5 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  K  A  A  B  A  B  B  C  -  -  -  -  -  -
f. For i = 6,
i. Determine L(p),
L(p) = 0, p = don’t care
ii. Determine the codeword,
B5 (9, 9) = {C}, therefore C6 = 0C
h6 = 7
iii. The content of B6 is:
Position: 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Content:  K  A  A  B  A  B  B  C  -  -  -  -  -  -  -
6) LZSS encoding done
a. Summary of the codeword:
C1 = 0A
C2 = 171
C3 = 0B
C4 = 162
C5 = 171
C6 = 0C
A.2
LZSS Decoding Example
1) Given a compressed stream of symbols, C = {0A, 171, 0B, 162, 171, 0C}
2) Let the content of the dictionary, D = {K, K, K, K, K, K, K, K}, where the size
of the dictionary, l(D) = 8 and the predefined symbol is K.
3) Initially, the content of D0 is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  K  K  K  K  K  K
4) LZSS decoding begins:
a. For i = 1,
i. C1 = 0A
ii. The content of D’1 (same as D2) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  K  K  K  K  K  A
iii. S1 = {A}
b. For i = 2,
i. C2 = 171
ii. The content of D’2 (same as D3) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  K  K  K  K  A  A
iii. S2 = {A}
c. For i = 3,
i. C3 = 0B
ii. The content of D’3 (same as D4) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  K  K  K  A  A  B
iii. S3 = {B}
d. For i = 4,
i. C4 = 162
ii. The content of D’4 (same as D5) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  K  A  A  B  A  B
iii. S4 = {A, B}
e. For i = 5,
i. C5 = 171
ii. The content of D’5 (same as D6) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  K  A  A  B  A  B  B
iii. S5 = {B}
f. For i = 6,
i. C6 = 0C
ii. The content of D’6 (same as D7) is:
Position: 1  2  3  4  5  6  7  8
Content:  K  A  A  B  A  B  B  C
iii. S6 = {C}
5) LZSS decoding done
a. Summary of the restored string:
S1 = {A}
S2 = {A}
S3 = {B}
S4 = {A, B}
S5 = {B}
S6 = {C}
APPENDIX B
EXAMPLE OF HUFFMAN CODING
This appendix explains the Huffman coding technique through an example. The
idea behind Huffman coding is to use shorter bit patterns for more common
characters and longer bit patterns for less common characters.
First of all, Huffman coding requires entropy or statistical information of the
source data to be encoded.
This entropy information may be expressed as
probabilities, frequency counts or other appropriate values. For example, assume we
want to encode the letters A (0.12), E (0.42), I (0.09), O (0.30), U (0.07), listed with
their respective probabilities. The procedure below describes the Huffman coding technique:
1) Consider each of the letters as a symbol with its respective probability.
2) Find two symbols with the smallest probability and combine them into a new
symbol with both letters by adding their probabilities. There may be a choice
between two symbols with the same probability. If this is the case, either symbol
can be chosen. The final tree and codes will be different, but the overall
efficiency of the code will be the same.
3) Repeat step 2 until there is only one symbol left with a probability of 1.
4) At the end, a binary Huffman tree is generated. Label all the left branches of the
tree with a 0 and all the right branches with a 1 (or vice-versa). The code for each
of the letters is the sequence of 0's and 1's that lead to it on the tree, starting from
the symbol with a probability of 1.
The generated binary tree should be like the following:
Figure B.1: Huffman tree example
Based on the generated tree, the Huffman codes for each letter are as follows:
Table B.1: Generated Huffman codes
Letter   Probability   Huffman Code
A        0.12          100
E        0.42          0
I        0.09          1011
O        0.30          11
U        0.07          1010
Note that letters (or symbols) with higher probabilities are represented by
shorter bit patterns, while those with lower probabilities have longer bit patterns.
Effectively, the source data can now be represented by fewer bits since the most common symbols are represented by shorter codes.
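As a short check of the coding gain implied by Table B.1, the expected code length under the given probabilities is

\[
\bar{L} = 0.42(1) + 0.30(2) + 0.12(3) + 0.09(4) + 0.07(4) = 2.02\ \text{bits/symbol},
\]

compared with the 3 bits/symbol that a fixed-length binary code would need for five letters.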
APPENDIX C
COMPRESSION CORE VERILOG CODES
This appendix presents the Verilog source codes of the compression core and
all its sub-modules. First, the design hierarchy is presented, followed by the Verilog
source codes starting from the top level module. Note, however, that the full source
codes are not given here but can be obtained from the author or the supervisor.
C.1
Design Hierarchy of Compression Core
Figure C.1: Design hierarchy of Compression_Hardware
C.2
Verilog Code of Compression_Hardware
module Compression_Hardware (
clk,
rst,
Poll_Amount_In_WE,
Poll_Amount_In,
Interface_Data_In_WE,
Interface_Data_In,
Poll_Amount_Out_RE,
Interface_Data_Out_RE,
In_Readyn,
Poll_Amount_Out_Empty,
Poll_Amount_Out,
Interface_Data_Out_Empty,
Interface_Data_Out );
// Parameter definitions
//----------------------
parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
// Input signals
//----------------------
input clk;
input rst;
input Poll_Amount_In_WE;
input Interface_Data_In_WE;
input Poll_Amount_Out_RE;
input Interface_Data_Out_RE;
input [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_In;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
// Output signals
//----------------------
output In_Readyn;
output Poll_Amount_Out_Empty;
output Interface_Data_Out_Empty;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out;
output [2**InterfaceWIDTH-1:0] Interface_Data_Out;
// Internal variables
//----------------------
// ...
// Module instantiations
//---------------------
Compression_Interface_Module u_Compression_Interface_Module ( ... );
Compression_Unit u_Compression_Unit ( ... );
endmodule
C.3
Verilog Code of Compression_Interface_Module
module Compression_Interface_Module (
clk,
rst,
Poll_Amount_In_WE,
Poll_Amount_In,
Interface_Data_In_WE,
Interface_Data_In,
Poll_Amount_Out_RE,
Interface_Data_Out_RE,
Interface_Data_Out_WE,
Data_Restart,
Interface_Data_Out_In,
Update_Bit,
Valid_Amount_Out_In,
Restart,
Enable,
Symbol,
In_Readyn,
Poll_Amount_Out_Empty,
Poll_Amount_Out,
Interface_Data_Out_Empty,
Interface_Data_Out );
// Parameter definitions
//----------------------
parameter SymbolWIDTH = 16;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
// Input signals
//----------------------
input clk;
input rst;
input Poll_Amount_In_WE;
input Interface_Data_In_WE;
input Poll_Amount_Out_RE;
input Interface_Data_Out_RE;
input Interface_Data_Out_WE;
input Data_Restart;
input Update_Bit;
input [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_In;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
input [2**InterfaceWIDTH-1:0] Interface_Data_Out_In;
input [InterfaceWIDTH-1:0] Valid_Amount_Out_In;
// Output signals
//----------------------
output Restart;
output Enable;
output In_Readyn;
output Poll_Amount_Out_Empty;
output Interface_Data_Out_Empty;
output [SymbolWIDTH-1:0] Symbol;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out;
output [2**InterfaceWIDTH-1:0] Interface_Data_Out;
// Internal wires
//----------------------
// ...
// Module instantiations
//---------------------
Compression_Interface_Data_FIFO u_Compression_Interface_Data_In_FIFO ( ... );
Compression_Valid_Amount_FIFO u_Compression_Valid_Amount_In_FIFO ( … );
Compression_Interface_Data_FIFO u_Compression_Interface_Data_Out_FIFO (…);
Compression_Valid_Amount_FIFO u_Compression_Valid_Amount_Out_FIFO(…);
Compression_Data_In_Controller u_Compression_Data_In_Controller ( … );
Compression_Data_Out_Controller u_Compression_Data_Out_Controller ( … );
endmodule
C.4
Verilog Code of Compression_Interface_Data_FIFO
module Compression_Interface_Data_FIFO (
clk,
rst,
RE,
WE,
Data_In,
Full,
Empty,
Data_Out,
USEDW );
// Parameter definitions
//----------------------
parameter InterfaceWIDTH = 5;
parameter FIFOWIDTH = 8;
// Input signals
//----------------------
input clk;
input rst;
input RE;
input WE;
input [2**InterfaceWIDTH-1:0] Data_In;
// Output signals
//----------------------
output Full;
output Empty;
output [FIFOWIDTH-1:0] USEDW;
output [2**InterfaceWIDTH-1:0] Data_Out;
// Internal wires
//----------------------
wire SCLR = 0;
// Module instantiations
//----------------------
`ifdef VENDORFIFO
lpm_fifo u_LPM_FIFO ( ... );
`else
Sync_FIFO u_sync_fif0 ( ... );
`endif
endmodule
C.5
Verilog Code of Compression_Valid_Amount_FIFO
module Compression_Valid_Amount_FIFO (
clk,
rst,
RE,
WE,
Data_In,
Full,
Empty,
Data_Out );
// Parameter definitions
parameter InterfaceWIDTH = 5;
parameter FIFOWIDTH = 8;
// Input signals
input clk;
input rst;
input RE;
input WE;
input [FIFOWIDTH+InterfaceWIDTH-1:0] Data_In;
// Output signals
output Full;
output Empty;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Data_Out;
// Internal wires
//--------------
wire SCLR = 0;
// Module instantiations
//---------------------
`ifdef VENDORFIFO
lpm_fifo u_LPM_FIFO ( ... );
`else
Sync_FIFO u_sync_fif0 ( ... );
`endif
endmodule
C.6
Verilog Code of Compression_Data_In_Controller
module Compression_Data_In_Controller (
clk,
rst,
Interface_Data_In_Full,
Interface_Data_In_USEDW,
Poll_Amount_In_Empty,
Poll_Amount_In_Out,
Interface_Data_In_Out,
In_Readyn,
Poll_Amount_In_RE,
Interface_Data_In_RE,
Restart,
Enable,
Symbol );
// Parameter definitions
//---------------------
parameter SymbolWIDTH = 16;
parameter InterfaceWIDTH = 5;
parameter RejectWIDTH = 4;
parameter PollAmount = 64;
parameter FIFOWIDTH = 8;
parameter NO_DATA = 0;
parameter CHECK_AMOUNT = 1;
parameter NORMAL_PROCESSING = 2;
parameter FINAL_PROCESSING = 3;
parameter RESTARTING = 4;
// Input signals
//-------------
input clk;
input rst;
input Interface_Data_In_Full;
input Poll_Amount_In_Empty;
input [FIFOWIDTH-1:0] Interface_Data_In_USEDW;
input [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_In_Out;
input [2**InterfaceWIDTH-1:0] Interface_Data_In_Out;
// Output signals
//--------------
output In_Readyn;
output Poll_Amount_In_RE;
output Interface_Data_In_RE;
output Restart;
output Enable;
output [SymbolWIDTH-1:0] Symbol;
reg In_Readyn;
reg Poll_Amount_In_RE;
reg Interface_Data_In_RE;
reg Restart;
reg Enable;
reg [SymbolWIDTH-1:0] Symbol;
// Internal registers
//-------------------
// ...
// Internal wires
//----------------
// ...
// Logic behavior
//-------------------
// ...
endmodule
C.7
Verilog Code of Compression_Data_Out_Controller
module Compression_Data_Out_Controller (
clk,
rst,
Data_Ready,
Data_Restart,
Update_Bit,
Valid_Amount,
Poll_Amount_Out_WE,
Poll_Amount_Out_In );
// Parameter definitions
//---------------------
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
parameter NORMAL = 0;
parameter RESTARTING = 1;
// Input signals
//-------------
input clk;
input rst;
input Data_Ready;
input Data_Restart;
input Update_Bit;
input [InterfaceWIDTH-1:0] Valid_Amount;
// Output signals
//--------------
output Poll_Amount_Out_WE;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out_In;
reg Poll_Amount_Out_WE;
reg [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out_In;
// Internal registers
//-------------------
// ...
// Internal wires
//----------------
// ...
// Logic behavior
//-------------------
// ...
endmodule
C.8
Verilog Code of Compression_Unit
module Compression_Unit (
clk,
rst,
restart,
enable,
symbol,
data_out_ready,
data_restart,
data_out,
update_bit,
valid_amount_out );
// Parameter definitions
//---------------------
parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
// Input signals
//-------------
input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------
output data_out_ready;
output data_restart;
output update_bit;
output [2**InterfaceWIDTH-1:0] data_out;
output [InterfaceWIDTH-1:0] valid_amount_out;
// Internal wires
//--------------
// ...
// Module Instantiations
//----------------------
LZSS_Coder u_LZSS_Coder ( ... );
Fixed_Huffman_Coder u_Fixed_Huffman_Coder ( … );
Data_Packer u_Data_Packer ( … );
endmodule
C.9
Verilog Code of LZSS_Coder
module LZSS_Coder (
clk,
rst,
restart,
enable,
symbol,
data_out_ready,
data_out );
// Parameter definitions
//---------------------
parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
// Input signals
//-------------
input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------
output data_out_ready;
output [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] data_out;
// Internal wires
//--------------
// ...
// Module Instantiations
//----------------------
LZSS_Delay_Tree u_LZSS_Delay_Tree ( ... );
LZSS_Systolic_Dictionary u_LZSS_Systolic_Dictionary ( … );
LZSS_Reduction_Tree u_LZSS_Reduction_Tree ( ... );
LZSS_Codeword_Generator u_LZSS_Codeword_Generator ( ... );
endmodule
C.10
Verilog Code of LZSS_Delay_Tree
module LZSS_Delay_Tree (
clk,
rst,
restart,
enable,
symbol,
restart_out,
enable_out,
symbol_out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
// Input signals
//-------------input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------output restart_out;
output enable_out;
output [SymbolWIDTH-1:0] symbol_out;
// Internal wires
//--------------// …
// Logic behavior
//-------------------// …
endmodule
C.11
Verilog Code of LZSS_Delay_PE
module LZSS_Delay_PE (
clk,
rst,
restart,
enable,
symbol,
restart_out,
enable_out,
symbol_out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
// Input signals
//-------------input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------output restart_out;
output enable_out;
output [SymbolWIDTH-1:0] symbol_out;
// Internal registers
//-------------------// …
// Logic behavior
//-------------------// …
endmodule
C.12
Verilog Code of LZSS_Systolic_Dictionary
module LZSS_Systolic_Dictionary (
clk,
rst,
restart,
enable,
symbol,
size_out);
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
// Input signals
//-------------input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------output [(2**DicLEVEL)*MAXWIDTH-1:0] size_out;
// Internal wires
//--------------// …
// Logic behavior
//--------------------// …
endmodule
C.13
Verilog Code of LZSS_Dictionary_PE
module LZSS_Dictionary_PE (
clk,
rst,
restart,
enable,
entry_in,
symbol,
entry_out,
size );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
// Input signals
//-------------input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] entry_in;
input [SymbolWIDTH-1:0] symbol;
// Output signals
//--------------output [SymbolWIDTH-1:0] entry_out;
output [MAXWIDTH-1:0] size;
// Internal registers
//-------------------// …
// Internal wires
//--------------// …
//Logic behavior
//----------------------// …
endmodule
C.14
Verilog Code of LZSS_Reduction_Tree
module LZSS_Reduction_Tree (
clk,
rst,
data_in,
size,
offset );
// Parameter definitions
//---------------------parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
// Input signals
//-------------input clk;
input rst;
input [(2**DicLEVEL)*MAXWIDTH-1:0] data_in;
// Output signals
//--------------output [MAXWIDTH-1:0] size;
output [DicLEVEL-1:0] offset;
// Internal registers
//------------------// …
// Internal wires
//--------------// ..
// Logic behavior
//--------------------// …
endmodule
C.15
Verilog Code of LZSS_Reduction_Level_Module
module LZSS_Reduction_Level_Module (
clk,
rst,
data_in,
data_out );
// Parameter definitions
//---------------------parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter I = 0;
// Input signals
//-------------input clk;
input rst;
input [2**DicLEVEL*(DicLEVEL+MAXWIDTH)-1:0] data_in;
// Output signals
//--------------output [2**(DicLEVEL-1)*(DicLEVEL+MAXWIDTH)-1:0] data_out;
// Internal wires
//--------------// …
// Logic behavior
//--------------------// …
endmodule
C.16
Verilog Code of LZSS_Reduction_PE
module LZSS_Reduction_PE (
clk,
rst,
data_in,
data_out );
// Parameter definitions
//---------------------parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter I = 0;
// Input signals
//-------------input clk;
input rst;
input [2*(DicLEVEL+MAXWIDTH)-1:0] data_in;
// Output signals
//--------------output [DicLEVEL+MAXWIDTH-1:0] data_out;
// Internal registers
//--------------------
// …
// Internal wires
//--------------// …
// Logic behavior
//---------------------// …
endmodule
C.17
Verilog Code of LZSS_Codeword_Generator
module LZSS_Codeword_Generator (
clk,
rst,
restart,
enable,
symbol,
size,
offset,
data_out_ready,
data_out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter RejectWIDTH = 4;
parameter NORMAL = 0;
parameter PENDING = 1;
// Input signals
//-------------input clk;
input rst;
input restart;
input enable;
input [SymbolWIDTH-1:0] symbol;
input [MAXWIDTH-1:0] size;
input [DicLEVEL-1:0] offset;
// Output signals
//--------------output data_out_ready;
output [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] data_out;
reg data_out_ready;
reg [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] data_out;
// Internal reg
//--------------// …
// Internal wires
//---------------// …
// Logic behavior
//---------------------// …
endmodule
C.18
Verilog Code of Fixed_Huffman_Coder
module Fixed_Huffman_Coder (
clk,
rst,
LZSS_ready,
data_in,
last_data_out,
data_out,
valid_count );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter RejectWIDTH = 4;
// Input signals
//-------------input clk;
input rst;
input LZSS_ready;
input [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] data_in;
// Output signals
//--------------output last_data_out;
output [(2*MAXWIDTH)+DicLEVEL+SymbolWIDTH-1:0] data_out;
output [InterfaceWIDTH-1:0] valid_count;
reg last_data_out;
reg [(2*MAXWIDTH)+DicLEVEL+SymbolWIDTH-1:0] data_out;
reg [InterfaceWIDTH-1:0] valid_count;
// Internal reg
//--------------// …
// Internal wires
//---------------// …
// Logic behavior
//---------------------// …
endmodule
C.19
Verilog Code of Data_Packer
module Data_Packer (
clk,
rst,
last_data_in,
data_in,
valid_count,
data_out_ready,
data_restart,
data_out,
update_bit,
valid_amount_out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter PACKING = 0;
parameter RESTARTING = 1;
// Input signals
//-------------input clk;
input rst;
input last_data_in;
input [(2*MAXWIDTH)+DicLEVEL+SymbolWIDTH-1:0] data_in;
input [InterfaceWIDTH-1:0] valid_count;
// Output signals
//--------------output data_out_ready;
output data_restart;
output update_bit;
output [2**InterfaceWIDTH-1:0] data_out;
output [InterfaceWIDTH-1:0] valid_amount_out;
// Internal reg
//--------------// …
// Internal wires
//---------------// ...
// Logic behavior
//-------------------// ...
endmodule
APPENDIX D
DECOMPRESSION CORE VERILOG CODES
This appendix presents the Verilog source codes of the decompression core
and all its sub-modules. First, the design hierarchy is presented, followed by the
Verilog source codes starting from the top level module. Note, however, that the full
source codes are not given here but can be obtained from the author or the supervisor.
D.1
Design Hierarchy of Decompression Core
Figure D.1: Design hierarchy of Decompression_Hardware
D.2
Verilog Code of Decompression_Hardware
module Decompression_Hardware (
clk,
rst,
Interface_Data_In_WE,
Interface_Data_In,
Poll_Amount_Out_RE,
Interface_Data_Out_RE,
In_Readyn,
Poll_Amount_Out_Empty,
Poll_Amount_Out,
Interface_Data_Out_Empty,
Interface_Data_Out );
// Parameter definitions
//----------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
// Input signals
//----------------------input clk;
input rst;
input Interface_Data_In_WE;
input Poll_Amount_Out_RE;
input Interface_Data_Out_RE;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
// Output signals
//----------------------output In_Readyn;
output Poll_Amount_Out_Empty;
output Interface_Data_Out_Empty;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out;
output [2**InterfaceWIDTH-1:0] Interface_Data_Out;
// Internal wires
//----------------------// …
// Module instantiations
//----------------------
Decompression_Interface_Module u_Decompression_Interface_Module ( … );
Decompression_Unit u_Decompression_Unit ( … );
endmodule
D.3
Verilog Code of Decompression_Interface_Module
module Decompression_Interface_Module (
clk,
rst,
Interface_Data_In_WE,
Interface_Data_In,
Poll_Amount_Out_RE,
Interface_Data_Out_RE,
Interface_Data_In_RE,
Data_Out_Ready,
Data_Restart,
Data_Out,
Interface_Data_In_Empty,
Interface_Data_In_Out,
In_Readyn,
Poll_Amount_Out_Empty,
Poll_Amount_Out,
Interface_Data_Out_Empty,
Interface_Data_Out );
// Parameter definitions
//----------------------parameter SymbolWIDTH = 16;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 4;
// Input signals
//-----------------------
input clk;
input rst;
input Interface_Data_In_WE;
input Poll_Amount_Out_RE;
input Interface_Data_Out_RE;
input Interface_Data_In_RE;
input Data_Out_Ready;
input Data_Restart;
input [SymbolWIDTH-1:0] Data_Out;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
// Output signals
//----------------------output Interface_Data_In_Empty;
output In_Readyn;
output Poll_Amount_Out_Empty;
output Interface_Data_Out_Empty;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out;
output [2**InterfaceWIDTH-1:0] Interface_Data_In_Out;
output [2**InterfaceWIDTH-1:0] Interface_Data_Out;
// Internal wires
//----------------------// …
// Module instantiations
//----------------------
Decompression_Interface_Data_FIFO u_Decompression_Interface_Data_In_FIFO ( … );
Decompression_Interface_Data_FIFO u_Decompression_Interface_Data_Out_FIFO ( … );
Decompression_Valid_Amount_FIFO u_Decompression_Valid_Amount_Out_FIFO ( ... );
Decompression_Data_In_Controller u_Decompression_Data_In_Controller ( … );
Decompression_Data_Out_Controller u_Decompression_Data_Out_Controller ( … );
endmodule
D.4
Verilog Code of Decompression_Interface_Data_FIFO
module Decompression_Interface_Data_FIFO (
clk,
rst,
RE,
WE,
Data_In,
Full,
Empty,
Data_Out,
USEDW );
// Parameter definitions
//----------------------parameter InterfaceWIDTH = 5;
parameter FIFOWIDTH = 8;
// Input signals
//----------------------input clk;
input rst;
input RE;
input WE;
input [2**InterfaceWIDTH-1:0] Data_In;
// Output signals
//----------------------output Full;
output Empty;
output [FIFOWIDTH-1:0] USEDW;
output [2**InterfaceWIDTH-1:0] Data_Out;
// Internal wires
//----------------------// ..
// Module instantiations
//----------------------
`ifdef VENDORFIFO
lpm_fifo u_LPM_FIFO ( ... );
`else
Sync_FIFO u_sync_fif0 ( ... );
`endif
endmodule
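The `ifdef VENDORFIFO and `ifdef VENDORRAM switches used in these FIFO and RAM wrappers select between the vendor IP cores (lpm_fifo, lpm_ram_dp) and the generic modules listed in Appendix E. As an illustration only, such switches could be driven from a single project-wide include file; the file name and comments below are assumptions made for this sketch and are not part of the original source.
// config.vh -- hypothetical project-wide configuration include (illustrative only,
// not part of the original source). Define the macros when the target technology
// provides the vendor IP cores; omit them to fall back on the generic Appendix E modules.
`define VENDORFIFO   // select lpm_fifo for every FIFO instance
`define VENDORRAM    // select lpm_ram_dp for every dual-port RAM instance
Each design file would then `include "config.vh" ahead of its `ifdef checks, so a single edit retargets every FIFO and RAM instance in the core.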
D.5
Verilog Code of Decompression_Valid_Amount_FIFO
module Decompression_Valid_Amount_FIFO (
clk,
rst,
RE,
WE,
Data_In,
Full,
Empty,
Data_Out );
// Parameter definitions
//----------------------parameter InterfaceWIDTH = 5;
parameter FIFOWIDTH = 8;
// Input signals
//--------------input clk;
input rst;
input RE;
input WE;
input [FIFOWIDTH+InterfaceWIDTH-1:0] Data_In;
// Output signals
//--------------output Full;
output Empty;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Data_Out;
// Internal wires
//--------------// ...
// Module instantiations
//----------------------
`ifdef VENDORFIFO
lpm_fifo u_LPM_FIFO ( ... );
`else
Sync_FIFO u_sync_fif0 ( ... );
`endif
endmodule
D.6
Verilog Code of Decompression_Data_In_Controller
module Decompression_Data_In_Controller (
Interface_Data_In_Full,
Interface_Data_In_USEDW,
In_Readyn );
// Parameter definitions
//---------------------parameter PollAmount = 64;
parameter FIFOWIDTH = 8;
// Input signals
//-------------input Interface_Data_In_Full;
input [FIFOWIDTH-1:0] Interface_Data_In_USEDW;
// Output signals
//--------------output In_Readyn;
reg In_Readyn;
// Internal wires
//----------------// ...
//Logic behavior
//-------------------// ...
endmodule
D.7
Verilog Code of Decompression_Data_Out_Controller
module Decompression_Data_Out_Controller (
clk,
rst,
Data_Out_Ready,
Data_Restart,
Data_Out,
Poll_Amount_Out_WE,
Poll_Amount_Out_In,
Interface_Data_Out_WE,
Interface_Data_Out_In );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
parameter NORMAL = 0;
parameter RESTARTING = 1;
// Input signals
//-------------input clk;
input rst;
input Data_Out_Ready;
input Data_Restart;
input [SymbolWIDTH-1:0] Data_Out;
// Output signals
//--------------output Poll_Amount_Out_WE;
output Interface_Data_Out_WE;
output [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out_In;
output [2**InterfaceWIDTH-1:0] Interface_Data_Out_In;
reg Poll_Amount_Out_WE;
reg Interface_Data_Out_WE;
reg [FIFOWIDTH+InterfaceWIDTH-1:0] Poll_Amount_Out_In;
reg [2**InterfaceWIDTH-1:0] Interface_Data_Out_In;
// Internal registers
//-------------------// …
// Internal wires
//----------------// …
//Logic behavior
//--------------------// …
endmodule
D.8
Verilog Code of Decompression_Unit
module Decompression_Unit (
clk,
rst,
Interface_Data_In_Empty,
Interface_Data_In,
Next_Interface_Data_In,
Data_Restart,
Data_Out_Ready,
Data_Out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
// Input signals
//-------------input clk;
input rst;
input Interface_Data_In_Empty;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
// Output signals
//--------------output Next_Interface_Data_In;
output Data_Restart;
output Data_Out_Ready;
output [SymbolWIDTH-1:0] Data_Out;
// Internal wires
//--------------// …
// Module Instantiations
//----------------------
Data_Unpacker u_Data_Unpacker ( … );
Fixed_Huffman_Decoder u_Fixed_Huffman_Decoder ( … );
LZSS_Codeword_FIFO u_LZSS_Codeword_FIFO ( … );
LZSS_Expander u_LZSS_Expander ( … );
endmodule
D.9
Verilog Code of Data_Unpacker
module Data_Unpacker (
clk,
Shift_Interface_Data,
Pending_Interface_Count,
Interface_Data_In,
Unpacked_Data );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter MAXCOUNT = 2*MAXWIDTH + DicLEVEL;
// Input signals
//-------------input clk;
input Shift_Interface_Data;
input [InterfaceWIDTH-1:0] Pending_Interface_Count;
input [2**InterfaceWIDTH-1:0] Interface_Data_In;
// Output signals
//--------------output [2**InterfaceWIDTH-1:0] Unpacked_Data;
reg [2**InterfaceWIDTH-1:0] Unpacked_Data;
// Internal reg
//--------------// …
// Internal wires
//---------------// …
//Logic behavior
//------------------// …
endmodule
D.10
Verilog Code of Fixed_Huffman_Decoder
module Fixed_Huffman_Decoder (
clk,
rst,
LZSS_Code_Full,
Interface_Data_In_Empty,
Unpacked_Data,
Shift_Interface_Data,
Next_Interface_Data_In,
LZSS_Code_Ready,
LZSS_Code,
Pending_Interface_Count );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter RejectWIDTH = 4;
parameter NO_DATA = 0;
parameter UPDATING = 1;
parameter DECODING = 2;
// Input signals
//-------------input clk;
input rst;
input LZSS_Code_Full;
input Interface_Data_In_Empty;
input [2**InterfaceWIDTH-1:0] Unpacked_Data;
// Output signals
//--------------output Shift_Interface_Data;
output Next_Interface_Data_In;
output LZSS_Code_Ready;
output [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] LZSS_Code;
output [InterfaceWIDTH-1:0] Pending_Interface_Count;
reg Shift_Interface_Data;
reg Next_Interface_Data_In;
reg LZSS_Code_Ready;
reg [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] LZSS_Code;
// Internal variables
//-------------------// …
// Internal wires
//---------------// …
// Logic behavior
//-------------------// …
endmodule
D.11
Verilog Code of LZSS_Codeword_FIFO
module LZSS_Codeword_FIFO (
clk,
rst,
RE,
WE,
Data_In,
Full,
Empty,
Data_Out );
// Parameter definitions
//----------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter FIFOWIDTH = 8;
// Input signals
//----------------------input clk;
input rst;
input RE;
input WE;
input [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] Data_In;
// Output signals
//----------------------output Full;
output Empty;
output [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] Data_Out;
// Internal wires
//----------------------// …
// Module instantiations
//----------------------
`ifdef VENDORFIFO
lpm_fifo u_LPM_FIFO ( … );
`else
Sync_FIFO u_sync_fif0 ( … );
`endif
endmodule
D.12
Verilog Code of LZSS_Expander
module LZSS_Expander (
clk,
rst,
LZSS_Codeword_Empty,
LZSS_Codeword,
Next_LZSS_Codeword,
Data_Restart,
Data_Out_Ready,
Data_Out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter RejectWIDTH = 4;
// Input signals
//-------------input clk;
input rst;
input LZSS_Codeword_Empty;
input [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] LZSS_Codeword;
// Output signals
//--------------output Next_LZSS_Codeword;
output Data_Restart;
output Data_Out_Ready;
output [SymbolWIDTH-1:0] Data_Out;
reg Data_Restart;
reg Data_Out_Ready;
reg [SymbolWIDTH-1:0] Data_Out;
// Internal wires
//--------------// …
// Module Instantiations
//----------------------
Expander_Codeword_Analyzer u_Expander_Codeword_Analyzer ( … );
Expander_Delay_Tree u_Expander_Delay_Tree ( … );
Expander_Dictionary u_Expander_Dictionary ( … );
Expander_Symbol_Generator u_Expander_Symbol_Generator ( … );
endmodule
D.13
Verilog Code of Expander_Codeword_Analyzer
module Expander_Codeword_Analyzer (
clk,
rst,
LZSS_Codeword_Empty,
LZSS_Codeword,
Next_LZSS_Codeword,
Restart,
Enable,
Flag,
Offset_Req,
Symbol );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter RejectWIDTH = 4;
parameter NO_DATA = 0;
parameter ANALYZING = 1;
parameter POINTER_PROCESSING = 2;
parameter READ_MEMORY = 3;
// Input signals
//-------------input clk;
input rst;
input LZSS_Codeword_Empty;
input [MAXWIDTH+DicLEVEL+SymbolWIDTH:0] LZSS_Codeword;
// Output signals
//--------------output Next_LZSS_Codeword;
output Restart;
output Enable;
output Flag;
output [DicLEVEL-1:0] Offset_Req;
output [SymbolWIDTH-1:0] Symbol;
reg Next_LZSS_Codeword;
reg Restart;
reg Enable;
reg Flag;
reg [SymbolWIDTH-1:0] Symbol;
// Internal registers
//------------------// …
// Internal wires
//--------------// …
// Logic behavior
//---------------------// …
endmodule
D.14
Verilog Code of Expander_Delay_Tree
module Expander_Delay_Tree (
clk,
rst,
Restart,
Enable,
Flag,
Symbol,
Restart_Out,
Enable_Out,
Flag_Out,
Symbol_Out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter NumOfDelay = 1;
// Input signals
//-------------input clk;
input rst;
input Restart;
input Enable;
input Flag;
input [SymbolWIDTH-1:0] Symbol;
// Output signals
//--------------output Restart_Out;
output Enable_Out;
output Flag_Out;
output [SymbolWIDTH-1:0] Symbol_Out;
// Internal wires
//--------------// …
// Logic behavior
//--------------------// …
endmodule
D.15
Verilog Code of Expander_Delay_PE
module Expander_Delay_PE (
clk,
rst,
Restart,
Enable,
Flag,
Symbol,
Restart_Out,
Enable_Out,
Flag_Out,
Symbol_Out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
// Input signals
//-------------input clk;
input rst;
input Restart;
input Enable;
input Flag;
input [SymbolWIDTH-1:0] Symbol;
// Output signals
//--------------output Restart_Out;
output Enable_Out;
output Flag_Out;
output [SymbolWIDTH-1:0] Symbol_Out;
reg Restart_Out;
reg Enable_Out;
reg Flag_Out;
reg [SymbolWIDTH-1:0] Symbol_Out;
// Logic behavior
//-------------------// …
endmodule
D.16
Verilog Code of Expander_Dictionary
module Expander_Dictionary (
clk,
rst,
Restart,
Enable,
Offset_Req,
Entry_In,
Dictionary_Data );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
// Input signals
//-------------input clk;
input rst;
input Restart;
input Enable;
input [DicLEVEL-1:0] Offset_Req;
input [SymbolWIDTH-1:0] Entry_In;
// Output signals
//--------------output [SymbolWIDTH-1:0] Dictionary_Data;
reg [SymbolWIDTH-1:0] Dictionary_Data;
// Internal registers
//-------------------// …
// Module instantiation
//---------------------------
`ifdef VENDORRAM
lpm_ram_dp u_LPM_RAM_DP ( … );
`else
Dual_Port_RAM u_RAM_DP ( … );
`endif
// Logic behavior
//-----------------------// …
endmodule
D.17
Verilog Code of Expander_Symbol_Generator
module Expander_Symbol_Generator (
Restart,
Enable,
Flag,
Symbol,
Dictionary_Data,
Data_Restart,
Data_Out_Ready,
Data_Out );
// Parameter definitions
//---------------------parameter SymbolWIDTH = 16;
// Input signals
//-------------input Restart;
input Enable;
input Flag;
input [SymbolWIDTH-1:0] Symbol;
input [SymbolWIDTH-1:0] Dictionary_Data;
// Output signals
//--------------output Data_Restart;
output Data_Out_Ready;
output [SymbolWIDTH-1:0] Data_Out;
reg Data_Restart;
reg Data_Out_Ready;
reg [SymbolWIDTH-1:0] Data_Out;
// Logic behavior
//-------------------// …
endmodule
APPENDIX E
GENERIC MEMORY MODULE VERILOG CODES
This appendix presents the Verilog source codes of the generic dual port
memory and synchronous FIFO designs.
E.1
Verilog Code of Dual_Port_RAM
module Dual_Port_RAM (
data,
rdaddress,
wraddress,
rdclk,
wrclk,
rden,
wren,
q );
parameter DATA_WIDTH = 16 ;
parameter ADDR_WIDTH = 7 ;
parameter NUMWORDS = 2**ADDR_WIDTH;
//--------------Input Ports-----------------------
input rdclk;
input wrclk;
input rden;
input wren;
input [ADDR_WIDTH-1:0] rdaddress;
input [ADDR_WIDTH-1:0] wraddress;
input [DATA_WIDTH-1:0] data;
//--------------Output Ports----------------------
output [DATA_WIDTH-1:0] q;
//--------------Internal variables----------------
reg [DATA_WIDTH-1:0] q;
reg [DATA_WIDTH-1:0] mem [0:NUMWORDS-1];
//--------------Code Starts Here------------------
// Memory Write Block
always @ ( posedge wrclk )
begin
if ( wren )
mem[wraddress] <= data;
end
// Memory Read Block
always @ ( posedge rdclk )
begin
if ( rden )
q <= mem[rdaddress];
end
endmodule
E.2
Verilog Code of Sync_FIFO
module Sync_FIFO (
data,
clock,
aclr,
rdreq,
wrreq,
full,
empty,
usedw,
q );
parameter DATA_WIDTH = 16;
parameter ADDR_WIDTH = 8;
parameter NUMWORDS = 2**ADDR_WIDTH;
//--------------Input Ports-----------------------
input clock;
input aclr;
input rdreq;
input wrreq;
input [DATA_WIDTH-1:0] data;
//--------------Output Ports----------------------
output full;
output empty;
output [ADDR_WIDTH-1:0] usedw;
output [DATA_WIDTH-1:0] q;
//-----------Internal variables-------------------
reg [ADDR_WIDTH-1:0] wr_pointer;
reg [ADDR_WIDTH-1:0] rd_pointer;
reg [ADDR_WIDTH-1:0] usedw;
wire [DATA_WIDTH-1:0] data_ram ;
Dual_Port_RAM #(.DATA_WIDTH(DATA_WIDTH),
.ADDR_WIDTH(ADDR_WIDTH))
u_RAM_DP (.data(data),
.rdaddress(rd_pointer),
.wraddress(wr_pointer),
.rdclk(clock),
.wrclk(clock),
.rden(rdreq),
.wren(wrreq),
.q(data_ram));
always @ ( posedge clock or posedge aclr )
begin : WRITE_POINTER
if ( aclr )
wr_pointer <= 0;
else
begin
if ( wrreq )
wr_pointer <= wr_pointer + 1;
end
end
always @ ( posedge clock or posedge aclr )
begin : READ_POINTER
if ( aclr )
rd_pointer <= 0;
else
begin
if ( rdreq )
rd_pointer <= rd_pointer + 1;
end
end
always @ ( posedge clock or posedge aclr )
begin : STATUS_COUNTER
if ( aclr )
usedw <= 0;
else
begin
// Read but no write.
if ( rdreq && !wrreq && (usedw != 0) )
usedw <= usedw - 1;
// Write but no read.
else if ( wrreq && !rdreq && (usedw != NUMWORDS) )
usedw <= usedw + 1;
end
end
assign full = (usedw == (NUMWORDS-1));
assign empty = (usedw == 0);
assign q = data_ram;
endmodule
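Because Sync_FIFO and Dual_Port_RAM are the only modules reproduced here in full, a short simulation stimulus is useful for confirming their registered-read behaviour. The testbench below is an illustrative sketch written for this appendix (the module name, clock period and test data are assumptions, not part of the original sources); it writes four words into Sync_FIFO with its default parameters, reads them back, and prints q and usedw.
// Sync_FIFO_tb.v -- illustrative testbench sketch only, not part of the thesis sources.
module Sync_FIFO_tb;
  reg         clock = 0;
  reg         aclr  = 1;
  reg         rdreq = 0;
  reg         wrreq = 0;
  reg  [15:0] data  = 0;
  wire        full, empty;
  wire [7:0]  usedw;
  wire [15:0] q;
  integer     i;
  // Device under test, using the default 16-bit data and 256-word configuration.
  Sync_FIFO dut ( .data(data), .clock(clock), .aclr(aclr), .rdreq(rdreq),
                  .wrreq(wrreq), .full(full), .empty(empty), .usedw(usedw), .q(q) );
  always #10 clock = ~clock;               // free-running symbolic clock
  initial begin
    repeat (2) @(posedge clock);           // hold the asynchronous clear for two cycles
    aclr = 0;
    for (i = 0; i < 4; i = i + 1) begin    // write four words
      @(negedge clock);
      data  = 16'h1100 + i;
      wrreq = 1;
    end
    @(negedge clock) wrreq = 0;
    for (i = 0; i < 4; i = i + 1) begin    // read them back; q is registered, so sample after the edge
      @(negedge clock) rdreq = 1;
      @(posedge clock) #1 $display("read %0d: q = %h  usedw = %0d", i, q, usedw);
    end
    @(negedge clock) rdreq = 0;
    @(posedge clock);
    if (empty) $display("FIFO is empty again, as expected");
    $finish;
  end
endmodule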
APPENDIX F
NIOS SYSTEM COMPRESSION PROCESSOR CORE VERILOG CODES
This appendix presents the Verilog source codes of the NIOS system
compression processor core and its sub-modules. First, the design hierarchy is
presented, followed by the Verilog source codes starting from the top level module.
Verilog codes for Compression_Hardware module can be found in Appendix C.
Note, however, that the full source codes are not given here but can be obtained from
the author or the supervisor.
F.1
Design Hierarchy of NIOS Compression Processor Core
Figure F.1: Design hierarchy of NIOS compression processor core
F.2
Verilog Code of Nios_LZSS_Compress_Processor
module Nios_LZSS_Compress_Processor (
clk,
rst,
chipselect,
address,
writedata,
readdata );
// Parameter definitions
//----------------------parameter Bus_Size = 32;
parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
// Input signals
//----------------------input clk;
input rst;
input chipselect;
input [2:0] address;
input [Bus_Size-1:0] writedata;
// Output signals
//----------------------output [Bus_Size-1:0] readdata;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
// Module instantiations
//---------------------------
LZSS_Compress_Avalon_Interface u_LZSS_Compress_Avalon_Interface ( … );
Compression_Hardware u_Compression_Hardware ( … );
endmodule
F.3
Verilog Code of LZSS_Compress_Avalon_Interface
module LZSS_Compress_Avalon_Interface (
clk,
rst,
chipselect,
address,
writedata,
readdata,
LZSS_control,
Reset_LZSS,
LZSS_data,
LZSS_poll_amount_in,
LZSS_Status,
LZSS_poll_amount_out,
LZSS_result );
// Parameter definitions
//----------------------parameter Bus_Size = 32;
// Input signals
//----------------------input clk;
input rst;
input chipselect;
input [2:0] address;
input [2:0] LZSS_Status;
input [12:0] LZSS_poll_amount_out;
input [Bus_Size-1:0] writedata;
input [Bus_Size-1:0] LZSS_result;
// Output signals
//----------------------output Reset_LZSS;
output [3:0] LZSS_control;
output [12:0] LZSS_poll_amount_in;
output [Bus_Size-1:0] readdata;
output [Bus_Size-1:0] LZSS_data;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
LZSS_Interface_Register u_LZSS_Interface_Register ( … );
LZSS_Interface_Controller u_LZSS_Interface_Controller ( … );
endmodule
F.4
Verilog Code of LZSS_Interface_Register
module LZSS_Interface_Register (
clk,
rst,
chipselect,
address,
writedata,
readdata,
LZSS_handshaking_control,
LZSS_data,
LZSS_poll_amount_in,
Reset_LZSS,
LZSS_status,
LZSS_poll_amount_out,
LZSS_result );
// Parameter definitions
//----------------------parameter BusSize = 32;
// Input signals
//----------------------input clk;
input rst;
input chipselect;
input [2:0] address;
input [3:0] LZSS_status;
input [12:0] LZSS_poll_amount_out;
input [BusSize-1:0] writedata;
input [BusSize-1:0] LZSS_result;
// Output signals
//----------------------output Reset_LZSS;
output [3:0] LZSS_handshaking_control;
output [12:0] LZSS_poll_amount_in;
output [BusSize-1:0] readdata;
output [BusSize-1:0] LZSS_data;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// ...
// Logic behavior
//--------------------// ...
endmodule
F.5
Verilog Code of LZSS_Interface_Controller
module LZSS_Interface_Controller (
clk,
rst,
LZSS_handshaking_control,
fetch_done,
LZSS_control );
// Parameter definitions
//----------------------parameter POWER_UP_RST = 0;
parameter LOAD_DATA = 1;
parameter LOAD_POLL_AMOUNT_IN = 2;
parameter LOAD_POLL_AMOUNT_OUT = 3;
parameter LOAD_RESULT = 4;
parameter DONE = 5;
// Input signals
//----------------------input clk;
input rst;
input [3:0] LZSS_handshaking_control;
// Output signals
//----------------------output fetch_done;
output [3:0] LZSS_control;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
// Logic behavior
//----------------------// …
endmodule
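For cross-reference with the firmware in Appendix H, the compression peripheral registers appear at consecutive word offsets from the base address (data, control, poll_amount_in, status, poll_amount_out and result at byte offsets 0x00 to 0x14). The module below is only an illustrative sketch of a read-back decode that is consistent with that map; it is not the thesis implementation of LZSS_Interface_Register, and the register ordering is an assumption inferred from the firmware address definitions.
// LZSS_Readback_Mux_Example.v -- illustrative sketch only; the word-offset ordering is
// assumed from the Appendix H address map, not taken from the original register file.
module LZSS_Readback_Mux_Example (
  clk,
  chipselect,
  address,
  LZSS_status,
  LZSS_poll_amount_out,
  LZSS_result,
  readdata );
  parameter BusSize = 32;
  input                clk;
  input                chipselect;
  input  [2:0]         address;              // word offset within the peripheral
  input  [3:0]         LZSS_status;
  input  [12:0]        LZSS_poll_amount_out;
  input  [BusSize-1:0] LZSS_result;
  output [BusSize-1:0] readdata;
  reg    [BusSize-1:0] readdata;
  // Assumed word offsets: 0 data, 1 control, 2 poll_amount_in (write-only),
  // 3 status, 4 poll_amount_out, 5 result; only the readable registers are decoded here.
  always @ ( posedge clk )
  begin
    if ( chipselect )
      case ( address )
        3'd3   : readdata <= { {(BusSize-4){1'b0}},  LZSS_status };
        3'd4   : readdata <= { {(BusSize-13){1'b0}}, LZSS_poll_amount_out };
        3'd5   : readdata <= LZSS_result;
        default: readdata <= {BusSize{1'b0}};
      endcase
  end
endmodule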
APPENDIX G
NIOS SYSTEM DECOMPRESSION PROCESSOR CORE VERILOG CODES
This appendix presents the Verilog source codes of the NIOS system
decompression processor core and its sub-modules. First, the design hierarchy is
presented, followed by the Verilog source codes starting from the top level module.
Verilog codes for Decompression_Hardware module can be found in Appendix D.
Note, however, that the full source codes are not given here but can be obtained from
the author or the supervisor.
G.1
Design Hierarchy of NIOS Decompression Processor Core
Figure G.1: Design hierarchy of NIOS decompression processor core
G.2
Verilog Code of Nios_LZSS_Decompress_Processor
module Nios_LZSS_Decompress_Processor (
clk,
rst,
chipselect,
address,
writedata,
readdata );
// Parameter definitions
parameter Bus_Size = 32;
parameter SymbolWIDTH = 16;
parameter DicLEVEL = 7;
parameter MAXWIDTH = 5;
parameter InterfaceWIDTH = 5;
parameter PollAmount = 64;
parameter RejectWIDTH = 4;
parameter FIFOWIDTH = 8;
parameter PendingWIDTH = 6;
// Input signals
input clk;
input rst;
input chipselect;
input [2:0] address;
input [Bus_Size-1:0] writedata;
// Output signals
output [Bus_Size-1:0] readdata;
// Internal registers
//-----------------------// …
// Internal wires
//------------------// …
// Module instantiations
//----------------------------
LZSS_Decompress_Avalon_Interface u_LZSS_Decompress_Avalon_Interface ( … );
Decompression_Hardware u_Decompression_Hardware ( … );
endmodule
G.3
Verilog Code of LZSS_Decompress_Avalon_Interface
module LZSS_Decompress_Avalon_Interface (
clk,
rst,
chipselect,
address,
writedata,
readdata,
LZSS_control,
Reset_lzss,
LZSS_data,
LZSS_Status,
LZSS_poll_amount_out,
LZSS_result );
// Parameter definitions
//----------------------parameter Bus_Size = 32;
// Input signals
//----------------------input clk;
input rst;
input chipselect;
input [2:0] address;
input [2:0] LZSS_Status;
input [12:0] LZSS_poll_amount_out;
input [Bus_Size-1:0] writedata;
input [Bus_Size-1:0] LZSS_result;
// Output signals
//----------------------output Reset_lzss;
output [2:0] LZSS_control;
output [Bus_Size-1:0] readdata;
output [Bus_Size-1:0] LZSS_data;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
// Module instantiations
//----------------------
LZSS_Interface_Register_2 u_LZSS_Interface_Register_2 ( … );
LZSS_Interface_Controller_2 u_LZSS_Interface_Controller_2 ( … );
endmodule
G.4
Verilog Code of LZSS_Interface_Register_2
module LZSS_Interface_Register_2 (
clk,
rst,
chipselect,
address,
writedata,
readdata,
LZSS_handshaking_control,
Reset_lzss,
LZSS_data,
LZSS_status,
LZSS_poll_amount_out,
LZSS_result );
// Parameter definitions
//----------------------parameter BusSize = 32;
// Input signals
//----------------------input clk;
input rst;
input chipselect;
input [2:0] address;
input [3:0] LZSS_status;
input [12:0] LZSS_poll_amount_out;
input [BusSize-1:0] writedata;
input [BusSize-1:0] LZSS_result;
// Output signals
//----------------------output Reset_lzss;
output [2:0] LZSS_handshaking_control;
output [BusSize-1:0] readdata;
output [BusSize-1:0] LZSS_data;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
// Logic behavior
//--------------------// …
endmodule
G.5
Verilog Code of LZSS_Interface_Controller_2
module LZSS_Interface_Controller_2 (
clk,
rst,
LZSS_handshaking_control,
fetch_done,
LZSS_control );
// Parameter definitions
//----------------------parameter POWER_UP_RST = 0;
parameter LOAD_DATA = 1;
parameter LOAD_POLL_AMOUNT_OUT = 2;
parameter LOAD_RESULT = 3;
parameter DONE = 4;
// Input signals
//----------------------input clk;
input rst;
input [2:0] LZSS_handshaking_control;
// Output signals
//----------------------output fetch_done;
output [2:0] LZSS_control;
// Internal registers
//----------------------// …
// Internal wires
//----------------------// …
// Logic behavior
//---------------------// …
endmodule
APPENDIX H
HARDWARE TEST SYSTEM FIRMWARE C CODES
This appendix presents the C source codes of the data compression system
hardware evaluation platform firmware. Note, however, that the full source codes
are not given here but can be obtained from the author or the supervisor.
H.1
C Source Code of LZSS_Fixed_Vector
#include "excalibur.h"
void main ()
{
//--------------------------------------
// Define address mapping
//--------------------------------------
volatile int *LZSS_Compress_data              = (int *) 0x000004A0;
volatile int *LZSS_Compress_control           = (int *) 0x000004A4;
volatile int *LZSS_Compress_poll_amount_in    = (int *) 0x000004A8;
volatile int *LZSS_Compress_status            = (int *) 0x000004AC;
volatile int *LZSS_Compress_poll_amount_out   = (int *) 0x000004B0;
volatile int *LZSS_Compress_result            = (int *) 0x000004B4;
volatile int *LZSS_Decompress_data            = (int *) 0x000004C0;
volatile int *LZSS_Decompress_control         = (int *) 0x000004C4;
volatile int *LZSS_Decompress_status          = (int *) 0x000004C8;
volatile int *LZSS_Decompress_poll_amount_out = (int *) 0x000004CC;
volatile int *LZSS_Decompress_result          = (int *) 0x000004D0;
//----------------------------------------------
// Define testvector #1 - random data
//----------------------------------------------
// …
//----------------------------------------------------------
// Define testvector #2 - highly redundant data
//----------------------------------------------------------
// …
//----------------------------
// Variable declaration
//----------------------------
// …
//----------------------------
// Variable initialization
//----------------------------
// …
//--------------------------------------------
// Start LZSS Compression Process
//--------------------------------------------
// Reset LZSS compression module
*LZSS_Compress_control = 0x10;
*LZSS_Compress_control = 0x00;
//Send data to LZSS_processor for compression
while (1)
{
// …
}
//--------------------------------------------
// Start LZSS Decompression Process
//--------------------------------------------
// Reset LZSS decompression module
*LZSS_Decompress_control = 0x08;
*LZSS_Decompress_control = 0x00;
//Send data to LZSS_processor for decompression
while (1)
{
// …
}
//------------------------------------------------------------------------------
// Check results of compression and decompression processing
//------------------------------------------------------------------------------
// …
}
APPENDIX I
DESIGN SIMULATION OUTPUT WAVEFORM
This appendix presents the design simulation results and output waveforms of
the compression and decompression core using Quartus II software.
For the compression processor core, there are eight test sets. Tests #1-4
verify the functionality of the Verilog design by comparing its outputs against those
of the original VHDL version. Tests #5-8 verify the functionality of the generic
dual-port memory and FIFO designs used by the compression processor core.
For the decompression processor core, there are nine test sets. Tests #1-4 and
#5-8 perform the same verification for the decompression core as described above.
Test #9 reproduces the hardware bug of the decompression core and verifies whether
the proposed hardware patch is functionally correct.
Note, however, that the full design simulation results are not given here but
can be obtained from the author or the supervisor.
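The tables in this appendix record the VHDL-versus-Verilog and IP-versus-generic-memory comparisons described above. As an aside, the same output comparison can also be automated in simulation; the fragment below is an illustrative sketch only (the module and signal names are assumptions) and is not one of the thesis testbenches.
// Output_Compare_Example.v -- illustrative self-checking fragment, not part of the thesis testbenches.
module Output_Compare_Example (
  clk,
  data_valid,
  dut_data,
  ref_data );
  parameter WIDTH = 32;
  input              clk;
  input              data_valid;   // asserted when both runs present an output word
  input  [WIDTH-1:0] dut_data;     // word produced by the design under test
  input  [WIDTH-1:0] ref_data;     // corresponding word from the reference run
  always @ ( posedge clk )
  begin
    if ( data_valid && (dut_data !== ref_data) )
      $display("MISMATCH at %0t: dut = %h  ref = %h", $time, dut_data, ref_data);
  end
endmodule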
I.1
Compression Processor Core Test #1
I.1.1
Test Stimulus Description
Input Data i.e. Interface_Data_In = 0x22221111, 0x11111111, 0x11112222,
0x22221111, 0x3333
Input Data Size i.e. Poll_Amount_In: 0x80, 0x10
I.1.2
Simulation Waveform of VHDL Design
Figure I.1: Simulation waveform of VHDL design compression core test #1
I.1.3
Simulation Waveform of Verilog Design
Figure I.2: Simulation waveform of Verilog design compression core test #1
I.1.4
Simulation Result Comparison
Table I.1: Simulation result comparisons between VHDL and Verilog design for
compression core test #1
                                  VHDL          Verilog       Note
Compressed Data Size,             0x53, 0x00    0x53, 0x00    Both HDL designs produce the
Poll_Amount_Out                                               same compressed data size
Compressed Data,                  0x11111110,   0x11111110,   Both HDL designs produce the
Interface_Data_Out                0x98413C05,   0x98413C05,   same compressed data output
                                  0xFFF87F99    0xFFF87F99
I.2
Compression Processor Core Test #2 - #4
// …
I.3
Compression Processor Core Test #5
I.3.1
Test Stimulus Description
Input Data i.e. Interface_Data_In = 0x22221111, 0x11111111, 0x11112222,
0x22221111, 0x3333
Input Data Size i.e. Poll_Amount_In: 0x80, 0x10
I.3.2
Simulation Waveform of Verilog Design Using IP Memories
Figure I.3: Simulation waveform of Verilog design using IP memories compression
core test #5
I.3.3
Simulation Waveform of Verilog Design Using Generic Memories
Figure I.4: Simulation waveform of Verilog design using generic memories
compression core test #5
I.3.4
Simulation Result Comparison
Table I.2: Simulation result comparisons between IP and generic memory design for
compression core test #5
                                  IP Memory     Generic Memory   Note
Compressed Data Size,             0x53, 0x00    0x53, 0x00       Both memory designs produce the
Poll_Amount_Out                                                  same compressed data size
Compressed Data,                  0x11111110,   0x11111110,      Both memory designs produce the
Interface_Data_Out                0x98413C05,   0x98413C05,      same compressed data output
                                  0xFFF87F99    0xFFF87F99
I.4
Compression Processor Core Test #6 - #8
// …
I.5
Decompression Processor Core Test #1
I.5.1
Test Stimulus Description
Input Data i.e. Interface_Data_In = 0x11111110, 0x98413C05, 0xFFF87F99
I.5.2
Simulation Waveform of VHDL Design
Figure I.5: Simulation waveform of VHDL design decompression core test #1
I.5.3
Simulation Waveform of Verilog Design
Figure I.6: Simulation waveform of Verilog design decompression core test #1
I.5.4
Simulation Result Comparison
Table I.3: Simulation result comparisons between VHDL and Verilog design for
decompression core test #1
                                  VHDL          Verilog       Note
Restored Data Size,               0x80, 0x10    0x80, 0x10    Both HDL designs produce the
Poll_Amount_Out                                               same restored data size
Restored Data,                    0x22221111,   0x22221111,   Both HDL designs produce the
Interface_Data_Out                0x11111111,   0x11111111,   same restored data output
                                  0x11112222,   0x11112222,
                                  0x22221111,   0x22221111,
                                  0x00003333    0x00003333
I.6
Decompression Processor Core Test #2 - #4
// …
I.7
Decompression Processor Core Test #5
I.7.1
Test Stimulus Description
Input Data i.e. Interface_Data_In = 0x11111110, 0x98413C05, 0xFFF87F99
I.7.2
Simulation Waveform of Verilog Design Using IP Memories
Figure I.7: Simulation waveform of Verilog design using IP memories
decompression core test #5
I.7.3
Simulation Waveform of Verilog Design Using Generic Memories
Figure I.8: Simulation waveform of Verilog design using generic memories
decompression core test #5
I.7.4
Simulation Result Comparison
Table I.4: Simulation result comparisons between IP and generic memory design for
decompression core test #5
                                  IP Memory     Generic Memory   Note
Restored Data Size,               0x80, 0x10    0x80, 0x10       Both memory designs produce the
Poll_Amount_Out                                                  same restored data size
Restored Data,                    0x22221111,   0x22221111,      Both memory designs produce the
Interface_Data_Out                0x11111111,   0x11111111,      same restored data output
                                  0x11112222,   0x11112222,
                                  0x22221111,   0x22221111,
                                  0x00003333    0x00003333
I.8
Decompression Processor Core Test #6 - #8
// …
I.9
Decompression Processor Core Test #9
I.9.1
Test Stimulus Description
Input Data i.e. Interface_Data_In = 0x001F0D0C, 0xFFFFFC3F
I.9.2
Simulation Waveform of VHDL Design
Figure I.9: Simulation waveform of VHDL design decompression core test #9
I.9.3
Simulation Waveform of Verilog Design
Figure I.10: Simulation waveform of Verilog design decompression core test #9
I.9.4
Simulation Result Comparison
Table I.5: Simulation result comparisons between VHDL and Verilog design for
decompression core test #9
                      Source Data   VHDL          Verilog       Note
Restored Data Size,   0x80, 0x10    0x80, 0x10    0x80, 0x10    Both HDL designs produce the
Poll_Amount_Out                                                 same restored data size
Restored Data,        0x61616161,   0x61616161,   0x61616161,   The VHDL design does not produce
Interface_Data_Out    0x61616161,   0xXXXXXXXX,   0x61616161,   the correct restored data compared
                      0x61616161,   0xXXXXXXXX,   0x61616161,   to the original source. However,
                      0x61616161,   0xXXXXXXXX,   0x61616161,   the Verilog design does produce
                      0x00006161    0x0000XXXX    0x00006161    the correct restored data compared
                                                                to the original source.
APPENDIX J
HARDWARE TEST DETAILED RESULTS
This appendix presents the detailed results of the data compression system
test using the hardware evaluation platform. Two sets of data are used in the test.
The first data set is a stream of random symbols, while the second is a stream of
highly redundant symbols. From each data set, 10 test stimuli are generated by choosing
a different data size for each stimulus. Each input test stimulus is first compressed,
and the compressed data are then decompressed back to their original form. Therefore,
the output of the hardware test should match the input test stimulus if the data
compression system hardware is functioning correctly.
Note, however, that the full hardware evaluation test results are not given
here but can be obtained from the author or the supervisor.
J.1
Random Data Hardware Test #1
Figure J.1: Compression system hardware test result for random data test #1
J.2
Random Data Hardware Test #2 - #10
// …
J.3
Redundant Data Hardware Test #1
Figure J.11: Compression system hardware test result for redundant data test #1
J.4
// …
Redundant Data Hardware Test #2 - #10