The Erlangen Slot Machine: An FPGA-Based Partially Reconfigurable Computer
Submitted to the Faculty of Engineering (Technische Fakultät) of the
Universität Erlangen-Nürnberg
for the degree of
DOKTOR-INGENIEUR
by
Mateusz Majer
Erlangen 2011

Approved as a dissertation by the Faculty of Engineering of the
Universität Erlangen-Nürnberg
Date of submission: 19.10.2010
Date of doctoral examination: 20.01.2011
Dean: Prof. Dr.-Ing. Reinhard German
Reviewers: Prof. Dr.-Ing. Jürgen Teich
           Prof. Dr. Dr. h.c. mult. Manfred Glesner
Abstract
Partial reconfiguration is a special case of device configuration that allows
only parts of a hardware circuit to be changed at run-time. Only a predefined
region of an FPGA is updated while the remainder of the device continues to
operate undisturbed. This is especially valuable when a device operates in a
mission-critical environment and cannot be disrupted while a subsystem is
redefined for performance or flexibility reasons. Partial reconfiguration can
be triggered by user requests, detected changes in environmental factors, or
operating system scheduling. It offers a novel possibility to dynamically load
and execute hardware modules, a capability previously known only for software
modules.

Partial reconfiguration increases computational flexibility and efficiency by
time-sharing the existing memory and logic resources on the device. Using
partial reconfiguration, the functionality of a single FPGA is increased,
allowing fewer or smaller FPGA devices to be used. Embedded systems using FPGAs
that support partial reconfiguration can have their hardware customized at
run-time. However, the design flow and peripheral I/O architectures of these
devices are not ideally suited for run-time reconfigurable application
development. Therefore, the benefits of partial reconfiguration in hardware
designs are currently seen as limited.

The Erlangen Slot Machine (ESM) is introduced as a new FPGA-based dynamically
reconfigurable computer architecture supporting run-time customization through
the use of partial reconfiguration at its architectural level. It was built
within the DFG priority program 1148 Reconfigurable Computing, and its main
goals are:
• making partially reconfigurable designs viable for real-world applications,
• providing operating system support for scheduling, placement and run-time
  reconfiguration of partially reconfigurable modules,
• providing tool support for the development of run-time reconfigurable
  computation and communication modules using new inter-module communication
  paradigms, and
• providing a platform for interdisciplinary research on algorithms, methods,
  and applications using run-time reconfiguration.

Its architectural support for partially reconfigurable modules simplifies the
design and evaluation of modular and partially reconfigurable applications. Its
key benefit is the decoupling of all peripheral I/O pins from the FPGA through
the use of an external crossbar. This feature enables flexible signal routing
to any reconfigurable region on the FPGA and effectively decouples the
peripheral I/Os from the fixed FPGA pins. Moreover, it provides a flexible
platform for run-time allocation models, real-time aspects and operating
systems research for run-time reconfigurable systems.

The design flow tool SlotComposer automates the creation of partially
reconfigurable modules. It allows the automated insertion of inter-module
communication structures. Moreover, it aids partial module placement with
graphical visualization and creates design flow scripts for partial bitstream
synthesis.

As an application example using partial run-time reconfiguration, an advanced
video application was implemented on the ESM platform. To support real-time
video processing in the application, methods for hardware-software
communication, hardware task placement, inter-module communication and
decoupled peripheral I/O access were analyzed and implemented for use on the
ESM platform.
German Title and Summary
The Erlangen Slot Machine: A Partially Reconfigurable FPGA-Based Computer
Architecture
Summary
Partial reconfiguration is a special case of FPGA configuration in which a
predefined FPGA region is loaded with a new circuit at run-time while the
remaining regions of the FPGA are not disturbed. This is especially desirable
when devices operate in a critical environment and their ongoing operation must
not be interrupted. In this case, partial reconfiguration allows the circuits
of subsystems to be exchanged during operation in order to improve the
efficiency and flexibility of the circuit in response to changing requirements
or varying environmental factors.

The use of partial reconfiguration increases the functionality and flexibility
of a single FPGA, so that smaller and thus cheaper FPGA devices can be used.
Embedded systems with FPGAs could thereby be adapted to changing requirements
in real time during operation, so that the implementation of different
requirements can be consolidated in a single device. However, different modules
place different requirements on the I/O and memory interfaces, which current
FPGA platforms do not take into account and which therefore complicate the
development of reconfigurable applications. These restrictions have led to only
a few examples demonstrating the practical applicability of partial
reconfiguration.

The Erlangen Slot Machine (ESM) is a novel FPGA-based, dynamically
reconfigurable computer architecture that was consistently designed for the use
of partial reconfiguration. Its flexible architecture simplifies the
development and evaluation of modular and partially reconfigurable hardware
designs. Its major advantage is the decoupling of all peripheral I/O pins
through the use of an external crossbar. The crossbar enables flexible signal
distribution to every reconfigurable region on the FPGA, whereby the peripheral
I/Os are decoupled from the physical FPGA pins. In addition, the ESM offers a
flexible platform for the development and analysis of scheduling, placement
methods and real-time operating systems for run-time reconfigurable FPGA
systems in general.

The design flow tool SlotComposer realizes the automatic creation of partially
reconfigurable modules. It enables the automated insertion of communication
links between partial modules, the graphical placement of partial modules, and
the creation of design flow scripts for the configuration data synthesis of the
partial modules.

As an application example for partial reconfiguration, an advanced video
application, a driver assistance system for detecting vehicles ahead and lane
markings, was fully implemented on the ESM platform. To support real-time video
processing with partially reconfigurable video filters, methods for
hardware-software communication, module placement, inter-module communication
and access to the I/O pins of the peripheral interfaces were developed.
Acknowledgments
First and foremost, I would not have begun nor been able to complete this work
without the love, support, and encouragement of my partner Meline, my family
and my friends. Without them, this dissertation would not have been possible.

Moreover, I am indebted to my PhD adviser Prof. Jürgen Teich for supporting
this exciting course of research and for advising on this dissertation. His
vision, enthusiasm, and expertise motivated me as much as I benefited from his
open support for the Erlangen Slot Machine endeavor. Thanks to my external
committee members Prof. Manfred Glesner, Prof. Robert Weigel, and
Prof. Wolfgang Schröder-Preikschat. Moreover, special thanks go to
Prof. Sándor Fekete and Jan van der Veen for their assistance and great
collaboration on the conceptual part of the Erlangen Slot Machine and the
algorithmic part of the ReCoNodes project.

I have had a great deal of assistance from the staff, students and visitors of
the Department of Computer Science 12. In particular, I thank Hritam Dutta,
Josef Angermeier, Ali Ahmadinia, Christophe Bobda, Jan van der Veen, Dirk Koch
and Thilo Streichert for reviewing, discussing and helping me to clarify many
aspects of this work. Big thanks also go to Ulrich Batzer, Matthias Kovatsch,
Jan Grembler, André Linarth and Thomas Haller, without whom my work would not
exist in this form.

Furthermore, this work was supported by DFG grant TE 163/14-2, project
ReCoNodes [1, 2], funded within the priority program 1148, Reconfigurable
Computing Systems [3]. I would also like to acknowledge the DFG for providing
additional support to build 20 prototypes of the ESM boards. Special thanks
also go to Patrick Lysaght at Xilinx for his great support.

As the development of the Erlangen Slot Machine platform [4] was a huge task,
it would have been impossible without joint work in different fields:

Name                  Contribution                         Reference
Ulrich Batzer         Taillight recognition demonstrator   [5]
Matthias Kovatsch     Taillight recognition demonstrator   [6]
Bruno Kleinert        Reconfiguration manager driver       [7]
Thomas Stark          Crossbar software driver             [8]
Plamen Shterev        SlotComposer design flow             [9]
Jan Grembler          Video demonstrator                   [10]
Christian Freiberger  Reconfiguration manager              [11]
Felix Reimann         RMB communication                    [12]
Peter Asemann         PowerPC board support package        [13]
André Linarth         ESM Motherboard                      [14]
Thomas Haller         ESM Babyboard                        [15]

I feel indebted to all persons involved in this great project and would like to
thank them again for their great work.
Mateusz Majer
München, July 2010
Contents
Abstract
German Title and Summary
Acknowledgments
1. Introduction
   1.1. Motivation
   1.2. Contributions
   1.3. Overview
2. Background
   2.1. What is Reconfigurable Computing?
   2.2. Reconfigurable Hardware
        2.2.1. Fine-Grained Architectures
        2.2.2. Coarse-Grained Architectures
        2.2.3. Configurable Processors
        2.2.4. Related Computing Platforms
   2.3. Partial Reconfiguration
   2.4. Technical Advantages and Limitations
3. The Erlangen Slot Machine
   3.1. Introduction
   3.2. Communication Models
   3.3. Implemented Architecture
   3.4. The Babyboard
        3.4.1. Main FPGA
        3.4.2. The Reconfiguration Manager
   3.5. The Motherboard
        3.5.1. PowerPC
        3.5.2. Crossbar
        3.5.3. Video Input
        3.5.4. Video Output
4. Development of Partially Reconfigurable Modules
   4.1. Introduction
   4.2. Partial Design Flow
   4.3. The SlotComposer
   4.4. Operating System Framework
   4.5. Real-time Reconfigurable Hardware Task Management
        4.5.1. Hardware Task Generation
        4.5.2. Design Flow
   4.6. Hardware Interfaces for Video Processing
        4.6.1. Overview
        4.6.2. HW/SW Communication
        4.6.3. Video Input
        4.6.4. Video Output
        4.6.5. Memory Interfaces
5. Application Scenarios and Use Cases
   5.1. Introduction
   5.2. Real-Time Video Processing on the ESM
        5.2.1. Data Flow
        5.2.2. Main FPGA Partitioning
   5.3. Implemented Video-Engines
        5.3.1. Basic Video Filters
        5.3.2. Edge-Engine
        5.3.3. Taillight-Engine
   5.4. A Point-Based Rendering Application
        5.4.1. Background
        5.4.2. Rendering Pipeline
        5.4.3. Implementation Results
6. Conclusions
   6.1. Summary of Contributions
   6.2. Interdisciplinary Research Platform
   6.3. Future Work
A. Glossary
B. Technical Specification of the ESM
List of Figures
List of Tables
Bibliography
Curriculum Vitae

1. Introduction
Over the years, embedded systems designers have used different approaches to
design systems in ways that optimize and customize hardware to fit the specific
requirements of the application they are developing. These approaches fall into
three categories: software, reconfigurable hardware and user-specified
hardware.

Reconfigurable hardware devices are hardware devices in which the functionality
of the logic gates is customizable at run-time. The connections between the
logic gates are also configurable. Memories are used as look-up tables to
implement the universal gates, and are also used to control the configuration
of the switches in the interconnection network. The program that specifies the
functionality of each gate and the state of each switch is called a
configuration.

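The look-up-table idea described above can be sketched in a few lines of Python. This is only an illustrative model, not vendor code: the class name `Lut2`, the restriction to two inputs, and the bit ordering of the configuration are assumptions made for the example.

```python
class Lut2:
    """Illustrative 2-input look-up table: four configuration bits define the gate."""

    def __init__(self, config_bits):
        # Assumed ordering: config_bits[2*a + b] is the output for inputs (a, b).
        if len(config_bits) != 4 or any(bit not in (0, 1) for bit in config_bits):
            raise ValueError("a 2-input LUT needs exactly four 0/1 bits")
        self.config_bits = list(config_bits)

    def evaluate(self, a, b):
        # The inputs only form an address into the small memory.
        return self.config_bits[2 * a + b]

    def reconfigure(self, config_bits):
        # Loading a new configuration changes the implemented function.
        self.__init__(config_bits)


AND_CONFIG = [0, 0, 0, 1]
XOR_CONFIG = [0, 1, 1, 0]

lut = Lut2(AND_CONFIG)
and_table = [lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)]
lut.reconfigure(XOR_CONFIG)
xor_table = [lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)]
```

The same memory behaves as an AND gate or an XOR gate depending only on the four stored bits, which is the sense in which a configuration determines the functionality of each gate.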
Field-Programmable Gate Arrays (FPGAs) are the most common type of
reconfigurable hardware devices. Their functionality is set at power-up and can
be changed during run-time.

User-specified hardware is used to create custom physical silicon to implement
the target device. This ranges from a minimal effort such as a gate array to a
fully customized device with handcrafted features, known as Application
Specific Integrated Circuits (ASICs). Their functionality is set during
manufacturing and is immutable. However, the long development process and very
high setup costs restrict this approach to high-volume applications.

A relatively new approach for compute-intensive applications is stream
computing. It uses parallel programming languages that target massively
parallel processor arrays such as Graphics Processing Units (GPUs). However,
this approach is currently not suited for embedded applications because its
power consumption of more than 100 Watts is too high for most embedded
systems [16, 17].

Applications implemented in hardware devices are efficient for concurrent
workloads, which is achieved by using multiple parallel processing blocks.
Coupled with the flexibility that allows the embedded systems designer to
tailor the device to match the application's demands as closely as possible,
hardware devices achieve the highest possible throughput. The per-block power
of an FPGA may now well be below that of DSPs, even though the chip-level power
dissipation is higher. DSPs typically consume 3-4 Watts and FPGAs 7-10 Watts,
but FPGAs can often handle 10x the processing load by using multiple parallel
processing blocks [18].

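As a rough worked example of the power argument above, the following Python sketch divides chip power by the relative processing load. The use of range midpoints and the function name `power_per_unit_load` are assumptions for illustration; only the 3-4 Watt, 7-10 Watt and 10x figures come from the text.

```python
def power_per_unit_load(chip_power_watts, relative_load):
    """Chip power divided by the processing load handled (one DSP's load = 1)."""
    return chip_power_watts / relative_load


# Assumption: take the midpoint of each quoted power range.
dsp_power_watts = (3 + 4) / 2
fpga_power_watts = (7 + 10) / 2

dsp_watts_per_load = power_per_unit_load(dsp_power_watts, 1)
fpga_watts_per_load = power_per_unit_load(fpga_power_watts, 10)

# How many times less power the FPGA spends per unit of processing load.
advantage = dsp_watts_per_load / fpga_watts_per_load
print(dsp_watts_per_load, fpga_watts_per_load, round(advantage, 1))
```

Under these assumptions the FPGA dissipates more power at chip level but roughly a quarter of the DSP's power per unit of load, which is the point the paragraph makes.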
Reconfigurable Computing (RC) started with the advent of FPGAs and
hardware-oriented design languages like VHDL and Verilog. They enable a 10x to
100x gain over a conventional microprocessor in performance and functional
density (operations per area-time) [18]. The advantage of reconfigurable
computing comes from highly parallel data paths and post-production
programmability, which allows data flows to be highly specialized to the
application. Moreover, partial dynamic reconfiguration enables run-time
specialization, which brings software-like flexibility to the hardware domain.

Reconfigurable architectures can re-adapt the behavior of their hardware
resources to a specific computation that needs to be performed. Computing using
reconfigurable architectures provides an alternative paradigm for utilizing the
available logic resources on the chip, analogous to software multithreading.
However, the performance gains obtained by the use of reconfigurable devices
are limited because development complexity and system integration costs
increase. Moreover, programming hardware devices remains difficult, usually
requiring a hardware-oriented language such as Verilog or VHDL. Hardware
solutions can take an order of magnitude longer to code and verify than
software solutions, which raises development costs and increases time to
market. New high-level languages like Impulse-C or Mitrion-C can shorten the
development time, but they need further development to match VHDL's
efficiency [19].

All user-programmable features inside an FPGA are controlled by memory cells
that are volatile and, therefore, must be configured on power-up. These memory
cells are known as the configuration memory, and define the Look-Up Tables
(LUTs), signal routing, Input/Output Blocks' (IOBs) voltage standards, and all
other aspects of the design. To program the configuration memory, instructions
for the configuration control logic and data for the configuration memory are
provided in the form of a synthesized bitstream. Once an FPGA is programmed, it
can then be partially reconfigured using a partial bitstream.

Figure 1.1.: The architecture of the Xilinx Virtex family of FPGAs allows
design modules to be swapped on-the-fly using a Partial Reconfiguration (PR)
methodology [20, 21]. Each partial module is placed in a predefined area called
a PR region. This allows multiple design modules to time-share resources on a
single device, while the base design and all external links continue to operate
uninterrupted.

Partial Reconfiguration (PR) is useful for systems with multiple functions that
can time-share the same FPGA device resources. In such systems, one section of
the FPGA continues to operate, while other sections of the FPGA are disabled
and partially reconfigured to provide new functionality. Partial
reconfiguration is thus used to change the structure of one part of an FPGA
design, while the rest of the device continues to operate undisturbed. This is
analogous to the situation where a microprocessor manages context switching
between software processes. In the case of partial reconfiguration of an FPGA,
it is the hardware logic that is being switched. Partial reconfiguration
provides an advantage over repeated full reconfigurations in applications that
require continuous operation, which is not otherwise possible while the device
is being reconfigured. One example, illustrated in Figure 1.1, is a software
defined radio system. Because of the environment in which this application
operates, signals from radio and video links need to be preserved, but at the
same time the data processing format requires updates and changes during
operation. The underlying premise of this thesis is that with partial
reconfiguration, the system can maintain these real-time links while other
modules within the FPGA are changed on-the-fly [20, 21].

The reconfiguration process can be classified by whether the whole device is
programmed as one entity only once (static full reconfiguration), or whether
just parts of the device are reconfigured at run-time (partial
reconfiguration). After a reconfiguration, a certain time elapses before the
FPGA is operational again; this is often called the reconfiguration time. These
different reconfiguration modes are illustrated in Figure 1.2. The partial
reconfiguration of individual slots achieves higher flexibility and reduces
reconfiguration times (gray areas).

[Figure 1.2: timelines of slot usage (Slot 1 to Slot 6 over time) for a) static
full reconfiguration with a single design A, b) run-time full reconfiguration
alternating between designs A1 and A2, and c) run-time partial reconfiguration
with modules M1 to M10 loaded into individual slots. The legend distinguishes
reconfiguration time from execution time.]
Figure 1.2.: Different reconfiguration modes supported by the ESM platform: a)
static full reconfiguration, b) run-time full reconfiguration, and c) run-time
partial reconfiguration.
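
The difference between full and partial reconfiguration can be made concrete with a small back-of-the-envelope model in Python. It is a sketch under stated assumptions only: the 60 ms full-device reconfiguration time is a hypothetical number, and the model assumes that reconfiguration time scales linearly with the number of slots being rewritten; neither is a measurement of the ESM.

```python
NUM_SLOTS = 6
FULL_RECONFIG_MS = 60.0  # hypothetical time to rewrite the whole device


def reconfig_time_ms(slots_reconfigured):
    # Assumption: time grows linearly with the reconfigured area.
    return FULL_RECONFIG_MS * slots_reconfigured / NUM_SLOTS


def idle_slot_time_ms(slots_reconfigured, full_device):
    """Slot-milliseconds during which slots cannot execute their modules."""
    if full_device:
        # Every slot is stopped while the complete device is rewritten.
        return NUM_SLOTS * reconfig_time_ms(NUM_SLOTS)
    # Only the affected slots stop; the other slots keep executing.
    return slots_reconfigured * reconfig_time_ms(slots_reconfigured)


# Exchanging a module that is two slots wide:
full_idle = idle_slot_time_ms(2, full_device=True)
partial_idle = idle_slot_time_ms(2, full_device=False)
print(full_idle, partial_idle)
```

In this toy model, replacing a two-slot module costs 360 slot-milliseconds of lost execution with full reconfiguration but only 40 with partial reconfiguration, which mirrors the smaller gray areas in part c) of the figure.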
1.1. Motivation
Despite the announcements made by several companies in the last couple of years
about the design and production of new and mostly coarse-grained reconfigurable
chips [22, 23, 24], the dominant part of today's reconfigurable computing
platforms is still fine-grained and FPGA-based.

The growing capacities provided by FPGAs as well as their partial
reconfiguration capabilities allow them to implement complex digital designs.
Xilinx FPGAs [25, 26, 27, 28] combine the advantages of large capacity and the
ability to support partial reconfiguration. The Virtex-II series offers enough
logic for efficiently implementing applications with high resource demands,
e.g., those arising in video, audio and signal processing as well as in other
fields like automotive applications.

There are, however, open problems concerning module relocation. In order to
connect a module to other modules and/or pins, signals are often required to
pass through other modules. Signals that are used by a given module and cross
other modules are called feed-through signals. Using feed-through lines to
access resources has, however, two negative consequences, as illustrated in
Figure 1.3:
• Difficulty of design automation: Each module must be implemented with all
  possible feed-through channels needed by other modules. Because it is only
  known at run-time which module needs to feed through a signal, many channels
  reserved for a possible feed-through become redundant.
• Relocation of modules: Modules accessing external pins are no longer
  relocatable, because they are compiled for fixed locations where a direct
  signal line to these pins is established.
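
The cost of feed-through signals can be illustrated with a short Python sketch that counts, for each slot in a row of slots, how many signals merely pass through it. The function name `feed_through_demand`, the one-dimensional slot model and the example connections are assumptions for illustration, not part of the ESM tool flow.

```python
def feed_through_demand(num_slots, connections):
    """Count, per slot, the signals that cross it without terminating there.

    connections: list of (source_slot, destination_slot, signal_count),
    with slots numbered from 1 and arranged in a single row.
    """
    demand = {slot: 0 for slot in range(1, num_slots + 1)}
    for source, destination, signal_count in connections:
        low, high = sorted((source, destination))
        # Every slot strictly between the two endpoints must feed the signals through.
        for slot in range(low + 1, high):
            demand[slot] += signal_count
    return demand


# Hypothetical design: slot 1 sends 8 signals to slot 3 and 4 signals to slot 4.
demand = feed_through_demand(4, [(1, 3, 8), (1, 4, 4)])
print(demand)
```

A module compiled for slot 2 in this example must reserve twelve feed-through channels that serve only its neighbors, and a module with a different feed-through layout cannot take its place.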
Many FPGA-based reconfigurable platforms such as [29, 30, 31, 32] offer various
interfaces for audio, video capturing and rendering, and for communication.
However, each interface is connected to the FPGA using dedicated pins at fixed
locations. Modules with access to a given interface such as a VGA input port
must be placed in the area of the chip where the FPGA signals are connected. If
the input or output signals are not grouped together, then the relocation of
these modules becomes impossible. Until now, no platform on the market has
provided a solution to these problems.

Figure 1.3.: The feed-through line problem with relocatable modules. Placing a
new module B into slot two requires that the new module provides all
feed-through lines needed by slots one and three. This prevents any module
relocation and makes it impossible to place modules with different feed-through
requirements into the other slots.

The most important problems limiting the use of partial and dynamic
reconfiguration are:
• limited support for partial reconfiguration,
• the I/O-pin dilemma,
• the inter-module communication dilemma, and
• the local memory dilemma.
These limits of existing FPGA-based reconfigurable computers are explained in
detail in Section 3.4:

Very few FPGAs allowing partial reconfiguration exist on the market. These few
FPGAs, like the Virtex series by Xilinx [25], nonetheless impose some
restrictions on the minimum amount of resources that can be reconfigured at a
time, for example column-wise reconfiguration.

Many existing platforms include I/O peripherals like video, RAMs, audio, ADC
(analog to digital converter) and DAC (digital to analog converter) connected
at fixed pins of the FPGA device. As a consequence of these pin constraints,
partial reconfiguration may be difficult or even impossible, because a new
module can require access to different I/O pins. Another problem related to
pins is that the pins belonging to a given logical group, like video and audio
interfaces, are not situated close to each other. On many platforms, they are
spread around the device. A module accessing an external device will have to
feed many lines through many different components. This situation is
illustrated in Figure 1.4: Two modules (one of which is a VGA module) are
implemented. The VGA module uses a large number of pins at the bottom part of
the device and also on the right-hand side. Implementing a module without
feed-through lines is only possible on the first two columns on the left-hand
side. The effort needed for implementing a reconfigurable module on more than
two columns together with the VGA module is very high. FPGA development boards
from Celoxica Ltd. [29], Alpha Data Ltd. [30], XESS Corp., and Nallatech Inc.
all exhibit the same limitations. On the XF-Board [33, 32] from ETH Zurich, the
peripherals are connected to one side of the device. Each module accesses I/Os
through an operating system (OS) layer implemented on the left and right part
of the device. Many other existing platforms like the RAPTOR board [34] and the
Celoxica RC1000 and RC2000 [29] are PCI systems that require a workstation for
operation. They cannot be used in stand-alone systems, as needed in many
embedded systems.

Modules placed at run-time on the device typically need to exchange data among
each other. Such a request for communication is dynamic due to run-time module
placement. Dynamically routing signal lines on the hardware is a very
cumbersome task. For efficiency reasons, new communication paradigms must be
investigated to support such dynamic connection requests, for example
packet-based DyNoCs [35] or principles of self-circuit routing.

Modules requiring large amounts of local memory cannot be implemented since a
module can only occupy the memory inside its physical slot boundary. Storing
data in off-chip memories is therefore the only solution. However, existing
FPGA-based platforms often have only one or two external memory banks and their
pin connections are spread over the borders of the FPGA.

The design and implementation of a reconfigurable computing platform poses many
challenging problems. The motivation to research these problems is reflected in
the following topics and tasks, especially through related and relevant
research questions:
Hardware Support: How should the hardware device be partitioned so that
multiple independent tasks can execute? Can multiple I/O streams be supported?
How hard is it to access external memory? How is configuration data managed and
who is controlling the reconfiguration process?

Task Design: Each task has to communicate with external I/Os or with other
running tasks. How does the task development process support inter-module and
external communication? What are global requirements for supporting arbitrary
task placement? How should relocatable tasks be designed so they do not
interfere with neighboring tasks? How can tools automate the development
process?

OS Framework: Basic operating system services are needed for run-time
scheduling and placement. How can the overhead of the operating system be
minimized? Should the operating system itself be a hardware task or run on a
separate microprocessor? Additionally, the lack of advanced software tools is a
significant bottleneck in application development with partial reconfiguration
support.

Figure 1.4.: Pin distribution of a VGA module on the RC200 platform. It can be
seen that the VGA module occupies pins on the bottom and right FPGA borders. In
consequence, only a narrow part on the left side is available for dynamic
module reconfiguration.

To support parallel execution of hardware tasks analogous to software
multitasking, a well-defined methodology for the development of these tasks has
to be established. At the operating system level, hardware resources, caching
of configuration data for each hardware task, access to global memory, and
communication resources must be efficiently managed. These challenging problems
generate many questions that need to be answered in order to enable the
creation of a flexible reconfigurable platform, as illustrated in Figure 1.5.
The operating system manages the reconfigurable hardware by providing an
abstraction layer between task requests and the reconfigurable hardware device.
Each task request fetches the corresponding module configuration from a module
database. The scheduler determines the exact point in time for the module to be
loaded into the hardware device [36]. However, this can only be performed if
the placer can find a free region that the module fits into. Moreover, the
number of usable free regions can be further limited by fragmentation of the
device area, which is caused by frequent loading of new modules [37].

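The interaction between the placer and fragmentation can be sketched with a simple first-fit search over a row of slots. First-fit is used here only as a stand-in technique for illustration; it is not claimed to be the placement method of [36, 37], and the names `first_fit` and `place` are hypothetical.

```python
def first_fit(occupied, width):
    """Return the leftmost start index of `width` contiguous free slots, or None."""
    run_start, run_length = None, 0
    for index, busy in enumerate(occupied):
        if busy:
            run_start, run_length = None, 0
            continue
        if run_start is None:
            run_start = index
        run_length += 1
        if run_length == width:
            return run_start
    return None


def place(occupied, width):
    """Mark the slots found by first_fit as occupied and return the start index."""
    start = first_fit(occupied, width)
    if start is None:
        return None  # the scheduler has to delay or reject the task
    for index in range(start, start + width):
        occupied[index] = True
    return start


# Fragmented device: three slots are free, but no two of them are adjacent.
fragmented = [True, False, True, False, True, False]
free_total = fragmented.count(False)
rejected = place(fragmented, 2)

# Same occupancy after compaction: the two-slot module now fits.
compact = [True, True, True, False, False, False]
accepted = place(compact, 2)
print(free_total, rejected, accepted)
```

Both devices have three free slots, yet only the compacted layout can accept a two-slot module, which is why frequent loading and unloading of modules limits the usable free regions.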
Based on the above review of current platform capabilities, issues, and
questions, this thesis contends that the present underuse of partial dynamic
reconfiguration is due in great part to a lack of a standardized development
environment and a common operating system framework platform to support its key
technology benefits.

Figure 1.5.: Overview of a reconfigurable computing platform. The
reconfigurable hardware device is controlled by an operating system which loads
partial tasks on request.
1.2. Contributions
The Erlangen Slot Machine, a new FPGA-based partially reconfigurable platform,
overcomes the I/O bottleneck of existing FPGA-based platforms by implementing a
crossbar-oriented peripheral I/O architecture and dedicated external memory for
up to six partial modules [38, 39, 40, 41, 42]. This reconfigurable platform
also includes an external processor for main control and a dedicated FPGA for
reconfiguration management [43]. These architectural features offload partial
reconfiguration support and management functions from the main FPGA to external
devices [44]. This allows the main FPGA to be used solely for partially
reconfigurable modules. Another resulting feature is a simplified development
process, as static control logic does not interfere with the development of
run-time reconfigurable tasks. Thus, up to 22 partial hardware tasks can be
loaded on demand while satisfying the peripheral I/O access of each partial
task through the external crossbar [45].

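How an external crossbar decouples peripherals from fixed FPGA pins can be illustrated with a minimal routing-table model in Python. This is a conceptual sketch only: the class `Crossbar`, its methods, the slot count and the peripheral names are hypothetical and do not describe the actual ESM crossbar interface or its driver.

```python
class Crossbar:
    """Toy model: each peripheral signal group is routed to one slot at a time."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.routes = {}  # peripheral name -> slot index

    def connect(self, peripheral, slot):
        if slot not in range(self.num_slots):
            raise ValueError("slot out of range")
        self.routes[peripheral] = slot

    def relocate(self, old_slot, new_slot):
        """Re-route all peripherals of a module that moves to another slot."""
        for peripheral, slot in list(self.routes.items()):
            if slot == old_slot:
                self.connect(peripheral, new_slot)


# Hypothetical video module first placed in slot 2, later relocated to slot 5.
crossbar = Crossbar(num_slots=8)
crossbar.connect("video_in", 2)
crossbar.connect("video_out", 2)
crossbar.relocate(old_slot=2, new_slot=5)
print(crossbar.routes)
```

Because only the routing table changes, the module itself needs no feed-through lines and no knowledge of where the peripheral pins are physically attached.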
Also new is the introduction of tool support for the automated transformation
of hardware designs into partial hardware designs at the HDL level. When a part
of an application is moved to a partial hardware module, the design flow was
found to be very time consuming and error prone because a new top-level module
with intermediate communication modules and signals had to be created.
Therefore, a software tool called SlotComposer was developed to generate the
communication glue logic needed for partial reconfiguration at the top HDL
level. The use of platform-tailored communication schemes further reduces the
development time [46, 47].

The complexity of implementing a fully working video application on an FPGA is
high, especially if external memories and peripheral I/Os are used. To prove
the ESM concept's practicability for complex applications using partial
reconfiguration, a video processing application for lane and object detection
for a driver assistance system was successfully implemented [48, 49]. This
application utilizes real-time partial reconfiguration and all features of the
ESM platform.

Our software tool hwtaskgen generates a set of partially reconfigurable hardware tasks for benchmarking purposes. Each generated partial task has a simple communication interface with the operating system firmware running on the PowerPC. The execution time and the physical size of each task are specified before its generation and are therefore fixed at design time. These features enable the comparison of time overheads and different scheduling strategies for partial reconfiguration on various FPGA platforms.
The second application implements a point rendering pipeline on the ESM platform [50]. Point rendering is an alternative 3D rendering scheme based on point clouds instead of traditional triangle meshes. The software part of the application controls the rendering pipeline in real time and is used to precompute coefficients in floating-point format. The point rendering throughput of 60 million pixels per second is independent of the camera view but limited by the memory bandwidth required to read pixels from memory.
Not included in the scope of this thesis are additional contributions to the
following aspects:
• Dynamic NoC approach for the communication infrastructure in reconfigurable devices [51, 52].
• Packet routing in a dynamically changing network on chip [53].
• Task scheduling and module-layout defragmentation for run-time reconfigurable architectures [54, 36, 55].
1.3. Overview
Chapter 2 describes technical advantages and limitations of reconfigurable computing today. The chapter begins with the promise of reconfigurable computing and details aspects such as partial reconfiguration and run-time environments for hardware tasks. It then describes the underlying technology, which consists of fine-grained devices like FPGAs and coarse-grained devices. The former consist of Configurable Logic Blocks (CLBs) operating at bit level, while the latter use a sea of Arithmetic Logical Units (ALUs). Then, Chapter 2 details existing reconfigurable computing platforms and their limitations.
Chapter 3 presents the platform, hardware task and operating system models upon which the Erlangen Slot Machine (ESM) is based. The chapter describes the inter-module communication problem and provides several solutions, all of which were implemented on the ESM platform. Then Chapter 3 presents the ESM platform, which resolves the limitations of existing reconfigurable platforms, and describes the physical implementation of the ESM Motherboard and Babyboard. Finally, a flexible reconfiguration management architecture is detailed and workload scenarios are presented.
Chapter 4 describes the development tools which were implemented to support
the partial module development. It also depicts the operating system framework
which controls the execution of hardware tasks at run-time.
Chapter 5 reports on application scenarios which were implemented on the ESM platform. The main application domain is video processing. In the first application, a real-time video processing architecture for a driver assistance application is presented. The second application uses the ESM for real-time point-based rendering of 3D images.
Chapter 6 concludes the thesis with a review of the results and their significance, and provides directions for further study. The appendix contains a glossary and the technical specification of the ESM platform.
2. Background
2.1. What is Reconfigurable Computing?
The promise of reconfigurable computing is to deliver high-performance acceleration for the domain of compute-intensive applications which are implicitly suited for pipelining and parallel execution.
FPGA-based systems are commonly used in reconfigurable computing because of their hardware reconfiguration, application performance, and widespread availability. In the most common scenarios, FPGAs are used in high-performance computing or in low-volume, high-end hardware devices, like backbone Internet routers or ASIC emulators. Traditionally, FPGAs have been used as glue logic between various I/O standards or interfaces. With the help of hardcore and softcore processors, FPGAs are beginning to enter the embedded market by integrating I/O devices, memory controllers and microprocessors into one device. This positions them directly against established System-on-Chip solutions for low to mid volume quantities, as lower FPGA prices and higher gate counts for each new generation help to drive this change.
Systems using FPGAs retain the execution speed of dedicated hardware but also have software-like functional flexibility. The logic within the FPGA can be changed if or when it becomes necessary. Bug fixes and functionality upgrades can be applied as easily as their software counterparts. For example, releasing a new WLAN access point with a pre-draft specification is feasible with a system based on reconfigurable hardware. When the specification is finalized, the internal logic can be redesigned to reflect the changes and automatically uploaded to the system. After the next system start, the device will be able to use the new version of the protocol.
Reconfigurable computing involves manipulation of the logic within the FPGA at run-time. In other words, the design of the hardware may change in response to the demands placed upon the system while it is running. Here, the reconfigurable hardware acts as an execution engine for a broad range of hardware tasks, in the same manner as a microprocessor acts as an execution engine for a variety of software threads. This allows the system designer to fit more hardware tasks on the chip than would physically fit at the same time, which works especially well when some hardware tasks are occasionally idle. One application example is a smart surveillance camera that supports multiple video denoising filters and multiple object trackers. Depending on weather and lighting conditions, the most appropriate components are selected and reconfigured by the operating system on-the-fly. This enables the camera to deliver consistent performance at reasonable device costs while operating in a changing environment.
2.2. Reconfigurable Hardware
What exactly is reconfigurable hardware and how does it compare to a standard microprocessor? In both cases, the fixed physical functionality consists of transistors and wires built on a silicon substrate. Internal memory elements are used to program the functional units and interconnect structures to form an instruction-specific data path. This data path controls the data source and sink for each functional unit found on the device.
The main differences lie in the frequency with which the functional units change their behavior, the number of functional units, and the programmable interconnect, as shown in Table 2.1. Basically, the microprocessor programs its functional units with every instruction it processes. Being a sequential machine, its objective is to process as many instructions in as few clock cycles as possible. However, each instruction must be fetched from external memory, decoded, and executed, and finally the result must be stored. On the other hand, a reconfigurable hardware device tries to process as much data in parallel as possible using as few instructions as possible. This is achieved through a very high number of small and simple functional units as well as an extensive and programmable interconnect. Reconfigurable devices are programmed only once at start-up to provide an application-specific parallel data path until they are powered down.
On one side, the microprocessor is built to process billions of instructions per second with inherently flexible and sequential conditional data flow. On the other side, a reconfigurable device can process billions of data words with one programmed configuration. Both models have their advantages: the sequential compute model is better suited for control-intensive applications, which in turn are not suited for the massively parallel architectures found in reconfigurable devices.
Key Parameters | Reconfigurable Devices | Microprocessors and DSPs
Number of Functional Units | Typically 64, up to hundreds of thousands | Typically 32
Instructions per second | Few, as up to billions of data words are processed and not instructions | Billions
Computation paradigm | Parallel data computation with high-performance custom memory I/O architecture | Sequential instruction processing
Table 2.1.: Conceptual differences between reconfigurable hardware and microprocessors depicted with the help of architectural key parameters.
2.2.1. Fine-Grained Architectures
The most successful reconfigurable device is the FPGA, which stands for Field-Programmable Gate Array. Its programmable fabric consists of an array of fine-grained logic blocks that operate on the bit level. The array and the interconnect structure are illustrated in Figure 2.1.
The chip area of an FPGA consists of Configurable Logic Blocks (CLBs) arranged in a mesh structure, as shown in Figures 2.1 and 2.2. Each CLB contains several slices and is connected to a switch box which enables long-distance and local connections to other CLBs. Each slice inside the CLB is a self-contained logic block with two Look-Up Tables (LUTs) and corresponding flip-flops, as shown in Figure 2.3. Signals used for carry propagation can be directly linked to the upper and lower neighbor CLBs in order to allow efficient synthesis of adders. The space between the CLBs is filled with interconnect consisting of segmented wires and programmable switch points which occupy up to 90% of the FPGA's chip area [56]. The edges of the chip contain Input/Output Blocks (IOBs) and Digital Clock Managers (DCMs), as in the case of the Xilinx Virtex-II architecture. The regularity of the mesh structure is disrupted by memory blocks and embedded hardware multiplier columns, as illustrated in Figure 2.2. In the case of the Virtex-5 architecture [27], the IOBs are grouped into I/O banks and distributed in special columns inside the CLB array.
The XC2064 from Xilinx [57] was introduced in 1985 and was the first commercially available FPGA. It distinguished itself from previous programmable logic devices through 64 configurable logic blocks and a flexible interconnect between them. Its SRAM-based configuration memory defines the functionality of each logic block and their connections but could only be programmed at start-up. In 1996 Xilinx introduced the XC6200 series [58], the first partially reconfigurable FPGA. One of today's most advanced FPGAs, the Virtex-5 family [59, 27], is still SRAM-based and contains up to 330,000 logic blocks coupled with dedicated hardware blocks for I/O, memory, clock management and dedicated arithmetic units. Moreover, it allows parts of its logic blocks and interconnect to be reprogrammed during run-time.
Figure 2.1.: Basic logical structure of an FPGA device. (Diagram labels: CLB, switch box, logic block, routing channel.)
During the programming process of an FPGA, configuration data is written into an internal SRAM-based configuration memory. The programming process is called full configuration or reconfiguration because all internal elements of the FPGA are set to a new state which implements the desired digital design. The configuration data, also called bitstream, specifies the functionality of each logic block and the connections between them. Thus, every SRAM-based FPGA must be configured from an external source prior to its operation.
Partial reconfiguration is restricted to only a part of the FPGA device area and can only be performed during run-time. This enables the design of computing elements which are adaptable during run-time. Moreover, it allows system components to be dynamically modified, replaced or added while the remaining circuits continue to operate undisturbed [60, 61].
Today, SRAM-based configuration storage dominates, although non-volatile technologies are also available. The two main vendors of SRAM-based FPGAs are Xilinx and Altera. Flash storage of configuration data is used, for example, in ProASIC devices from Actel [62] and LatticeXP devices from Lattice Semiconductor [63]. Storing configuration data inside the FPGA in a flash memory has the benefit of instant device start-up since no data has to be loaded from an external source as in the case of SRAM-based FPGAs.
The downside of flash memory is its slow write access, which is several orders of magnitude slower than SRAM. The one-time programmable anti-fuse technology used by Actel [64] provides the most secure and restricted programming scheme, as the configuration of the FPGA cannot be changed or read back after the first device initialization.
Because FPGAs work efficiently on single-bit signals, they are termed fine-grained reconfigurable hardware. Boolean functions and finite state machines can be implemented in parallel fashion with maximum performance on these architectures. This is due to the simplicity of each k-input look-up table (LUT) inside every configurable logic block (CLB), which can be programmed to compute every k-ary Boolean function $f : B^k \rightarrow B$, where $B = \{0, 1\}$. However, large word length computations, for example greater than 128 bit, start to cause interconnect congestion problems. This manifests itself in timing problems and lengthy place and route phases, as thousands of functional units have to be connected together under strict timing constraints.
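The way a LUT realizes an arbitrary k-ary Boolean function can be illustrated with a small software model. The following is only a sketch: the helper names program_lut and lut_read and the example function are assumptions made for illustration and do not correspond to any vendor tool or primitive.

```python
from itertools import product


def program_lut(func, k):
    """Return the 2**k-entry truth table a k-input LUT would store for func."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]


def lut_read(table, *inputs):
    """Evaluate the LUT: the input bits form the address of the stored output bit."""
    address = 0
    for bit in inputs:  # the first input is the most significant address bit
        address = (address << 1) | bit
    return table[address]


def example_function(a, b, c, d):
    """Illustrative 4-input function: (a AND b) XOR (c OR d)."""
    return (a & b) ^ (c | d)


table = program_lut(example_function, 4)  # 16 configuration bits
```

Reprogramming the LUT amounts to replacing the stored table; the read logic stays unchanged, which is why the same physical element can implement any function of its k inputs.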
Figure 2.2.: Global view of the array structure inside a Xilinx Virtex-II FPGA. Note that the interconnect between the CLBs is not shown but comprises 80% to 90% of the total chip area [65, 56].
The Virtex-II 6000 FPGA from Xilinx [25] is the main computing engine of the Erlangen Slot Machine (ESM), which has been built to overcome many problems of partial reconfiguration mentioned in Section 1.1. This FPGA device contains a large number of resources on a single chip, as listed in Table 2.2. In the following, we describe the structure and all important elements of this FPGA family.
CLB Array Size | 96 x 88
Number of Slices | 33792
Max. Size of Distributed RAM | 1.056 Mbit
Block Multipliers | 144
BlockRAMs | 144
Max. Size of BlockRAM | 2.592 Mbit
DCMs | 12
Table 2.2.: Technical data of the Virtex-II 6000 FPGA from Xilinx [25].
The FPGA contains an array of 8448 Configurable Logic Blocks (CLBs) which is overlaid with a second sparse array of Block Multipliers and BlockRAMs, as shown in Figure 2.2. The connectivity to external devices is provided by dedicated I/O blocks which are located near each I/O pin. The Global Clock Mux and the Digital Clock Manager (DCM) are used for global clock distribution and for clock cycle adjustments of individual areas on the chip. The interconnect between CLBs is not shown in this figure.
Configurable Logic Block
The CLB is the main building block of each FPGA structure. The number of CLBs located on an FPGA and the interconnect structure define its performance and the complexity level of a logic design that can be implemented. More CLBs allow more complex, parallel, and pipelined digital designs to be built.
The CLB itself is subdivided into smaller parts, called slices. In the Virtex-II FPGA family, four slices are located inside each CLB and all four of them are connected to a switch matrix and a fast connect bus. Figure 2.3 depicts the internal connections inside a CLB. The fast connect bus allows the direct connection of slices which are located in close proximity. Connections not supported through the fast connect bus are routed outside the CLB through the switch matrix.
Figure 2.3.: Internal structure of a Configurable Logic Block and a slice element. The left figure shows that a CLB consists of four slices and a switch matrix for long-distance connections [25]. The right figure depicts the internal structure of a slice. It can be configured to implement logic functions or used as a memory element. Each slice contains two registers (flip-flops).
Slice
Slices are basic elements inside each CLB that implement logic functions. For Virtex FPGAs, each slice contains two look-up tables and two flip-flops. The flip-flops can be used to store the output of a look-up table. All logic functions can be implemented with the help of LUTs. Boolean functions with four inputs can be realized with a LUT by storing a complete truth table for this function. Functions of arbitrary input width can be implemented by combining several LUTs. Because the configuration of the truth table is stored in SRAM cells, each CLB can be configured to act as a shift register or simple memory cell. In the latter case, the term Distributed Memory is used for simple memory cells which are based on LUTs. The Virtex-II 6000 FPGA can implement up to 1056 Kbit of Distributed Memory on chip. However, the use of look-up tables for memory purposes renders them unusable for the implementation of logic functions.
BlockRAM
To save logic resources, memory can be directly instantiated in dedicated memory blocks found inside the FPGA. These memory blocks are called BlockRAMs and have a size of 18 Kbit. The accumulated memory size on the Virtex-II 6000 FPGA is 2592 Kbit. BlockRAMs are located in special columns on the FPGA, as shown in Figure 2.2. Moreover, each memory block has a dual-ported address and data interface to allow two independent reads or writes on the memory. Concatenation can be used to create larger memory blocks. Therefore, BlockRAMs are the best choice for the implementation of large memory blocks as long as the required memory size fits and timing constraints are met. Otherwise, external memory resources have to be used, with the drawback of higher latencies and higher power consumption for external I/O access.
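The resource figures quoted for the Virtex-II 6000 can be cross-checked with simple arithmetic. The sketch below uses only values stated in the text (array size, slices per CLB, LUTs per slice, four-input LUTs, BlockRAM count and size); the constant names are chosen for illustration, and 1 Kbit is assumed to mean 1024 bits.

```python
CLB_ROWS, CLB_COLS = 96, 88   # CLB array size (Table 2.2)
SLICES_PER_CLB = 4            # Virtex-II: four slices per CLB
LUTS_PER_SLICE = 2            # two look-up tables per slice
LUT_INPUTS = 4                # a four-input LUT stores 2**4 truth-table bits
BLOCKRAMS = 144               # number of BlockRAMs (Table 2.2)
BLOCKRAM_KBIT = 18            # size of one BlockRAM in Kbit
BITS_PER_KBIT = 1024          # assumption: binary Kbit

clbs = CLB_ROWS * CLB_COLS
slices = clbs * SLICES_PER_CLB
distributed_kbit = slices * LUTS_PER_SLICE * 2**LUT_INPUTS // BITS_PER_KBIT
blockram_kbit = BLOCKRAMS * BLOCKRAM_KBIT

print(clbs, slices, distributed_kbit, blockram_kbit)
```

Under these assumptions the computed totals are 8448 CLBs, 33792 slices, 1056 Kbit of Distributed Memory and 2592 Kbit of BlockRAM, which agrees with the slice count and memory sizes listed in Table 2.2.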
Block Multipliers
Similar to fixed BlockRAM elements, the Virtex-II FPGA family contains fixed hardware multipliers. Due to their fixed hardware functionality, they execute very fast and do not consume any logic resources. They are physically grouped with the BlockRAM columns, as shown in Figure 2.2. Each Block Multiplier has a fixed input size of 18 bit and the Virtex-II 6000 FPGA contains 144 multipliers.
Digital Clock Manager
Clock distribution inside the FPGA is a critical feature. The Digital Clock Manager (DCM) is a vital element of the clock net. The DCM can synthesize a custom clock frequency with an adjustable clock phase.
Bus-macros
Bus-macros are FPGA-specific hard-macros, that is, fixed logic blocks that are pre-placed and pre-routed. They are used as fixed data paths for signals going in and out of a reconfigurable module, as shown in Figure 2.4 [21]. The HDL code should ensure that any reconfigurable module signal that is used to communicate with another module does so only by first passing through a bus-macro. There are device-specific versions of bus-macros.
Each bus-macro provides a fixed number of bits for the inter-module communication, typically 8 or 16 bits. Custom-made bus-macros with a user-defined data width are also possible but require extensive overhead for the design and routing of these hard-macros. The number of instantiated bus-macros must match the number of bits traversing the boundaries of the reconfigurable modules. As an example, if reconfigurable module A communicates via 32 bits with module B, then four (32/8) bus-macros with 8 bits each are needed to define the data paths between modules A and B.
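The bus-macro count in this example follows from a ceiling division of the signal width by the bus-macro width. A minimal sketch, in which the function name bus_macros_needed is a hypothetical helper and not part of any vendor flow:

```python
def bus_macros_needed(signal_bits, macro_width=8):
    """Number of bus-macros required to carry signal_bits across a module boundary."""
    if signal_bits < 0 or macro_width <= 0:
        raise ValueError("signal_bits must be >= 0 and macro_width must be > 0")
    # Ceiling division: a partially used bus-macro still has to be instantiated.
    return -(-signal_bits // macro_width)


print(bus_macros_needed(32, 8))   # the example from the text: 32 / 8 = 4
```

A width that is not a multiple of the bus-macro width rounds up, so a 33-bit interface would need five 8-bit bus-macros in this model.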
If a signal passes through a reconfigurable module connecting the two modules on either side of the reconfigurable module, bus-macros must be used to make that connection. This effectively requires creation of an intermediate signal that is defined in the reconfigurable module. The signal cannot be actively used during the time the reconfigurable module is being configured.
There are several different types of bus-macros supplied by Xilinx, allowing designers to choose from signal directions that are left-to-right or right-to-left for Virtex-II/Pro, and left-to-right, right-to-left, top-to-bottom, or bottom-to-top for Virtex-4 FPGAs, as shown in Fig. 2.4. The physical width of the bus-macro can also be chosen (wide - 4 CLBs wide or narrow - 2 CLBs wide), as can whether signals passing through the bus-macro are registered or not (synchronous vs. asynchronous).
However, most vendor-provided bus-macros, regardless of direction or physical width, provide eight bits of data width and enable/disable control. This limitation can be overcome with the use of custom-made bus-macros, as used in the ReCoBus [66] or the Caronte project [67]. However, both projects require an additional design flow with very device-specific and technology-dependent libraries.
Figure 2.4.: Usage of bus-macros inside a Virtex-II FPGA between partially reconfigurable modules (PRMs) and the static base design or other partially reconfigurable modules. (Diagram labels: L2R, R2L, Partial Reconfigurable Module.)
2.2.2. Coarse-Grained Architectures
Coarse-grained dynamically reconfigurable devices promise to deliver higher performance at a lower cost than FPGAs. Their goal is to increase performance for a given application domain by reducing flexibility. However, they are no longer capable of implementing arbitrary digital circuits like FPGAs.
Similar to an FPGA, coarse-grained devices consist of an array of Processing Elements (PEs) whose functions and interconnect can be changed during run-time. A PE provides an ALU for numerical and logical calculations, logic for shift/mask operations, a register file and multiplexers for switching the data flow between PEs.
The processing element is called coarse-grained because it is no longer bit-oriented: its data path width can range from 8 bit to 64 bit. Compared to FPGAs, a coarse-grained device operates on data words and not on single bits. Therefore, the ALU is optimized for one specific word length. This reduces the cost and the power consumption through a smaller die size, compared to an ALU structure implemented in an array of LUTs on an FPGA device. The limited number of processing elements and a restricted interconnect structure reduce flexibility but also the amount of configuration data. This leads to faster configuration times (within a few clock cycles) for a coarse-grained architecture when compared with FPGAs. It also enables time-multiplexed execution of parallel threads through partial reconfiguration. Examples of coarse-grained reconfigurable architectures are RaPID [68], Matrix, PipeRench [69], ADRES [70], and PACT XPP [71].
Figure 2.5 shows the array and internal PE structure of a coarse-grained architecture called WPPA (weakly-programmable processor array) [72, 73]. The data path width of the PE can be set at design time, varying from 8 bit to 64 bit. Together with interconnect customization, this enables the designer to select the most appropriate architecture for a specific application domain. The operation of the ALU, the shift/mask logic, and the data paths between components are controlled with instructions stored in the local instruction memory. During execution, the PE reads only its local instruction memory, forgoing slow external memory accesses. The size of the PE array can be set at design time, from 16 to 512 PEs. On the edge of the PE array, distributed memory modules can be provided to hold streaming data. Input/output data is directly transferred to/from each PE or the distributed memory modules.
Figure 2.5.: Example of a coarse-grained reconfigurable architecture WPPA with parameterizable processing elements (WP PEs) [72, 73].
Dynamic reconfiguration can be used to enhance the area efficiency by changing PE functionality and PE interconnect structure at run-time. By using a single PE array for multiple tasks, the chip area gains computational density and post-production flexibility. The simplest way to achieve this is to store multiple sets of configuration data in each PE and to control the switching with a global configuration manager. New array configurations can be loaded in the background if a dedicated configuration bus is implemented.
2.2.3. Configurable Processors
In general, reconfigurable architectures target the acceleration of software. Depending on the application, the reconfigurable hardware can be loosely coupled to a microprocessor via the processor bus or shared memory. This approach allows a standard computer workstation to be extended by attaching an accelerator card to the motherboard. The downside of this approach is limited bandwidth and high communication latency between the host processor and the reconfigurable device, which forces the accelerator to operate with relative autonomy. In most cases, compute-intensive data is offloaded to the accelerator card and the results are collected after processing, without intensive communication during the processing phase. Thus, only compute-intensive applications can benefit from acceleration. Examples of this loose system coupling include Splash2 [74], Celoxica RCHTX [75], ClearSpeed [75], and other PCI or PCIe based reconfigurable accelerator boards. Further examples are the Cray XD1 and SGI RC100 accelerator cards for high-performance clusters. They both contain two large coprocessor FPGAs with access to local high-speed memory and custom communication links which can be used transparently by the software applications.
In more effective schemes for closely-coupled systems, the reconfigurable hardware can be implemented as a coprocessor connected directly or through a dedicated memory buffer to the processor. GARP [76], REMARC [77], and MorphoSys [78] are examples of such architectures.
The integration of a reconfigurable fabric (also called reconfigurable functional unit) into the data path of a processor, or embedding a microprocessor directly into an FPGA, generates a very tightly coupled system. In the first case, the reconfigurable hardware becomes an integral part of the processor architecture. The reconfigurable functional unit can be configured to compute application-specific custom instructions. These instructions can be used like any other processor instructions. Through run-time configuration of the reconfigurable unit, new custom instructions can be created on-the-fly. Examples of these architectures are Stretch S5000 [79], OneChip [80], DISC [81], Chimera [82] and MOLEN [83].
In the second case, the microprocessor itself is embedded inside the reconfigurable hardware. For example, the IBM PowerPC 405 hardcore processor is physically embedded inside Virtex-II/Pro FPGAs [84]. Another method is to generate a custom softcore microprocessor for the FPGA which can be customized to application-specific needs but occupies valuable logic resources. This is performed with optimized and FPGA-specific IP-core generators which allow a high degree of customization. Examples of such softcore microprocessors are Xilinx MicroBlaze [85], Altera Nios-II [86], and the ARM7 processor core [87] for Actel FPGAs.
2.2.4. Related Computing Platforms
The potential to accelerate supercomputing applications motivated several projects to explore reconfigurable computing systems. Similar to existing supercomputers, a large number of FPGAs were embedded in dedicated network topologies. Two examples from the early 1990s are the Splash-II [74] and the Programmable Active Memory (PAM) [88]. Splash contained 32 and PAM 25 FPGAs. Both systems proved their impressive performance by outperforming standard supercomputers in several application areas [89].
The Berkeley Emulation Engine [90] is a new member of the high-performance computing arena. The current BEE2 [91, 92] FPGA-based platform is designed to be modular and scalable in order to accelerate a wide range of application domains such as real-time signal processing, scientific computing, and large-scale simulation and emulation. The platform is based on the BEE Motherboard containing five large FPGAs with high-speed memory and communication interfaces. Depending on the application requirements, a network of BEE Motherboards and storage modules is combined to create the reconfigurable computing system. One example application is the Research Accelerator for Multiple Processors Project (RAMP) [93, 94], which emulates a thousand-core multiprocessor system where each FPGA hosts several softcore processors.
2.3. Partial Reconfiguration
FPGAs load their configuration from external memory during start-up or can be made to do so by asserting a chip reset signal. This means that the FPGA must be re-programmed entirely and its current internal state is lost. In order to benefit from concurrent hardware tasks which can be loaded independently during run-time into the FPGA, partial reconfiguration and read-back of the internal hardware task state must be supported [25, 95, 21]. However, loading tasks into the device is a sequential process and the overhead for each task is typically proportional to its configuration data size.
Whenever possible, a reset of the FPGA should be avoided, because a complete new configuration has to be written to the FPGA, whereby all internal states are lost. Partial reconfiguration, also known as partial dynamic or run-time reconfiguration, allows partial changes of the FPGA logic without affecting the state of other logic blocks. This means that parts of the FPGA unaffected by the partial reconfiguration process continue to work without any interruption. Changing a small block of logic is much faster than reconfiguring the entire FPGA, as the reconfiguration overhead is proportional to the chip area occupied by the logic block. The more configuration overhead there is, the more likely it is that the system performance will fall below that of a fixed-hardware version when partial reconfiguration is performed too frequently.
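The proportionality between occupied area and reconfiguration overhead can be made concrete with a simple model. The sketch below is an illustration only: the bitstream size, the configuration port rate, and both function names are hypothetical placeholders and not measurements of the ESM or of any specific device.

```python
def reconfig_time_s(full_bitstream_bytes, area_fraction, port_bytes_per_s):
    """Reconfiguration time, assuming the data size scales linearly with area."""
    return full_bitstream_bytes * area_fraction / port_bytes_per_s


def useful_fraction(exec_time_s, reconfig_s):
    """Share of the total time a region spends computing rather than loading."""
    return exec_time_s / (exec_time_s + reconfig_s)


FULL_BYTES = 2_000_000   # hypothetical full bitstream size
PORT_RATE = 50_000_000   # hypothetical configuration port rate in bytes/s

full = reconfig_time_s(FULL_BYTES, 1.0, PORT_RATE)   # whole device
slot = reconfig_time_s(FULL_BYTES, 0.1, PORT_RATE)   # a region covering 10% of the area

# A task that runs only as long as its own load time spends half its time loading.
print(full, slot, useful_fraction(slot, slot))
```

In this model a task must execute considerably longer than its load time before swapping it in pays off, which is the reason why too frequent partial reconfiguration can push performance below that of a fixed-hardware design.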
A hardware task is a functional hardware component/module that contains its own configuration and run-time dependent state information. Hardware module relocation allows a hardware task to be loaded and executed in any free reconfigurable region. Hardware modules should be developed in a position-independent way to be relocatable. However, the configuration data, sometimes referred to as bitstream or bitfile, references absolute CLB positions inside the FPGA. This requires an extra translation step to change the position information inside the bitfile to the desired reconfiguration region. Otherwise, each partial module has to be synthesized in a separate process for each possible reconfiguration region.
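The translation step can be pictured with a deliberately simplified model. In the sketch below, a partial bitfile is represented as a list of (CLB column, payload) pairs; this representation and the function relocate are assumptions made purely for illustration, since real bitstreams encode frame addresses in a device-specific format that is not reproduced here.

```python
def relocate(frames, column_offset, device_columns):
    """Return a copy of frames with every column address shifted by column_offset.

    frames: list of (clb_column, payload) pairs (hypothetical model, not a real
    bitstream format). The payload itself is left untouched.
    """
    relocated = []
    for column, payload in frames:
        new_column = column + column_offset
        if not 0 <= new_column < device_columns:
            raise ValueError(f"column {new_column} lies outside the device")
        relocated.append((new_column, payload))
    return relocated


module = [(4, b"\x01\x02"), (5, b"\x03\x04")]           # synthesized for columns 4..5
moved = relocate(module, column_offset=12, device_columns=88)
```

Only the position information changes while the configuration payload stays identical, which is what makes a single synthesis run reusable for several regions, provided the target region offers the same resources.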
To actually carry out a dynamic placement of a hardware module during run-time, a few assumptions are required. First, it is desirable to add constraints on the size and shape of the relocatable hardware module. These constraints limit the number of possible choices within the FPGA and make run-time placement algorithms more efficient and effective. Second, inter-module and off-chip communication require fixed communication access points that must be known at design time of a relocatable hardware task. Hence, every hardware task should adhere to a standard communication interface. This paves the way for greater hardware task re-use and is especially important if a hardware task library has to be maintained.
As most hardware tasks are comparable to functional logic blocks, it is safe to assume that many existing hardware designs can be split and migrated to relocatable hardware tasks. One way to accomplish this migration in a time-effective manner is to build a thin wrapper around the existing logic block without any modifications to its original behavior. This task wrapper itself is part of a hardware task framework which is always present within the reconfigurable device. The framework itself provides inter-task communication support and access to off-chip peripherals and external memory devices through a standard interface.
Due to the dynamic nature of reconfigurable computing, it is helpful to have software components manage various configuration processes at run-time. These tasks can be divided into:
• Deciding which hardware objects to execute, where on the device and when.
• Swapping of hardware tasks into and out of the reconfigurable logic.
• Switching communication paths between hardware tasks or between hardware tasks and peripheral I/O devices.
This embedded software is analogous to an operating system that manages the execution of multiple software threads. Like threads, hardware tasks may have priorities, deadlines, dependencies and communication/memory constraints. The
goal of the run-time environment is to organize this information and make decisions based upon it. As there are decisions to be made while the system
is running, algorithms have to be developed for on-line scheduling and placing of hardware tasks. The on-line scheduler is responsible for deciding which
hardware tasks are currently running. However, it is not possible to run tasks
without the placer, which manages the 2D free hardware area within the
reconfigurable device. Moreover, the placer is also responsible for keeping track
of all used communication channels. In order to meet all hardware task constraints,
communication-aware placement has to be combined with the scheduling process; for example, it makes no sense to schedule a task for execution when its
memory or communication constraints cannot be met by the placer.
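The interplay between scheduler and placer described above can be illustrated with a minimal sketch. It is not the run-time system of any particular platform: the free area is simplified to one dimension, and the names `HwTask`, `Placer` and `schedule`, the first-fit strategy and the priority convention are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class HwTask:
    name: str
    width: int          # number of adjacent area units needed (hypothetical unit)
    needs_memory: bool  # True if the task requires a region with external memory
    priority: int       # larger value = more urgent (assumed convention)


class Placer:
    """Manages the free area; simplified to one dimension for brevity."""

    def __init__(self, num_units: int, memory_units: Set[int]) -> None:
        self.owner: List[Optional[str]] = [None] * num_units
        self.memory_units = memory_units  # units that can reach a memory bank

    def find(self, task: HwTask) -> Optional[int]:
        """First-fit search for a free region that also meets the memory constraint."""
        for start in range(len(self.owner) - task.width + 1):
            region = range(start, start + task.width)
            if any(self.owner[i] is not None for i in region):
                continue  # region is at least partly occupied
            if task.needs_memory and not any(i in self.memory_units for i in region):
                continue  # memory constraint cannot be met in this region
            return start
        return None

    def place(self, task: HwTask, start: int) -> None:
        """Mark the region as occupied by the task."""
        for i in range(start, start + task.width):
            self.owner[i] = task.name


def schedule(ready: List[HwTask], placer: Placer) -> Tuple[Dict[str, int], List[str]]:
    """Start tasks in priority order, but only if the placer can satisfy them."""
    running: Dict[str, int] = {}
    deferred: List[str] = []
    for task in sorted(ready, key=lambda t: t.priority, reverse=True):
        start = placer.find(task)
        if start is None:
            deferred.append(task.name)  # constraints cannot be met right now
        else:
            placer.place(task, start)
            running[task.name] = start
    return running, deferred
```

In this sketch the scheduler never starts a task on its own; it always asks the placer first, which mirrors the requirement that scheduling and communication-aware placement be combined.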
2.4. Technical Advantages and Limitations
Reconfigurable computing has the advantage of greater functional density through
the use of a simpler hardware device. Needed logic blocks can be loaded on
demand into the reconfigurable device. The high device cost can be reduced to
the low cost of a smaller device and an additional memory required to store the
logic design. Because many new systems have a network connection, this cost
of extra configuration memory can be cut by implementing on-demand update
strategies directly in the operating system of the reconfigurable device.
There are several advantages of reconfigurable computing over general-purpose
processors on the one hand, and ASICs on the other hand:
• Acceleration of various compute-intensive applications and very high speed
implementations of sequential tasks.
• Easy support for bug fixes and upgrades in the field with no down time.
Moreover, reconfigurable devices enable aggressive time-to-market strategies with only core features being implemented on roll-out. Missing features can be delivered later on via an upgrade. This also extends the
life cycle of the system, thus reducing costs for the owner.
• Hardware multitasking enables multiple applications to run concurrently
on the same device. This enables true reconfigurable computing with multiple optimized applications running concurrently on one reconfigurable
device. As parts of the application can be developed independently, these
systems can have shorter design and implementation cycles.
• Hardware sharing between hardware tasks is realized because running
tasks can be replaced. Benefits are reduced device size, reduced power
consumption and overall lower costs.
• Shorter reconfiguration times through partial reconfiguration enable frequent reconfiguration changes if required by the application. This enables
self-adaptive systems which deliver consistent performance in changing
environments.
However, there are three main limitations which need to attract more attention
in order to move reconfigurable computing towards mainstream adoption.
Compilers and back-end tools for reconfigurable computing are still under development. Not only are efficient high-level compilers supporting partial reconfiguration missing, but the low-level back-end tools coming from the corresponding
chip vendors must also be improved for a productive design environment [19]. Commercial support for partial reconfiguration must also be available together with
a well-defined design flow.
Debugging support for partial reconfiguration is currently lacking, as neither a
debugging methodology nor supporting tools are available. Clock distribution and communication channels in reconfigurable systems are another
problem source.
Finally, the inability to verify run-time reconfigurable systems is an obstacle
for medical, aeronautical and mission-critical control systems. The only viable
option is to emulate the run-time behavior. This can be done by implementing
all partially reconfigurable tasks at the same time on a much larger device and
selecting the correct module through additional multiplexers. However, the
reconfiguration is only emulated and the reconfiguration process itself is not
performed.
3. The Erlangen Slot Machine
3.1. Introduction
Partial reconfiguration requires run-time loadable modules that are pre-compiled
and whose bitstreams are stored in an external memory device; these bitstreams are then
used to reconfigure a dedicated region on the FPGA. Several models and
algorithms for on-line placement have been developed in the past, see e.g.,
[96, 97, 98, 3]. However, these algorithms are limited by two main factors.
First of all, the model assumptions are often not realistic enough for implementation on real hardware or require a tedious development process. Second, the
development process of reconfigurable modules is subject to many restrictions
that make a systematic development process for partial reconfiguration difficult.
Until now, no FPGA-based platform on the market provides a solution to the
problems of design automation for dynamically reconfigurable hardware modules
and their efficient and flexible relocation. The purpose of the Erlangen Slot
Machine (ESM) [4, 99, 100, 101, 45, 40, 42] is to overcome many of the
deficiencies of existing FPGA-based reconfigurable computers by providing:
• A new flexible FPGA-based reconfigurable platform that supports relocatable hardware modules arranged in so-called slots.
• Tool support for the development of run-time reconfigurable computation and communication schemes using new inter-module communication
paradigms.
• A powerful reconfiguration manager which enables various pre-processing
stages for fast bitstream manipulation. We call the pre-processing stages
plug-ins. For example, a relocation plug-in can be selectively activated
before a bitstream is uploaded to the FPGA.
Reconfiguration times in the range of seconds [102] are not sufficient for applications that require a fast reaction to external events. Our hardware reconfiguration manager is the foundation for reconfiguration times in the range of
milliseconds. For example, these fast reconfiguration times will allow a seamless
switching of video filters in a video pipeline processing architecture.
The main goal of the Erlangen Slot Machine's architecture [99, 100, 103, 2] is
to accelerate application development as well as research in the area of partially
reconfigurable hardware. The Erlangen Slot Machine owes its name to its
arrangement of reconfigurable slots which can be changed independently. This
modular organization of the device simplifies relocation, a primary condition
for a viable partially reconfigurable computing system. Each module moved
from one slot to another will come across equal resources.
The advantage of the ESM platform is its one-dimensional (1D) slot-based
architecture with support for varying slot widths. Slots are predefined reconfigurable regions in which hardware tasks can be exchanged during run-time. The
slot architecture on the ESM enables the execution of independent as well as
communicating hardware tasks by delivering peripheral data through a separate
crossbar switch to each slot. This is shown in Figure 3.1.
We decided to design an off-chip crossbar in order to leave as many resources
as possible free on the FPGA for partially reconfigurable modules. The ESM architecture is
based on the flexible decoupling of the FPGA I/O-pins from a direct connection
to an interface chip. This flexibility allows independent placement of application
modules in any available slot at run-time. Thereby, run-time placement is not
constrained by physical I/O-pin locations, as the I/O-pin routing is performed
automatically in the crossbar; thus, the I/O-pin dilemma is solved in hardware.
The ESM platform, as shown in Figure 3.1, is centered around a large FPGA
serving as the main reconfigurable engine and a second FPGA realizing the
crossbar switch for peripheral I/O access. These two FPGAs are placed on two
physical boards, called Babyboard and Motherboard. The main reconfigurable
device is a Xilinx Virtex-II 6000 FPGA and is located on the Babyboard.
[Figure 3.1: block diagram. Labels: ESM Motherboard, ESM Babyboard, Main FPGA with slots S1, S2, S3, …, SN, SRAM, Flash, Reconfiguration Manager, PowerPC, Crossbar, SDRAM, Peripherals.]
Figure 3.1.: ESM architecture overview with main FPGA, crossbar and an external PowerPC microprocessor for system control functions. The
architecture of the Babyboard is further refined in Figure 3.7. The
Motherboard is shown in Figure 3.12.
The crossbar is implemented by a Xilinx Spartan-IIE 600 FPGA and is located
together with all peripheral I/Os on the Motherboard. Figure 3.1 shows the
slot-based architecture of the ESM consisting of the Virtex-II 6000 FPGA, local
SRAM, configuration memory and a reconfiguration manager FPGA.
The number of reconfigurable slots depends on the number of I/O pins needed
for SRAM access. If SRAM access is not required, then the slot width depends
only on the number of I/Os connected to the crossbar interface. All I/O blocks
of the main FPGA are located at the edges of the device, as shown in Figure
2.2. The top pins in the north of the FPGA connect to local SRAM banks.
These SRAM banks thus solve the problem of restricted intra-module memory,
in the case of video applications, for example. The bottom pins in the south
connect to the crossbar switch. Therefore, a module can be placed in any free
slot and have its own peripheral I/O-links together with dedicated local external
memory. The slot width is only predefined if hardware modules require access
to the external SRAM. This is due to the fixed number of signals needed to
access and to control one SRAM device.
3.2. Communication Models
One of the central limiting factors for the wide use of partial dynamic reconfiguration that has not yet been addressed is the problem of inter-module communication. Each
module that is placed on one or more slots on the device must be able to communicate with other modules. For the ESM, we investigated and provide four
main paradigms for communication among different modules, as shown in Figure
3.2. The first one is direct communication using bus-macros [104, 105, 106]
between adjacently placed modules. Secondly, shared memory communication
using SRAMs or BlockRAMs is possible. However, only adjacent modules can
use these two communication modes. For modules placed in non-adjacent slots,
we provide a dynamic signal switching communication architecture called reconfigurable multiple bus (RMB) [107, 46]. In [108] we presented an ILP model
for minimizing the communication cost for RMB slot modules. Finally, the
communication between two different modules can also be realized through the
external crossbar.
Communication between Adjacent Modules
On the ESM, bus-macros are
used to realize a direct communication between adjacently placed modules, providing fixed communication channels that help to preserve signal integrity upon
reconfiguration. Bus-macros provide a means of locking the routing between
partially reconfigurable modules (PRMs) and the static base design, making the
PRMs pin compatible with the base design. As a result, all connections
between the PRMs and the base design must pass through a bus-macro, with
the exception of the clock signal (global signals, GND and VCC, are handled
automatically by the Xilinx design flow tools in a way that is transparent to
the user). As stated in Section 2.2.1 on page 35, eight signals can be passed
through each standard Xilinx bus-macro [21]. Hence, the number of bus-macros
needed to connect a set of $n$ signals between two PRMs is $\lceil n/8 \rceil$. The use of
custom-built bus-macros would allow the data width per bus-macro to be defined
arbitrarily, but requires extensive overhead to design and manually route the
hard macro.
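As a small worked example of the bus-macro count stated above, the ceiling division can be written as follows. The function name and the default of eight signals per macro (the figure quoted for the standard Xilinx bus-macro) are illustrative choices, not part of any tool flow.

```python
def bus_macros_needed(n_signals: int, signals_per_macro: int = 8) -> int:
    """Number of bus-macros required to pass n_signals between two PRMs."""
    if n_signals < 0 or signals_per_macro <= 0:
        raise ValueError("signal counts must be non-negative and macro width positive")
    # Ceiling division without floating point: -(-a // b) == ceil(a / b)
    return -(-n_signals // signals_per_macro)
```

For instance, a 32 bit data bus plus two handshake signals (34 signals) would need five standard bus-macros under this formula.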
[Figure 3.2: four panels, a) to d), each showing an FPGA with slots S1, S2, S3. Further labels: SRAM, Crossbar.]
Figure 3.2.: Inter-module communication possibilities on the ESM: a) bus-macro, b) shared memory, c) reconfigurable multiple bus (RMB),
d) external crossbar. Hardware modules can also communicate with software running on the PowerPC microprocessor via the crossbar.
Communication via Shared Memory
Communication between two neighboring
modules can be done in two different ways using shared memory. First,
dual-ported BlockRAMs can be used for implementing communication between
two neighboring modules working in two different clock domains. The sender writes
on one side, while the receiver reads the data on the other side. The second
possibility uses external RAM. This is particularly useful in applications in which
each module must process a large amount of data and then send the processed
data to the next module, as is the case in video streaming.
On the ESM, each SRAM bank can be accessed by the module placed below it as
well as by the neighbors placed to its right and left. A controller is used to manage the
SRAM access. Depending on the application, the user may set the priority of
accessing the SRAM for the three modules.
Communication via RMB
In its basic definition, the Reconfigurable Multiple Bus (RMB)
architecture [109, 110, 46, 47] consists of a set of processing
elements or modules, each possessing an access to a set of switched bus connections to other processing elements. The switches are controlled by connection
requests between individual modules.
The RMB is a one-dimensional arrangement of switches between N slots (see
Figure 3.3). In our FPGA implementation, the horizontal arrangement of parallel switched bus line segments allows for the communication among modules
placed in the individual slots. The request for a new connection is done in a
wormhole fashion, where the sender (a module in slot $S_k$) sends a request for
communication to its neighbor (slot $S_{k+1}$) in the direction of the receiver. Slot
$S_{k+1}$ sends the request to slot $S_{k+2}$, etc., until the receiver receives the request
and returns an acknowledgment. The acknowledgment is then sent back in the
same way to the sender.
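The hop-by-hop request and acknowledgment can be traced with a short behavioral sketch. This is only a model of the message order described above, not the hardware controller; the function name `rmb_handshake`, the 1-based slot indices and the support for both directions are assumptions made for illustration.

```python
from typing import List, Tuple


def rmb_handshake(sender: int, receiver: int, num_slots: int) -> Tuple[List[int], List[int]]:
    """Return the slots visited by the request and by the acknowledgment.

    Slots are numbered 1..num_slots. The request moves one slot at a time
    towards the receiver; the acknowledgment retraces the same path.
    """
    if not (1 <= sender <= num_slots and 1 <= receiver <= num_slots):
        raise ValueError("slot index out of range")
    if sender == receiver:
        raise ValueError("sender and receiver must be in different slots")
    step = 1 if receiver > sender else -1
    # Request: S_k -> S_k+1 -> ... -> receiver
    request_hops = list(range(sender + step, receiver + step, step))
    # Acknowledgment: receiver -> ... -> S_k+1 -> S_k
    ack_hops = list(range(receiver - step, sender - step, -step))
    return request_hops, ack_hops
```

A request from slot 2 to slot 5, for example, visits slots 3, 4 and 5, and the acknowledgment then visits slots 4, 3 and 2.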
[Figure 3.3: FPGA with macro-slots S1 to S6, RMB cross points CP1 to CP6 on the Reconfigurable Multiple Bus (RMB), and attached SRAM banks.]
Figure 3.3.: ESM slot architecture with six macro-slots (S1, S2, ... S6). In order
to allow access to the RMB crosspoints (CP) and SRAM banks, one
macro-slot consists of three micro-slots. Physically, one micro-slot
occupies exactly four CLB columns.
Each module that receives an acknowledgment sets its switch to connect two
line segments. Upon receiving the acknowledgment, the sender can start the
communication (circuit routing). The wired and latency-free connection is then
active until an explicit release signal is issued by the sender module. The concept of an RMB was first presented in [110] and extended later in [109] with
a compaction mechanism for quickly finding a free segment. However, it has
never been implemented in real hardware.
In our implementation [111] of an RMB architecture on Xilinx Virtex FPGAs,
we separated the RMB switches from the modules. In this way, we provide
a uniform interface to designers for connecting modules to the multiple line
switches. The implementation of the RMB structure on a Virtex-II 6000 FPGA
with four processors and four parallel 16 bit lines reveals an area overhead
of 4% with a frequency of 120 MHz for the RMB controller [107]. Here, we
have summarized area and data speed numbers in terms of a) different numbers
of modules, b) different numbers of parallel bus segments, and c) bit widths of
each bus segment. Special bus-macros are used at the boundary of modules and
controllers to ensure a correct operation upon reconfiguration.
We were able to show that a module reconfiguration can take place column-wise
at the same time that other modules are communicating on the chip without any
signal interference. This is possible by storing the states of the RMB switches in
regions of BlockRAM that are physically unaffected by partial reconfiguration.
Communication via the Crossbar
Another possibility of establishing a communication
among modules is to use the crossbar. Because all the modules are
connected to the crossbar via the pins at the south of the FPGA, the communication between two modules can be set up in the crossbar as well.
Communication Costs
The ESM platform supports the four mentioned communication
schemes for inter-module communication. Each approach has its
own properties, such as maximum bandwidth, signal delay and setup latency.
The RMB is the only scheme that has a varying setup latency, which is the product
of the number of RMB elements to the destination and the setup time of four clock
cycles. Using bus-macros for communication is the preferred choice, but it only
works for adjacent modules. The maximum bandwidth in all communication
schemes depends on the clock speed and the data width. We assume for the ESM
a global clock speed of 50 MHz. All properties are listed in Table 3.1.
Communication scheme   Data Bandwidth   Latency      Setup
Bus-macro              19.2 Gbit/s      2 ns         none
RMB                    6.4 Gbit/s       3 ns * CP    4 cycles * CP
Crossbar               1.8 Gbit/s       15 ns        18 cycles
SRAM                   0.4 Gbit/s       20 ns        2 cycles
Table 3.1.: Theoretical data bandwidth and signal latency for the four supported
communication schemes. Variable CP denotes the number of RMB
Cross Points that are traversed.
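To make the CP-dependent entries of Table 3.1 concrete, the following sketch converts the setup cycles into nanoseconds at the assumed global clock of 50 MHz. The function names are illustrative; the constants are taken from the table and the text above.

```python
CLOCK_HZ = 50_000_000                  # global ESM clock assumed in the text
CYCLE_NS = 1_000_000_000 // CLOCK_HZ   # 20 ns per clock cycle

# Setup cycles from Table 3.1; the RMB entry is per traversed cross point (CP).
RMB_SETUP_CYCLES_PER_CP = 4
CROSSBAR_SETUP_CYCLES = 18
SRAM_SETUP_CYCLES = 2


def rmb_setup_ns(cross_points: int) -> int:
    """Setup latency of an RMB connection over the given number of cross points."""
    return RMB_SETUP_CYCLES_PER_CP * cross_points * CYCLE_NS


def rmb_signal_latency_ns(cross_points: int) -> int:
    """Signal latency of an established RMB connection (3 ns per cross point)."""
    return 3 * cross_points


def crossbar_setup_ns() -> int:
    """Fixed setup latency of a crossbar connection."""
    return CROSSBAR_SETUP_CYCLES * CYCLE_NS
```

Under these numbers, an RMB connection across five cross points needs 400 ns of setup, slightly more than the fixed 360 ns of the crossbar, while a connection across four cross points needs 320 ns.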
3.3. Implemented Architecture
The Erlangen Slot Machine was designed as a two-board solution, consisting
of Babyboard and Motherboard. The separation of the ESM into a Babyboard
and a Motherboard was made to simplify the adaptation of the ESM platform to
other application domains such as automotive. To do so, only a new
Motherboard needs to be designed with different peripherals such as CAN, LIN and
FlexRay controllers, and A/D and D/A converters. The Babyboard design can
remain unchanged and can be used with different application-domain-specific
Motherboards.
The schematic of the whole ESM platform is illustrated in Figure 3.4 and its
two-board implementation is shown in Figure 3.5. The reconfiguration manager
(RCM) is implemented in a Spartan-IIE 400 FPGA which is connected to a 64
MByte flash device on the Babyboard. Six SRAM banks, two MByte each, are
attached to the north side of the FPGA. They provide memory space to six
macro-slots (denoted as S1 to S6 in Figure 3.3) for temporary data storage. The
SRAMs can also be used for shared memory communication between neighboring
modules, e.g., for streaming applications. They are connected to the FPGA in
such a way that the reconfiguration of a given module will not affect the SRAM access
of other modules. Debugging capabilities are offered through general purpose
I/O provided at regular distances between the micro-slots.
[Figure 3.4: block diagram. Labels: Babyboard, Motherboard, SRAM (×6), Flash, Main FPGA, CPLD, RCM FPGA, EPP, Crossbar FPGA, VideoOut FPGA, PowerPC, SDRAM, S-Video, CVBS, Audio, VGA, DVI, Ethernet, USB, Serial, Debug I/O.]
Figure 3.4.: Schematic diagram of the ESM showing the implemented two-board
solution with an FPGA Babyboard and a supporting Motherboard.
A JTAG port provides debug capabilities for the main FPGA, the Spartan-II
FPGA and a CPLD. All technical data sheets as well as software, primer
applications, and additional information are available at the following website:
http://www12.informatik.uni-erlangen.de/research/esm.
An actual picture of the two ESM boards is shown in Figure 3.5.
Figure 3.5.: ESM implementation of the FPGA Babyboard and the supporting
Motherboard. On top of the Motherboard sits the Babyboard with
the Virtex-II 6000 FPGA. Additional technical data and examples
are available at http://www.r-space.de.
Slot Arrangement
The main FPGA of the ESM is organized into twenty-two so-called
micro-slots with twelve I/O-pins each, as the Virtex-II FPGA can only
be reconfigured column-wise. This is shown in Figure 3.6. Because the left and
right slots of the FPGA are connected to dedicated I/Os, one micro-slot on
both the left and the right side of the FPGA is excluded from use in partially
reconfigurable designs. As the middle CLB columns are connected to external
clock lines, two micro-slots in the middle of the device are also excluded. Three
micro-slots can be grouped logically into one macro-slot in order to allow access
to the RMB and SRAM banks. An overview is shown in Figure 3.3 and the
resulting micro-slot architecture is shown in Figure 3.6. Due to the incorporation
of BlockRAM and multipliers, the Virtex-II FPGA architecture from Xilinx is
divided into columns. Each BlockRAM block occupies a whole column in the
device; the XC2V6000 FPGA has six such columns that are spread over the device.
Therefore, only macro-slots one and four contain one BlockRAM column.
Slots A to V denote micro-slots that provide the module and reconfiguration
granularity. Three consecutive micro-slots define a macro-slot. Each macro-slot
(S1 to S6) can access one full external SRAM bank. In terms of slice count,
a micro-slot occupies 1536 slices (4 CLB columns) on the FPGA. Six micro-slots are highlighted as they contain BlockRAM cells. Slots A, K, L and V are
special micro-slots, as slots A and V interface external pins and slots K and L contain
BlockRAM.
Figure 3.6.: Slot architecture of the main FPGA with macro-slots built from
micro-slots.
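The slot arithmetic of this arrangement can be checked with a few lines of code. The grouping of the usable micro-slots into S1 to S6 in alphabetical order is an assumption derived from the statement that three consecutive micro-slots form a macro-slot; the constant and function names are illustrative. Four slices per CLB is the standard Virtex-II figure.

```python
import string
from typing import Dict, List

CLB_ROWS = 96
CLB_COLUMNS = 88
COLUMNS_PER_MICRO_SLOT = 4
SLICES_PER_CLB = 4  # a Virtex-II CLB contains four slices

# 88 CLB columns / 4 columns per micro-slot = 22 micro-slots, named A to V.
MICRO_SLOTS: List[str] = list(string.ascii_uppercase[:CLB_COLUMNS // COLUMNS_PER_MICRO_SLOT])

# Micro-slots excluded from partially reconfigurable designs (edges and middle).
SPECIAL_SLOTS = {"A", "K", "L", "V"}

SLICES_PER_MICRO_SLOT = CLB_ROWS * COLUMNS_PER_MICRO_SLOT * SLICES_PER_CLB


def macro_slots() -> Dict[str, List[str]]:
    """Group the usable micro-slots into macro-slots of three (assumed ordering)."""
    usable = [s for s in MICRO_SLOTS if s not in SPECIAL_SLOTS]
    return {f"S{i // 3 + 1}": usable[i:i + 3] for i in range(0, len(usable), 3)}
```

With these assumptions, the 18 usable micro-slots yield exactly six macro-slots, and the slice count per micro-slot matches the 1536 slices quoted above.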
Reconfiguration Overhead
In a Virtex-II FPGA, the smallest configurable unit
that may be reconfigured is a frame, which covers the whole height of the FPGA.
One CLB column consists of 22 frames. The frame length depends on the number of CLB rows of the FPGA. The Virtex-II 6000 FPGA consists of 96 rows by
88 columns of CLBs. Hence, each frame has 246 words for this specific FPGA
device. The reconfiguration manager of the ESM uses the SelectMAP interface
for programming the main FPGA. The 8 bit bus width of the SelectMAP interface and a maximum frequency of 50 MHz have to be taken into account. The
duration of the reconfiguration process for one CLB column is thus
$$246~\text{words (each 32 bit)} \times 22~\text{frames} \times 4~\text{clock cycles} \times 20~\text{ns} \approx 433~\mu\text{s}.$$
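The calculation above can be reproduced programmatically. The sketch below uses only the figures given in the text; the function name and parameter names are illustrative, and scaling the result linearly to a micro-slot of four CLB columns is an assumption that ignores any command overhead and the different frame counts of non-CLB columns.

```python
def column_reconfig_time_us(
    words_per_frame: int = 246,
    frames_per_column: int = 22,
    word_bits: int = 32,
    bus_bits: int = 8,
    clock_hz: int = 50_000_000,
) -> float:
    """Time in microseconds to write one CLB column over SelectMAP."""
    cycles_per_word = word_bits // bus_bits  # 4 cycles on an 8 bit bus
    total_cycles = words_per_frame * frames_per_column * cycles_per_word
    return total_cycles * 1e6 / clock_hz


one_column_us = column_reconfig_time_us()   # about 433 microseconds
one_micro_slot_us = 4 * one_column_us       # four CLB columns, about 1.7 ms
```

The per-column value of roughly 433 µs and the resulting 1.7 ms per micro-slot are in line with the millisecond range targeted for the reconfiguration manager.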
3.4. The Babyboard
The reconfigurable engine of the Erlangen Slot Machine consists of a Xilinx Virtex-II 6000 FPGA, several SRAMs, a reconfiguration FPGA and flash memory.
They are all placed on a high-density printed circuit board (PCB). This PCB is
called the Babyboard and has four connectors to the Motherboard. Due to the
restriction¹ in the reconfiguration process of Virtex-II FPGAs, the architecture
has been optimized to solve the major problems of partially reconfigurable
hardware platforms, namely:
Solving the I/O-pin dilemma:
Run-time placement of modules on a reconfigurable
device, in this case the FPGA, is done by downloading a partial bitstream that implements the module on the FPGA. This requires a relocation
that places a module in a location different from the one for which it was synthesized.
Relocation can be done only if all the resources are available in the designated
placement area and structured in the same way as at compile-time.
This also includes the I/O-pins used by the module. For example, a module
compiled for slot 0 might then be allocated to slot 3 at run-time. We solved
the I/O-pin dilemma on the ESM by avoiding fixed connections of peripherals
to the FPGA. As shown in Figure 3.7, all the bottom pins from the FPGA are
connected to an interface controller realizing a crossbar, itself implemented
using a Xilinx Spartan-II FPGA. At run-time, the crossbar connects FPGA
pins to peripherals automatically based on the slot position of a placed module.
This I/O-pin rerouting is done without reconfiguration of the crossbar FPGA. The solution is to implement configuration registers in the crossbar
which can be read and written at module load time by the PowerPC microprocessor located on the Motherboard. This makes it possible to establish any
connection from one module to peripherals dynamically.
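The register-based rerouting can be pictured with a toy model. The real register layout of the crossbar is not given in the text, so everything in this sketch (one routing register per micro-slot, the class `CrossbarModel`, the helper `on_module_load` and the peripheral names) is hypothetical and only illustrates the idea of writing configuration registers at module load time.

```python
from typing import Dict, Iterable, Optional


class CrossbarModel:
    """Toy model: one routing register per slot, holding a peripheral name."""

    def __init__(self, slots: Iterable[str]) -> None:
        self.reg: Dict[str, Optional[str]] = {slot: None for slot in slots}

    def write(self, slot: str, peripheral: Optional[str]) -> None:
        """Write a routing register (done by the control processor)."""
        if slot not in self.reg:
            raise KeyError(f"unknown slot {slot!r}")
        self.reg[slot] = peripheral

    def read(self, slot: str) -> Optional[str]:
        """Read back the peripheral currently routed to a slot."""
        return self.reg[slot]


def on_module_load(xbar: CrossbarModel, slot: str, peripheral: str) -> None:
    """Route a peripheral to the slot a module was placed in.

    Any earlier route of the same peripheral is released first, so the
    peripheral follows the module when it is relocated.
    """
    for other, routed in xbar.reg.items():
        if routed == peripheral:
            xbar.write(other, None)
    xbar.write(slot, peripheral)
```

In this model, relocating a module only changes register contents; the crossbar logic itself stays fixed, which corresponds to rerouting without reconfiguring the crossbar FPGA.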
Solving the memory dilemma:
Memory is very important in applications like
video streaming, in which a given module must have exclusive access to one picture at
a time for computation. However, as we mentioned earlier, the capacity of the
available BlockRAMs in FPGAs is limited. External SRAM memory is therefore
added to allow storage of large amounts of data by each module. To allow a
module to exclusively access its external memory bank, six SRAM banks are
connected at the north border of the FPGA. In this way, a module will connect
to peripherals from the south, while the north will be used for temporarily storing
computation data. According to the physical layout of the six memory banks
which are connected to the top I/O pins, the FPGA device is divided into a
set of elementary slots called micro-slots A to V, as shown in Figure 3.6. In
order to use an SRAM bank in the north, a module must have a width of at
least three micro-slots (creating slots S1 to S6).
¹ The reconfiguration can be done only in chunks of full columns.
[Figure 3.7: block diagram of the Babyboard. Labels: SRAM (×6), Flash, Main FPGA, CPLD, RCM FPGA, EPP, Debug I/O, connection to the crossbar FPGA on the Motherboard.]
Figure 3.7.: The main components of the Babyboard are the main FPGA for user
applications, a Reconfiguration Manager (RCM) FPGA for configuration management, and a CPLD for the initialization routines after
power-up.
The basic layout of the Babyboard is depicted in Figure 3.7. Applications
contain tasks that each consist of a module request. They are located on the
main FPGA, a Xilinx Virtex-II 6000 device. This FPGA is connected to six
SRAM memories, two Megabytes each, which can be accessed independently
by different applications. Each of the twenty-two micro-slots has its own debug
I/O pin that is externally accessible. These twenty-two Debug I/O signals can
be used for visualization, as a dedicated debug interface or as a special interface,
e.g., for a CAN bus module. The main FPGA is linked to the Motherboard via
264 bits clustered in so-called I/O signals.
The implemented ESM Babyboard is shown in Figure 3.8. It contains the following components on a 12-layer PCB: the main FPGA, a Xilinx Virtex-II 6000
(1); the reconfiguration manager FPGA, used to control the flash memory and the
reconfiguration process of the main FPGA (2); 64 MByte of flash memory,
used to store full and partial bitstreams for the main FPGA (3); a CPLD, used to
initialize the flash memory and the reconfiguration manager (RCM) at start-up
(4); six SRAM banks with a size of 2 MByte each (5); an optional SO-DIMM
memory socket for DDR memory; and external debug I/Os for independent
status monitoring of partially reconfigurable modules.
Figure 3.8.: The ESM Babyboard and its components.
The Reconfiguration Manager, hereinafter referred to as RCM, enables the dynamic
reconfiguration ability of the ESM platform. It programs the main FPGA
via the SelectMAP interface with the bitfiles stored in the attached flash memory. The RCM is controlled and supplied with bitfiles by the PowerPC
microprocessor mounted on the Motherboard. Additionally, it provides an Enhanced
Parallel Port (EPP) interface which also allows external control and debugging
of the reconfiguration process.
The CPLD runs the initialization routines for the board upon power-up, e.g.,
programming a PLL for the required clocks and configuring the RCM with a
start-up configuration stored in flash memory. After the power-up of the
reconfiguration FPGA has completed, the CPLD goes into an idle state and disconnects
from the flash memory bus, which is then controlled by the RCM.
3.4.1. Main FPGA
The actual workhorse of the ESM is a Xilinx Virtex-II 6000, called main FPGA.
This FPGA has an almost homogeneous CLB area of 96 × 88 CLBs, as shown
in Figure 3.6, and was the only readily available FPGA of its size at the time
of the first design of the ESM architecture. With more than 88,000 logic cells,
scores of applications can be implemented in hardware [25].
On the ESM the FPGA's logic area is divided into twenty-two micro-slots to
enable an organized reconfiguration and relocation of hardware modules [99,
100]. Each slot spans the same area and has the same amount of I/Os at its
disposal, four control and eight data bits. Additionally, there are six special
micro-slots which contain 24 BlockRAM elements each. The arrangement can
be seen in Figure 3.6. It also depicts the clustering into the already mentioned
macro-slots. This coarse partition is intended for more extensive modules which
need a large amount of memory. For these purposes, six asynchronous SRAMs
are attached at the top of the main FPGA.
The hardware modules on the main FPGA have different options to communicate with other modules. To maximize placement flexibility, the Crossbar can
establish point-to-point connections between any I/O groups.
Bus-macros are the first choice for high data rate communication between hardware tasks. They are instantiated along the module borders and allow adjacent modules to communicate directly with each other. However, this implies
preassigned positions according to the design choice made during the compile
time of each hardware module. A solution for this issue is presented in [107, 46],
the Reconfigurable Multiple Bus on Chip. The mentioned use of the SRAMs as
shared memory is technically equal to bus-macro communication except for the
inherent buffering and the overhead for each read/write access.
To implement an application on this device, the top-level VHDL entity, or the corresponding Verilog module, should have ports for the I/Os listed in Table 3.2. When
developing modules for partial reconfiguration, the Xilinx Early Access Partial
Reconfiguration user guide [95, 21] should be followed.
Width       Port             Description
22x4 bits   Scheduling_A-V   Scheduling flags for each Micro Slot
22x8 bits   XbarIO_A-V       Data I/O for each Micro Slot
6x20 bits   sram_a           SRAM address, shared with two chips
6x8 bits    sram_d           SRAM data, shared with two chips
6x2 bits    sram_oe_neg      SRAM output enable, for each chip
6x2 bits    sram_we_neg      SRAM write enable, for each chip
6x1 bit     sram_ce_neg      SRAM chip enable, shared with two chips
22 bits     debug_io         For LEDs or special adaptors like a CAN controller
6 clocks    clk              PLL clock signals
Table 3.2.: Interface of the main FPGA
3.4.2. The Reconfiguration Manager
The ESM platform requires an operating system for the initialization of executable application modules and their run-time supervision. The main tasks of
such an operating system are:
• scheduling of application modules,
• management of free slots including slot segmentation and partitioning,
• loading, unloading and relocation of application modules into slots,
• configuration of peripheral devices,
• configuration of the crossbar, and
• bitstream management.
In our view, the most time-critical operations must be executed in hardware
in order to keep the reconfiguration time at a minimum. We consider loading,
unloading and relocation of modules to be the most time-critical tasks, which
are therefore implemented in a dedicated hardware Reconfiguration Manager
(RCM) [43, 42]. All other system tasks can be implemented in software and executed on the PowerPC embedded microprocessor that is mounted on the ESM
Motherboard. These two parts of the operating system are linked via a simple
communication bus as shown in Figure 3.1. This hardware/software interface
between the RCM and the PowerPC is realized through a set of elementary
reconfiguration instructions passed from the PowerPC to the reconfiguration
manager on the Babyboard using memory-mapped I/O communication. The
benefit of this communication method is a simple read/write access to a range
of memory addresses which are physically located inside the Crossbar FPGA.
In its basic form, the reconfiguration manager must implement the following minimal set of elementary instructions:
• LOAD: load bitstreams to their pre-compiled position,
• UNLOAD: unload bitstreams to deactivate a running module,
• RELOCATE_AND_LOAD: relocate bitstreams to a different slot position before loading.
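As an illustration only, the passing of elementary instructions through a memory-mapped register window can be sketched in Python. The opcode values, the register offsets and the operand names (module_id, target_slot) are assumptions made for this sketch; the actual encoding used on the ESM is not specified here.

```python
from enum import IntEnum


class RcmInstruction(IntEnum):
    """Elementary RCM instructions; the numeric opcodes are hypothetical."""
    LOAD = 1
    UNLOAD = 2
    RELOCATE_AND_LOAD = 3


class MappedRegisters:
    """Software stand-in for the address window inside the Crossbar FPGA."""

    def __init__(self):
        self.words = {}

    def write(self, offset, value):
        # Registers are modelled as 32 bit words.
        self.words[offset] = value & 0xFFFFFFFF

    def read(self, offset):
        return self.words.get(offset, 0)


# Hypothetical register offsets within the mapped window.
REG_OPCODE, REG_MODULE_ID, REG_TARGET_SLOT = 0x0, 0x4, 0x8


def issue(regs, instruction, module_id, target_slot=0):
    """Write the operands first and the opcode last, so that the opcode
    write could serve as the trigger on the receiving side."""
    regs.write(REG_MODULE_ID, module_id)
    regs.write(REG_TARGET_SLOT, target_slot)
    regs.write(REG_OPCODE, int(instruction))


regs = MappedRegisters()
issue(regs, RcmInstruction.RELOCATE_AND_LOAD, module_id=7, target_slot=3)
print(RcmInstruction(regs.read(REG_OPCODE)).name)
```

The point of the sketch is that the sender only performs ordinary reads and writes on addresses; no dedicated protocol stack is required on the PowerPC side.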
Reconfiguration Architecture
Apart from the main FPGA, the Babyboard also contains the configuration circuitry. This consists of a CPLD, a configuration FPGA implementing the reconfiguration management and a flash memory device, as shown in Figure 3.7. A Xilinx Spartan-IIE 400 device implements the Reconfiguration Manager. It comprises a Xilinx MicroBlaze processor to control various I/O modules, e.g. the flash memory or the SelectMAP interface of the main FPGA.
• The CPLD is used to download the Spartan-IIE FPGA configuration from the Flash upon power-up. It also contains board initialization routines for the on-board PLL and the Flash.
• The reconfiguration management is implemented on the Spartan-IIE FPGA. It is also responsible for the configuration of the main FPGA during power-up and run-time. This device also contains a circuit to perform module relocation while loading a new partial module bitstream.
• The Flash provides a capacity of 64 MByte, thus typically enabling the storage of up to 30 full configurations or of a few hundred partial module bitstreams.
During normal operation, bitstream data is loaded from the flash memory into the main FPGA through the SelectMAP interface (see Figure 3.7). However, bitstreams must first be downloaded from a host PC and then stored in the flash memory device. Here, two methods are supported. The first method uses a parallel port interface implemented inside the reconfiguration manager to download the configuration data from a host PC to the flash memory. The second method uses the Ethernet port of the PowerPC microprocessor to download bitstreams from a remote host. In order to support these and also many other reconfiguration scenarios, we developed an extensible, plug-in based reconfiguration manager architecture that will be described next.
Flexible Plug-in Architecture
Our first implementation had a block-oriented reconfiguration manager and consisted of a simple state machine which controlled all interfaces and operated on byte blocks. These data blocks, 512 bytes each, correspond to the page size of the flash memory device. For each primitive operation on a data block, an instruction had to be processed. When one data block was written from flash into the Virtex-II SelectMAP interface, two instructions had to be processed. First, the data block was read in 512 cycles from the flash device and written to an internal scratch pad. Then, the second instruction was read and the data block from the scratch pad was written to the SelectMAP interface. As all instructions were executed sequentially, the maximum upload speed of a bitstream to the FPGA was slowed down by a factor of two, due to the exclusive access to the scratch pad.
However, the main problem with this architecture arose when extensions were to be added to the reconfiguration manager. If, for example, an error correcting code (ECC) plug-in and a decompression plug-in are used additionally, then the throughput of the reconfiguration manager will be decreased by a factor of six. This is due to four additional instructions that are needed to read and write the internal scratch pad. This initial scenario is illustrated in Figure 3.9 b). An additional maintenance issue is the global finite state machine itself. Its code base had to be changed every time a new plug-in was added or removed.
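The slowdown factors quoted for the block-oriented design follow from counting scratch pad instructions per data block. The following Python sketch reproduces this count; it assumes, as a simplification, that every instruction takes as long as one block read from flash.

```python
def scratch_pad_instructions(num_plugins):
    """Instructions needed to move one data block from flash to SelectMAP.

    Two instructions are always required (flash -> scratch pad and
    scratch pad -> SelectMAP). Each plug-in adds two more: one to read the
    block from the scratch pad and one to write the result back.
    """
    if num_plugins < 0:
        raise ValueError("num_plugins must not be negative")
    return 2 + 2 * num_plugins


def relative_throughput(num_plugins):
    """Throughput relative to the raw flash interface (1.0 = flash speed),
    assuming all instructions take equally long and run sequentially."""
    return 1.0 / scratch_pad_instructions(num_plugins)


for plugins in (0, 1, 2):
    print(plugins, scratch_pad_instructions(plugins), relative_throughput(plugins))
```

With no plug-ins the count is two (the factor of two mentioned above), and with two plug-ins such as ECC and decompression it is six.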
Figure 3.9.: Simple reconfiguration manager architecture. a) Flash, state machine with scratch pad, Virtex2. b) Flash, state machine with scratch pad, ECC, Relocator, Virtex2.
Clearly, this first block-oriented architecture is not suitable for a high-performance solution, since the throughput decreases with every newly attached plug-in. The main bottleneck is not the flash interface but the scratch pad-oriented data flow combined with the sequential execution of each instruction.
Based on these consolidated findings, we propose a novel architecture for the reconfiguration manager which can upload bitstreams into the FPGA at the speed of the flash interface. The central scratch pad was eliminated and replaced by a pipelined data flow architecture. Moreover, a) the finite state machine was replaced by a MicroBlaze microcontroller [85], and b) a data crossbar is employed between plug-ins to enable customizable communication paths. This new architecture is depicted in Figure 3.10. The crossbar plug-in shown in this figure is a communication interface between the RCM software running on the MicroBlaze controller and the ESM Motherboard with its PowerPC microprocessor shown in Figure 3.1. The RCM software controller receives its instructions and new bitstreams from the PowerPC through the crossbar communication plug-in.
All plug-in modules are connected to two communication interfaces: The control bus connects plug-ins to the MicroBlaze for initialization and control. The data crossbar connects to the data input and output ports of each plug-in. The setup of the data crossbar is also controlled by the MicroBlaze software and can be dynamically changed during run-time.
In order to upload a hardware module from flash to the FPGA, the following sequence of steps has to be performed:
1. A command is sent to the MicroBlaze to upload a bitstream to the FPGA without the use of any other plug-ins.
2. The program running on the MicroBlaze connects the output of the flash plug-in to the Virtex-II plug-in input through a write into the configuration register of the data crossbar.
3. Next, this program initializes the flash plug-in with the start address and length of the bitstream.
4. Then, the program enables the SelectMAP interface in the Virtex-II plug-in.
5. Finally, the flash plug-in is enabled and starts to read the bitstream.
6. The flash plug-in sends the bitstream to the Virtex-II plug-in byte by byte as long as its ready signal is true (if not, the flash plug-in has to wait).
7. While the flash and the Virtex-II plug-in are running in parallel, the MicroBlaze checks periodically if any of the plug-ins has finished its operation.
8. Only after finishing one command can the MicroBlaze execute a new command and, for example, reinitialize the plug-ins and the data crossbar.
If one load command has been executed and another load follows, then the procedure starts from the second step, because the data crossbar has already been set.
The addition of plug-ins to the reconfiguration manager is simple. Any new module must have a fixed control bus interface and a fixed data crossbar interface. With these standard interfaces, the plug-in can be directly controlled through the MicroBlaze assembly program. The data crossbar uses a parametrized HDL description which can be configured at design-time to the number of actually instantiated plug-ins.
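The upload sequence and the fixed plug-in interfaces can be modelled in software as follows. This is a behavioural Python sketch, not the HDL or MicroBlaze code of the ESM: the class and method names (Plugin, configure, process, DataCrossbar) and the setting selectmap_enabled are hypothetical, configure() stands in for control bus writes, and Python generators stand in for the byte-wise pipelined data flow.

```python
class Plugin:
    """Fixed plug-in interface: configure() stands for the control bus,
    process() for the data crossbar input and output ports."""

    name = "plugin"

    def __init__(self):
        self.settings = {}

    def configure(self, **settings):
        self.settings.update(settings)

    def process(self, stream):
        # Default behaviour: forward every byte unchanged.
        yield from stream


class FlashPlugin(Plugin):
    name = "flash"

    def __init__(self, memory):
        super().__init__()
        self.memory = memory

    def process(self, stream):
        # A source plug-in ignores its input and reads from flash instead.
        start = self.settings["start"]
        length = self.settings["length"]
        yield from self.memory[start:start + length]


class Virtex2Plugin(Plugin):
    name = "virtex2"

    def __init__(self):
        super().__init__()
        self.received = bytearray()

    def process(self, stream):
        if not self.settings.get("selectmap_enabled"):
            raise RuntimeError("SelectMAP interface is not enabled")
        for byte in stream:
            self.received.append(byte)  # models one SelectMAP byte write
            yield byte


class DataCrossbar:
    """Connects the output of each plug-in to the input of the next one."""

    def __init__(self, plugins):
        self.plugins = {p.name: p for p in plugins}
        self.route = []

    def connect(self, *names):
        for plugin_name in names:
            if plugin_name not in self.plugins:
                raise KeyError(plugin_name)
        self.route = list(names)

    def run(self):
        stream = iter(())
        for plugin_name in self.route:
            stream = self.plugins[plugin_name].process(stream)
        return sum(1 for _ in stream)  # drain the pipeline, count the bytes


flash_image = bytes(range(32))
flash, virtex2 = FlashPlugin(flash_image), Virtex2Plugin()
crossbar = DataCrossbar([flash, virtex2])
crossbar.connect("flash", "virtex2")        # step 2: set up the data crossbar
flash.configure(start=8, length=16)         # step 3: start address and length
virtex2.configure(selectmap_enabled=True)   # step 4: enable SelectMAP
transferred = crossbar.run()                # steps 5 to 7: stream the bytes
print(transferred)
```

Because every plug-in exposes the same two methods, a new plug-in only has to subclass Plugin; the crossbar model itself does not change, which mirrors the maintenance argument made above.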
Workload Scenarios
Depending on the operating system requirements, different operations need to be performed on each bitstream. Before the bitstream is uploaded to the FPGA, it can pass through any number of additional plug-ins. The order in which a bitstream passes the plug-ins is configurable at run-time through the setup of the data crossbar switch. This allows a flexible pre-processing of the bitstream prior to being loaded. Only the number of available plug-ins in the reconfiguration manager has to be determined at design-time.
Figure 3.10.: Architecture of the ESM reconfiguration manager with plug-ins such as Flash, ECC, module relocator and other possible plug-ins. (Diagram labels: External I/O, Flash, ECC, Relocator, MicroBlaze, Virtex2, Crossbar, Control Bus, Data Crossbar.)
Based on the introduced reconfiguration manager architecture from Figure 3.10, several flows are possible. Some of these are depicted in Figure 3.11. In the first scenario, only a basic upload of a bitstream is performed. Therefore, the data flows from the flash plug-in output directly through the data crossbar to the Virtex-II plug-in input. If an error correction is needed, then the flash output data can be sent to the ECC plug-in before going to the Virtex-II plug-in. This case is shown in Figure 3.11 b). In the third scenario, the bitstream is read from the flash, error-corrected and relocated before being sent to the Virtex-II plug-in for upload (see Figure 3.11 c)). Here, the crossbar is configured by the microprocessor in such a way that the output of each plug-in is connected to the input of its neighboring plug-in. The fourth scenario depicted in Figure 3.11 d) shows how the bitstream data is delivered by the PowerPC through the Motherboard crossbar. The bitstream is subsequently error-corrected and relocated prior to its upload.
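The four scenarios differ only in the route that is programmed into the data crossbar. As a sketch, each scenario can be written as an ordered list of plug-ins, from which the required output-to-input links follow directly. The plug-in identifiers and the helper functions below are illustrative names, not part of the ESM software.

```python
# Routes of the four workload scenarios, source first, Virtex-II plug-in last.
SCENARIOS = {
    "a": ["flash", "virtex2"],
    "b": ["flash", "ecc", "virtex2"],
    "c": ["flash", "ecc", "relocator", "virtex2"],
    "d": ["crossbar", "ecc", "relocator", "virtex2"],
}


def crossbar_links(route):
    """Return the (output, input) pairs the data crossbar has to connect."""
    return list(zip(route, route[1:]))


def check_route(route):
    """Basic sanity checks: no plug-in appears twice and the bitstream
    ends up in the Virtex-II plug-in."""
    if len(set(route)) != len(route):
        raise ValueError("a plug-in may appear only once in a route")
    if not route or route[-1] != "virtex2":
        raise ValueError("the route must end at the Virtex-II plug-in")
    return True


for label, route in SCENARIOS.items():
    check_route(route)
    print(label, crossbar_links(route))
```

Scenario c), for instance, yields the links flash to ecc, ecc to relocator, and relocator to virtex2, which is the neighbor-to-neighbor configuration described above.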
The plug-ins that are currently implemented for the reconfiguration manager are: ECC plug-in, decompression plug-in and a relocator plug-in which can translate a bitstream on the fly to any slot location on the FPGA by directly manipulating the address offsets in the bitstream at load-time.
Figure 3.11.: Four different workload scenarios for the reconfiguration manager. (Panels a) to d); diagram labels in each panel: MicroBlaze, Flash, ECC, Relocator, Virtex2, Crossbar, Control Bus, Data Crossbar.)
The reconfiguration manager was implemented and consists of the MicroBlaze microcontroller, parallel port interface plug-in, flash memory interface plug-in, Virtex-II SelectMAP plug-in, an OPB (on-chip peripheral bus) interface implementing the control bus and the data crossbar. The control bus is a 32 bit OPB bus, while the data crossbar is an 8 bit full duplex crossbar.
3.5. The Motherboard
The Motherboard of the ESM platform, as illustrated in Figure 3.12, provides programmable links from the FPGA to all multimedia and communication peripherals, such as USB, Ethernet, Video Input and Output, and Audio I/Os. It also links the PowerPC with the RCM for reconfiguration actions. The PowerPC is the main controller of the ESM system and runs Linux. Its memory bus is connected directly to the crossbar for memory-mapped communication with the reconfiguration manager on the Babyboard.
Figure 3.12.: The main component of the Motherboard is the Crossbar FPGA which connects all peripherals, PowerPC, and Video-Out FPGA with the main FPGA on the Babyboard. (Diagram labels: To RCM FPGA, To main FPGA, Crossbar FPGA, Video-Out FPGA, SDRAM, PowerPC, Flash, S-Video, CVBS, Audio, VGA, DVI, Ethernet, USB, Serial.)
The physical connections are established at run-time through a programmable crossbar implemented on a Spartan-IIE FPGA device on the Motherboard. Video capture and rendering interfaces as well as high-speed communication links are also located on the Motherboard. The Babyboard is mounted through four connectors on top of the Motherboard. An embedded Linux [112] has been adopted to run on the PowerPC microprocessor (MPC875), which is the core of the ESM Motherboard. It is used to control the complete system. In particular, it manages the data flow from the peripheral I/O interfaces to the Babyboard as well as the interfaces to the external world, e.g., Ethernet and USB. Upon start-up, one can log in to the ESM just as on a full Linux-based computer system. The PowerPC of the ESM is used for application development or for testing and the control of the partial reconfiguration process of the main FPGA on the Babyboard, e.g., operating system functions for module management. The printed circuit board implementation is shown in detail in Figure 3.13.
Figure 3.13.: The ESM Motherboard and its components.
As already mentioned, the Motherboard supplies the peripherals for the Babyboard and can be adapted for different domains like automotive or home entertainment. The current implementation already provides various I/O ports, not all of which are implemented yet, for instance Audio and DVI. The crossbar FPGA manages the connections between peripheral I/O devices on the Motherboard and the main FPGA on the Babyboard. Further details and information about the SAA7113H video input processor will be given later on. A special peripheral is the Video-Out, as it is implemented within a separate FPGA. This allows different graphics I/Os to be handled while using only a few pins of the Crossbar FPGA. This is an important feature, as the crossbar FPGA is I/O limited due to the high number of connections going to the main Virtex-II 6000 FPGA.
The PowerPC provides the main control unit of the ESM. It is operated by a customized embedded Linux which can be accessed by a serial terminal or a remote login via Ethernet. It handles the specific connections of the Crossbar, loads bitfiles to the RCM flash, initiates reconfigurations of the main FPGA, and cooperates with the latter via scheduling flags and the Hardware-Software Communication. The ESM Motherboard and its components are shown in Figure 3.13: crossbar FPGA (1), four high-density connectors to the Babyboard (2), MPC875 PowerPC microprocessor (3), PowerPC's SDRAM main memory (4), PowerPC's flash memory (5), video-out FPGA (a), DVI output (b), S-Video output (c), video output (d), VGA output (e), first video input (f), S-Video input (g), second video input (h), 100 Mbit Ethernet connected to the PowerPC (i), Mini-USB (j), two IEEE1394 ports connected to the crossbar (k), AC97 audio in and out ports (l).
3.5.1. PowerPC
Primarily, the embedded Freescale MPC875 PowerPC microprocessor is the major control unit of the ESM. This gives the ESM the added possibility to write software applications for testing and implementing scheduling, module placement and module relocation. However, it can also be used for software application development and as a processing resource in a hardware-software co-design application. The microprocessor operates at a maximum frequency of 133 MHz and contains a data and an instruction cache of 8 KB each. On the ESM it has access to 64 MB of SDRAM and 16 MB of non-volatile flash memory. As listed in Table 3.3, the processor bus is also used as an interface to the crossbar. This is done by connecting the address, data and control signals of the bus to the crossbar FPGA.
Width    Pin                          Description
4 bits   nCS[2], RD_nWR, nTA, nRESET  Control lines, notChipSelect2, Read_notWrite, notTransferAcknowledge, and notReset
32 bits  MPC_A                        Address, shared with Crossbar, Flash, and SDRAM
32 bits  MPC_D                        Data, shared with Crossbar, Flash, and SDRAM
16 bits  misc.                        Connected but unused at Crossbar
clock    CLKOUT                       PowerPC clock signal
2 bits   I2C                          I2C bus for configuration
Table 3.3.: Interface between the PowerPC and the Crossbar FPGA
The crossbar FPGA is assigned a memory range that is mapped to internal FPGA registers. This allows the PowerPC to read and write crossbar data like any other external memory address.
To bring up the system, a boot loader is stored in the internal EEPROM via the BDM interface. For this, the ESM uses U-Boot, an open source boot loader [113]. It sets the PowerPC frequency, initializes the SDRAM and loads a ramdisk with the customized Linux from the flash.
3.5.2. Crossbar
The crossbar FPGA switches connections between modules on the main FPGA or connects them with peripheral devices like Video-In or the PowerPC. Table 3.4 lists the signal interface of the crossbar FPGA. The crossbar itself consists of five units, as shown in Figure 3.14: crossbar interconnect, video capture, PPCcom, HWSWcom, and RCMcom modules. The PPCcom module provides a low-level communication interface between the crossbar FPGA and the PowerPC microprocessor. The HWSWcom and RCMcom modules provide a communication interface to partial modules on the main FPGA and a communication channel to the reconfiguration manager FPGA. The Motherboard also provides audio I/O and an additional SDRAM for the crossbar. The functionality of the crossbar module will be explained next.
Width      Port                                                 Description
22x4 bits  Scheduling_A-V                                       Scheduling flags for each micro-slot
22x8 bits  XbarIO_A-V                                           Data I/O for each micro-slot
8 bits     RCM data I/O                                         RCM control by the PowerPC
32 bits    fpga_bus                                             RGB graphic output and control signals
8 bits     video_vpo_in                                         VPO bus with YCbCr data
1 bit      video_en                                             SAA7113H chip enable
clock      clock_video                                          27 MHz video clock
clock      clock                                                25 MHz VGA clock by PLL
3 bits     ppc_nChipSelect2, RD_nWR, ppc_nTransferAcknowledge   PowerPC Processor Local Bus signals
32 bits    ppc_address                                          Shared with PowerPC, Flash, and SDRAM
32 bits    ppc_data                                             Shared with PowerPC, Flash, and SDRAM
clock      ppc_clkout                                           PowerPC clock signal
9 bits     nWE, nCAS, CKE, nRAS, nCS, BA0, BA1, LDQM, UDQM      Crossbar-SDRAM control signals
13 bits    RAM_A                                                Crossbar-SDRAM address
16 bits    RAM_D                                                Crossbar-SDRAM data
clock      RAM_CLK                                              Crossbar-SDRAM clock signal
2 bits     I2C                                                  I2C bus for configuration
Table 3.4.: Signal interface of the Crossbar FPGA.
Crossbar Interconnect
The main task of signal switching is performed by the crossbar module, which is controlled in software by the PowerPC. The crossbar connections can be classified into four groups. The first signal group consists of twenty-two I/O signal groups, each 8 bit wide, which are connected to the I/O pins of the main FPGA. The second signal group connects to the video-input signals. The next signal group connects the crossbar with the Video-Out FPGA to output RGB data. The last one is the hardware-software communication interface, called HWSWcom, which allows the PowerPC to send and receive data from any connected module on the main FPGA. The data flow inside the crossbar FPGA is depicted in Figure 3.14.
To control the crossbar interconnect via the PowerPC, an ESM shell has been implemented providing the following commands:
• cb_reset: removes all crossbar connections
• cb_connect: connects peripheral I/O pin groups with partial module I/Os
• cb_disconnect: disconnects peripheral I/O pin groups from partial module I/Os
• cb_list_connections: displays all active crossbar connections
To keep the complexity of the crossbar manageable, not every single bit permutation but only groups of four successive bits can be switched. For example, to connect a 10 bit video signal with the main FPGA I/O pin groups H and I, the following commands are used:
$ esmshell
> cb_reset
> cb_connect DEINTERLACE81-H5 DEINTERLACE82-H6 DEINTERLACE83-H7 DEINTERLACE84-H8
> cb_connect DEINTERLACE85-H9 DEINTERLACE86-H10 DEINTERLACE87-H11 DEINTERLACE88-H12
> cb_connect DEINTERLACE89-I5 DEINTERLACE90-I6 DEINTERLACE91-I7 DEINTERLACE92-I8
Listing 3.1: Example of the ESM-Shell command line interface used to set the crossbar I/Os at run-time.
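Since connections are switched in groups of four, command sequences like the one in Listing 3.1 are regular enough to be generated. The following Python helper is a hypothetical convenience script, not part of the ESM shell; it assumes that a connection is written as source and target joined by a hyphen, as in the listing.

```python
def cb_connect_commands(sources, targets, group_size=4):
    """Build cb_connect command lines, one per group of four connections."""
    if len(sources) != len(targets):
        raise ValueError("every source needs exactly one target")
    if len(sources) % group_size != 0:
        raise ValueError("connections are switched in groups of four")
    pairs = [f"{src}-{dst}" for src, dst in zip(sources, targets)]
    return [
        "cb_connect " + " ".join(pairs[i:i + group_size])
        for i in range(0, len(pairs), group_size)
    ]


# Signals and pins taken from Listing 3.1: twelve connections, i.e. three groups.
sources = [f"DEINTERLACE{n}" for n in range(81, 93)]
targets = [f"H{n}" for n in range(5, 13)] + [f"I{n}" for n in range(5, 9)]

for line in cb_connect_commands(sources, targets):
    print(line)
```

Note that the 10 bit signal occupies three groups of four, so twelve connections are issued in total.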
Hardware-Software Communication Interface
The PPCcom is the basic communication interface for PowerPC interaction. It interprets the incoming addresses and reads or writes the corresponding data. For instance, on address 0xD0000320 either a register is read or written depending on the control signals. Just like other peripherals, the HWSWcom can be connected to any I/O group of the main FPGA by the crossbar module. The HWSWcom module uses a byte-serial transmission. Hence, only 8 bits are occupied, although logically the PowerPC uses a 32 bit word size. That saves I/O pin overhead for small partial modules on the main FPGA.
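The byte-serial transmission can be illustrated with a short Python sketch that splits a 32 bit word into four transfers on an 8 bit bus and reassembles it on the receiving side. The byte order (most significant byte first) is an assumption of this sketch; the order used by HWSWcom is not stated here.

```python
def word_to_bytes(word):
    """Split a 32 bit word into four bytes, most significant byte first
    (the byte order is assumed for illustration)."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit into 32 bits")
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def bytes_to_word(data):
    """Reassemble four received bytes into one 32 bit word."""
    if len(data) != 4:
        raise ValueError("exactly four bytes are required")
    word = 0
    for byte in data:
        word = (word << 8) | (byte & 0xFF)
    return word


serial = word_to_bytes(0x12345678)
print([hex(b) for b in serial])
print(hex(bytes_to_word(serial)))
```

Four transfers per word are the price for occupying only one 8 bit I/O group of a partial module instead of four.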
Figure 3.14.: Internal data flow structure of the crossbar FPGA with the currently implemented units and associated signals. (Diagram labels: Virtex-II 6000 FPGA with 22 data buses, each 8 bit wide; Video-Out FPGA; RCM FPGA; Crossbar; HWSWcom; VideoIn; Video-In SAA7113; RCMcom; PPCcom; PowerPC data and address bus.)
The PPCcom module can directly access the configuration registers of the Crossbar module, which are used to program the requested connections between the main FPGA and the peripheral devices.
The last submodule is RCMcom. This entity handles the data exchange between the PowerPC and the FPGA hosting the Reconfiguration Manager (RCM), see Figure 3.10, to control the reconfiguration process and to load configuration bitstreams into flash memory.
3.5.3. Video Input
The designed Motherboard was built to support the domain of video streaming applications. The ESM has an analog composite video connector handled by a Philips SAA7113H video input processor [114]. It converts PAL, NTSC, and SECAM formats into a digital component video signal and sends it to the crossbar. The PAL video format defines 50 interlaced half-frames per second. The first half-frame holds the even lines and the second half-frame the odd ones. Overlaying these two half-frames creates a full frame picture; this process is called de-interlacing. It creates a full frame rate of 25 frames per second from a stream of 50 interlaced half-frames per second.
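The overlaying of two half-frames can be sketched in Python as a simple line interleaving (often called weave de-interlacing). The sketch treats a half-frame as a list of lines and counts lines from zero, so that the first half-frame supplies lines 0, 2, 4 and so on; this indexing convention is an assumption made for illustration.

```python
def weave(even_field, odd_field):
    """Combine two half-frames into one full frame by interleaving lines.

    even_field holds lines 0, 2, 4, ... and odd_field lines 1, 3, 5, ...
    """
    if len(even_field) != len(odd_field):
        raise ValueError("both half-frames must have the same number of lines")
    frame = []
    for even_line, odd_line in zip(even_field, odd_field):
        frame.append(even_line)
        frame.append(odd_line)
    return frame


# Two half-frames of 288 lines each give one PAL frame of 576 lines.
even = [f"line {2 * i}" for i in range(288)]
odd = [f"line {2 * i + 1}" for i in range(288)]
frame = weave(even, odd)
print(len(frame), frame[:4])
```

Two consecutive half-frames are consumed per output frame, which is why 50 half-frames per second yield 25 full frames per second.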
The video input processor's output format is 720x576 pixels in the YCbCr 4:2:2 color model. This means that there is a luma value (Y) defining the brightness for every pixel, but the color information (Cb, Cr) is shared by two successive pixels; the color information is sub-sampled by two. The images are interlaced. The SAA7113H is connected to the Crossbar and the I2C bus. For detailed information see [114] and [115].
As already stated, the video input processor sends a component video signal. But the video will be processed and displayed by a computer; thus, a conversion to the VGA format is needed. That function is also included in the VideoIn module, which also configures the SAA7113H device at start-up. It reduces the resolution to 640x480 pixels by cropping the overlapping edges at the right and bottom and converts the YCbCr stream into pairs of 24 bit RGB pixels, because the YCbCr format codes the color information jointly for two successive pixels. The pixel clock is changed to 25 MHz according to the VGA output mode 640x480 @ 60 Hz. As a result, a pixel pair and its coordinates are transferred at 12.5 MHz to the crossbar module. The video frames still have to be buffered for deinterlacing. This function is implemented with the help of two SRAM banks on the main FPGA.
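The conversion of one YCbCr 4:2:2 sample group into a pair of RGB pixels can be sketched as follows. The coefficients are the commonly used ITU-R BT.601 values for 8 bit studio-range video; that the VideoIn module uses exactly these coefficients and this rounding is an assumption of the sketch.

```python
def clamp(value):
    """Round and limit a channel value to the 8 bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    """Convert one pixel from studio-range YCbCr (BT.601) to 8 bit RGB."""
    c = 1.164 * (y - 16)
    r = c + 1.596 * (cr - 128)
    g = c - 0.392 * (cb - 128) - 0.813 * (cr - 128)
    b = c + 2.017 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


def convert_pair(y0, y1, cb, cr):
    """In 4:2:2, two successive pixels have their own luma values but
    share one Cb/Cr pair, so one sample group yields two RGB pixels."""
    return ycbcr_to_rgb(y0, cb, cr), ycbcr_to_rgb(y1, cb, cr)


black, white = convert_pair(16, 235, 128, 128)
print(black, white)
```

Processing pixels in pairs is also what halves the transfer rate: at a 25 MHz pixel clock, pixel pairs arrive at 12.5 MHz.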
3.5.4. Video Output
The output of the different video formats like VGA, DVI, TV-Out is controlled through a separate device, the Video-Out FPGA. It is a smaller Spartan-IIE 400 device that is also connected to two 8 MB SDRAMs which are used as frame buffer. They hold the output image, as every video frame is displayed multiple times depending on the monitor refresh rate. For instance, with 25 FPS and 60 Hz, each video frame is displayed 2.4 times on average.
After processing on the main FPGA, each video frame is transferred through the crossbar to the Video-Out FPGA. The bus connecting the Crossbar FPGA and the Video-Out FPGA is 32 bits wide. The image has to be sent as a progressive pixel stream with 24 bits for the RGB data and two of the remaining eight bits for the control signals line_begin and frame_begin.
Another approach is to use the Video-Out FPGA to control the readout of the deinterlacing buffers on the main FPGA. The images are read on-the-fly and directly sent to the RAMDAC via Crossbar and Video-Out FPGA.
Both implementations use the VGA resolution of 640x480 pixels with a refresh rate of 60 Hz. The resulting pixel clock, already mentioned for the video capture module, is calculated as follows:
\[ \text{Pixel clock} = \frac{\text{HorizRes} \cdot \text{VertRes}}{\text{RetraceFactor}} \cdot \text{RefreshRate} \]
The Retrace Factor defines the time ratio the display is blanked during each frame. This is needed for the retrace of the electron beam of cathode-ray tubes. Further information can be found in [116] and [10]. The Retrace Factor is already included in the resolution factors:
\[ \text{Pixel clock} = (640 + 160) \cdot (480 + 44) \cdot 60\,\text{Hz} = 800 \cdot 524 \cdot 60\,\text{Hz} = 25.152\,\text{MHz} \]
The horizontal active lines are followed by blank pixels including the horizontal sync pulse for the monitor. Together with the vertical sync pulse after a complete frame, the resolution and refresh rate can be determined by the monitor. The blank region before the sync pulse is called front porch, the region after back porch. This modeling results in a pixel rate of about 25 MHz.
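The pixel clock arithmetic can be checked with a few lines of Python. The function and parameter names are chosen for this sketch; the blanking values of 160 pixels and 44 lines are the ones used in the calculation above.

```python
def pixel_clock_hz(h_active, h_blank, v_active, v_blank, refresh_hz):
    """Pixel clock for a display mode, with blanking counted in pixels
    per line (h_blank) and lines per frame (v_blank)."""
    return (h_active + h_blank) * (v_active + v_blank) * refresh_hz


clock = pixel_clock_hz(640, 160, 480, 44, 60)
print(clock / 1e6, "MHz")

# Share of the total frame raster that carries visible pixels.
active_share = (640 * 480) / ((640 + 160) * (480 + 44))
print(round(active_share, 3))

# Average number of times a 25 FPS video frame is shown at 60 Hz.
repeats = 60 / 25
print(repeats)
```

Dividing the visible resolution by the active share gives the same total of 800 by 524 raster positions per frame, which is how the two forms of the formula relate.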
4. Development of Partially Reconfigurable Modules
4.1. Introduction
This chapter will describe the supporting framework for the ESM platform [45]. The goal is to automate the development process and to provide tested and reusable hardware and software interfaces for application developers. Moreover, the framework provides a guideline on how to develop a hardware design that takes advantage of partial reconfiguration modules (PRMs) [21].
In the first part of the chapter, the standard design flow for partial bitstream generation will be described. In its current form, the partial design flow from Xilinx [21] is based on a shell script that controls the partial module generation after synthesis. This script assumes that the hardware design has already been transformed to a partial design and is synthesized. However, most hardware designs are not written with partial reconfiguration in mind, and the transformation process for the communication with a partial module requires the insertion of special macros into the top-level HDL design. In this case, the top-level design file has to be rewritten extensively. Moreover, rules and constraints specific to the partial design flow must be implemented. The chance for manual errors is high during this process, as many new signals and constraints are introduced. Based on this experience, the transformation of a standard design into a partial design can be automated. This idea resulted in the so-called SlotComposer tool that transforms a standard design into a partial design and generates an automated partial design flow for this project. The second part of this chapter describes SlotComposer's automated design flow.
After the bitstream generation, partial reconfiguration modules must be stored in local memory on the reconfigurable platform. Then each module can be loaded on demand or according to a given schedule into the FPGA. The control of the reconfiguration process as well as module storage handling is implemented in an operating system framework which is presented in the third part of this chapter. The basis of the operating system framework is an embedded Linux on the PowerPC microprocessor. The open source approach makes it possible to take advantage of existing drivers for various peripheral devices. This is an important feature, as the ESM's Ethernet and USB interface chips are not directly supported by the embedded Linux [112, 113] distribution used.
Another important feature of the operating system framework is to provide standard software APIs for software modules to connect to the underlying hardware. These APIs include the Linux drivers for low-level hardware access and a standard library for communication and reconfiguration control. Based on these APIs, core operating system functions like floorplanning and scheduling of FPGA slots can be implemented. Having C-based libraries mandates an edit and compile flow for any changes made to the software application, which is not suitable for interactive testing. For debugging purposes, an interactive shell application was implemented. It allows all API functions to be invoked directly from the command line of a terminal session. As the main operating system is an embedded Linux OS with full network connectivity, this interactive shell makes it possible to control and test partial reconfiguration modules even from a remote host.
In the last part of this chapter we will describe a platform-independent benchmark approach for partial reconfiguration. The goal is to build a methodology for measuring the operating system's overhead during partial reconfiguration. We present a generic and customizable concept for the development and prototyping of dynamically reconfigurable hardware tasks that makes it possible to study and compare different scheduling and allocation techniques on different FPGA-based platforms.
4.2. Partial Design Flow
To implement a partial reconfiguration design successfully, you have to follow a strict design methodology presented by Xilinx [21]. The guidelines to follow are:
• Insert bus-macros between modules that need to be swapped out (called partial reconfiguration modules, or PRMs) and the rest of the design (static logic).
• Follow synthesis guidelines to generate a partially reconfigurable netlist.
• Floorplan the PR Modules and cluster all static modules.
• Place all in and out signals of a PR Module in bus-macros.
• Follow PR-specific design rules.
• Run the partial reconfiguration implementation flow.
To illustrate the difference between the static base region and the Partially Reconfigurable Regions (PRR), a simple partial design is shown in Figure 4.1. The base or static region is the portion of the design that does not change during the partial reconfiguration process.
Figure 4.1.: Partial reconfigurable design with a single partial reconfigurable region, PR Region A. Partial reconfiguration modules PRM A1, A2, A3 can be loaded into PR Region A. All PRMs of the same PR Region must have the same communication interface but there are no constraints on what logic is implemented inside the module. (Diagram labels: FPGA, Base Region, PRR A; PRM A1, A2 and A3 with partial bitstreams A1.bit, A2.bit and A3.bit.)
PR regions contain logic that can be reconfigured independently of the base region and other partial reconfigurable regions. This logic is called Partial Reconfigurable Module (PRM). The shape, size and location of each PR region are defined by the user through a range constraint. Each PR region has one or, usually, multiple partially reconfigurable modules (PRMs) that can be loaded into the corresponding PR region and share the same communication interface. Each partial module is designed and implemented separately using the partial design flow. The slot terminology used with the ESM architecture refers to a partial reconfigurable region. Similarly, the term hardware task or module refers to a partial reconfiguration module.
In application note 290 [105], Xilinx presents the design flow for building reconfigurable designs based on their Modular Design. This flow allows the reconfiguration of entire columns of Configurable Logic Blocks (CLB) and does not support static routes through reconfigurable areas. In a new version of the PR design flow [21], reconfigurable modules may span any rectangular area of an FPGA and static routing can pass through reconfigurable modules.
The design flow for partial reconfiguration is shown in Figure 4.2 and consists of several steps:
• First, the HDL design description has to be divided into static and partial logic. The top design can only contain signals, bus-macros, I/Os, clock primitives, static and partial module instantiations. No static logic is allowed inside the top design. All input and output signals of a partial module must pass through a bus-macro.
• In the second step, design constraints are set for place and route. In addition to timing constraints, PR designs must be constrained with Area Group, Area Group range, Location and Mode constraints. The Area Group constraints must be defined for each reconfigurable region and for the static part of the design. They clearly separate the static design from the logic inside PR Modules. The Area Group range constraints define the shape, size and position of each reconfigurable region. The Mode constraint must also be set for each reconfigurable region to prevent unexpanded block errors during base and PR module implementation. The Location (LOC) constraint must be set for all I/O pins, clock primitives and bus-macros. Bus-macros must be located so that they touch the boundary between the PR region and the base design.
• The third step in the PR design flow is not required but is recommended before moving to the PR design implementation. This step implements the design in the non-PR flow and is important for placement analysis. It helps to determine the best Area Group range and bus-macro locations. The Mode constraint should be removed during this step.
• The next step is to analyze both the timing and the placement of the design. This analysis is needed to establish the best PR region shape, size and location. Timing analysis is used to find paths that fail the constraints. Such failures can result from a suboptimal location or shape of the PR region. Wrong or ineffective bus-macro placement is also a frequent source of problems.
• In step five, the base design is implemented. During base design implementation, the synthesized top level of the design is merged with the static part of the design and a static.used file is generated. This file contains a list of routes within the PR regions that cannot be used by PR modules because they are required by the static part of the base design.
• The sixth step implements each PR module separately within its own directory hierarchy. If the static.used file changes, then every PR module must be reimplemented.
• In the final step of the PR design flow, the top, static and partial modules are merged to build the complete design. A partial bitstream is created for each PR module, together with one full bitstream for the PR module merged with the base design.
The standard approach to the partial reconfiguration design flow is to write a script file that implements all steps. The second step, setting the constraints, will always remain manual, as it captures the designer's input.
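Besides ordering the steps, a flow script of this kind has to decide what must be rebuilt. The following Python sketch illustrates only the dependency rule from step six, namely that every PR module is reimplemented when static.used changes. The function names, the use of a SHA-256 digest to detect changes and the module names are assumptions made for illustration; the sketch does not invoke any Xilinx tools.

```python
import hashlib


def digest(content: bytes) -> str:
    """Fingerprint of the static.used content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def modules_to_reimplement(old_static_used, new_static_used, all_modules, changed_modules):
    """Decide which PR modules need a new place-and-route run.

    If static.used differs from the previous run (or no previous run exists),
    the routes reserved for the static design may have changed, so every PR
    module is reimplemented. Otherwise only modules whose own sources changed
    are rebuilt.
    """
    if old_static_used is None or digest(old_static_used) != digest(new_static_used):
        return list(all_modules)
    return [m for m in all_modules if m in changed_modules]


# Hypothetical module names, used only to show the call.
modules = ["filter_a", "filter_b", "blank"]
unchanged = modules_to_reimplement(b"route 1\n", b"route 1\n", modules, {"filter_b"})
changed = modules_to_reimplement(b"route 1\n", b"route 2\n", modules, set())
```

In the first call only the edited module is selected; in the second call all modules are selected because the reserved static routes differ.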
A partial automation of the Xilinx design flow can be achieved with the hierarchical floorplanning and design tool PlanAhead [106, 117, 118, 28]. The PlanAhead software has a graphical user interface that allows the designer to define and change the size and shape of the partial modules. It also allows changing the placement of internal FPGA resources, such as registers or BlockRAMs, when timing constraints are violated because of partial module boundaries. Additionally, PlanAhead performs automatic design rule checking and can generate a partial bitstream for the design.
Figure 4.2.: The Partial Reconfiguration design flow consists of seven steps. HDL design description and synthesis is the first step. The constraint step (2) can be refined after the optional non-PR implementation (place and route) step (3) of the top-level design. The main sources of problems are violations of Area Group (AG) constraints. The implement base design step (5) combines bus-macros, the static part and I/O constraints in a base design. In step six, all PR modules are placed and routed within their Area Group constraints. The merge step (7) creates the bitstreams for the base design and all PR modules.
However, PlanAhead has some drawbacks. First, it does not support multiple implementations for one PR region [119], and the designer still has to create a new project for each PR module. Second, to generate partially reconfigurable designs using PlanAhead, all input VHDL files have to be synthesized manually. PlanAhead needs the static netlist files as well as the bus-macros and the constraints as input, so a synthesis project has to be created for each file. The flow executes in three phases: Initial Budgeting, Active Module Implementation and Assembly, which correspond to steps five, six and seven shown in Figure 4.2. Additionally, area group and location constraints can be set in the graphical user interface.
In the Initial Budgeting phase, PlanAhead performs the translate, map and place & route steps only for the static components. This produces a design whose PR module areas contain no logic, together with information about wires routed through the PR region (in the file static.used). In the Active Module Implementation phase, translate, map and place & route are carried out for one implementation of each PR region. In a final step, the Assembly phase, the PR modules are merged and the bitstreams are generated. To produce more than one bitstream for a PR region, another PlanAhead project containing one set of not yet implemented PR modules has to be created. Additionally, the static.used file has to be manually copied into the project.
Using scripts or PlanAhead to automate the partial reconfiguration flow is helpful. However, neither approach provides any support in the HDL design or constraint step. The HDL design for the PR design flow requires explicit insertion of bus-macros in the HDL top-level file. Writing designs that contain several PR modules with many input/output signals requires the instantiation of many bus-macros and even more intermediate signals. Moreover, all bus-macros must be constrained to a correct location. Our experience has shown that these steps are time-consuming and a source of placement errors.
4.3. The SlotComposer
SlotComposer is a tool developed for the automated bitstream generation of partial modules. Moreover, SlotComposer converts a standard VHDL design to a partial design by modifying the top-level design file and by generating new constraint files for the PR design flow. It then generates all design flow scripts and the infrastructure needed to generate standard and partial bitstreams. Using the existing Xilinx Partial Reconfiguration Tool Flow, SlotComposer inserts a bus-macro on each input and output signal of a partial module. Its user interface is depicted in Figures 4.4 and 4.5.
Based on the user's specification, SlotComposer can also connect partially reconfigurable modules to the Reconfigurable Multiple Bus or the Crossbar interface. At the same time, it generates all necessary constraint files and optimizes the usage and placement of bus-macros. Moreover, all required scripts for the synthesis of all components and the PR flow are generated, as shown in Figure 4.3.
To use the PR flow from Xilinx, the hardware design has to follow a specific file structure. The top-level design file instantiates the static and all partial reconfiguration modules (PRMs), global logic, I/O ports and clock primitives. Furthermore, the top-level design file also describes the inter-module communication pattern. SlotComposer eases the transformation to the PR design flow by automatically instantiating the communication structures in the top-level HDL design file.
[Figure 4.3: SlotComposer takes the PR modules, the top-level design, the constraint file and the bus-macros as input and performs five steps: 1) insertion of bus-macros in the top-level VHDL, 2) partial region shape extension, 3) location of bus-macros, 4) creation of the project directory structure, 5) generation of design flow scripts. Its outputs are a new top-level design file, a new constraint file, and the partial and base bitstreams.]
Figure 4.3.: Based on a modular design, SlotComposer automatically inserts and places bus-macros inside the top-level VHDL design. Bus-macros are correctly connected between static and partial modules. The shape of a partial module can be changed to create valid locations for bus-macros. Then a new project directory structure is created together with the partial design script for partial and base bitstream generation.
To minimize the number of resources, SlotComposer packs as many signals as possible into each bus-macro. As bus-macros can only be placed between adjacent modules, SlotComposer can adjust the placement of modules to meet this requirement. If the boundary between two neighboring modules is not long enough to insert all required bus-macros, SlotComposer can change the size of the reconfigurable region to satisfy this condition and modifies the UCF file accordingly.
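The packing and boundary check described above can be sketched as follows. The capacity of 8 signals per bus-macro and the boundary length of 2 CLB rows per bus-macro are assumptions chosen for illustration; they are not taken from SlotComposer, and the function names are hypothetical.

```python
import math

SIGNALS_PER_BUS_MACRO = 8  # assumed capacity of one bus-macro
ROWS_PER_BUS_MACRO = 2     # assumed boundary length (CLB rows) used by one bus-macro


def bus_macros_needed(num_signals: int) -> int:
    """Dense packing: the minimum number of bus-macros for num_signals."""
    return math.ceil(num_signals / SIGNALS_PER_BUS_MACRO)


def required_boundary_rows(num_signals: int) -> int:
    """Boundary length needed to place all bus-macros next to each other."""
    return bus_macros_needed(num_signals) * ROWS_PER_BUS_MACRO


def adjusted_region_height(current_rows: int, num_signals: int) -> int:
    """Grow the region only if the shared boundary is too short."""
    return max(current_rows, required_boundary_rows(num_signals))
```

Under these assumptions, a module with 17 boundary signals needs three bus-macros, so a region that shares only four rows with its neighbor would be extended to six rows.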
Special tags have to be included as comments in the HDL file for SlotComposer to recognize the partially reconfigurable regions. For example, if the design has two reconfigurable area groups, then the label 'PR Module 1' is inserted before the component instantiation of the partial module for the first reconfigurable area group and 'PR Module 2' before the module for the second area group. In order to instantiate the correct bus-macros in the top-level design, the position of each area group must be provided in the UCF file.
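To illustrate how such tags could be recognized, the following Python sketch scans a VHDL top-level file for the comment tags and records the instance label that follows each tag. The exact tag syntax (a VHDL comment of the form "-- PR Module N"), the regular expressions and the instance names in the sample are assumptions; the actual parser used by SlotComposer is not described here.

```python
import re

# Assumed tag syntax: a VHDL comment line such as "-- PR Module 1".
TAG = re.compile(r"^\s*--\s*PR Module\s+(\d+)\s*$")
# A labelled instantiation such as "u_filter : filter_core".
INSTANCE = re.compile(r"^\s*(\w+)\s*:\s*\w+")


def find_tagged_instances(vhdl_text: str) -> dict:
    """Map each area group number to the instance label following its tag."""
    result = {}
    pending = None  # area group number of the most recent unmatched tag
    for line in vhdl_text.splitlines():
        tag = TAG.match(line)
        if tag:
            pending = int(tag.group(1))
            continue
        inst = INSTANCE.match(line)
        if pending is not None and inst:
            result[pending] = inst.group(1)
            pending = None
    return result


# Hypothetical top-level fragment.
SAMPLE = """
  -- PR Module 1
  u_filter : filter_core
    port map (clk => clk);
  -- PR Module 2
  u_mixer : mixer_core
    port map (clk => clk);
"""
tagged = find_tagged_instances(SAMPLE)
```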
Figure 4.4.: The SlotComposer application converts modular VHDL designs into partial designs. After the selection of the project directory, user constraints file, FPGA device type and bus-macros, the project can be converted to adhere to the PR design flow.
As the PR flow is defined by Xilinx and subject to further changes, SlotComposer uses a template engine for the generation of the PR flow scripts. If the PR flow changes, then only the template files for the partial design flow must be changed and not the tool itself.
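The benefit of keeping the flow in templates can be shown with a minimal Python sketch based on the standard library class string.Template. The template text, the placeholder command run_par and the directory layout are hypothetical and do not reproduce the templates shipped with SlotComposer or any real Xilinx command line.

```python
from string import Template

# Hypothetical template for one PR module build step. "run_par" is a
# placeholder command, not a real Xilinx tool invocation.
MODULE_STEP = Template(
    "cd $module_dir && run_par --static-used $static_used $module\n"
)


def render_flow_script(modules, static_used="../base/static.used"):
    """Render one script line per PR module from the template."""
    lines = []
    for name in modules:
        lines.append(
            MODULE_STEP.substitute(
                module=name,
                module_dir=f"pr_modules/{name}",
                static_used=static_used,
            )
        )
    return "".join(lines)


script = render_flow_script(["filter_a", "filter_b"])
```

If the vendor flow changes, only the template string has to be edited; the rendering function stays the same.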
SlotComposer was tested successfully with three modular applications without bus-macros. Bus-macros were automatically inserted in the top-level VHDL file. The generated PR flow script successfully built all partial and static bitstreams. These applications included:
• XUP Color Counter is a demo project for the Xilinx XUP Virtex II Pro Development System [120]. The Color Counter application consists of five static modules, each of them with its own Area Group property, and one partially reconfigurable region with two different partially reconfigurable modules. The reconfigurable region is connected to one of the static modules via three horizontal bus-macros.
• The second design uses the Reconfigurable Multiple Bus (RMB) as a communication interface among four partially reconfigurable regions. Each region has one reconfigurable module. Each of the reconfigurable regions is connected to the RMB via a group of vertical bus-macros.
• Video filter is an ESM application composed of a static part and one PR region with partially reconfigurable filter modules. The PR region has an Area Group property, while the static part of the design can be specified either with or without an AG property. When the static part of the design has no Area Group property, bus-macros are placed on the left boundary of the reconfigurable region. The static Area Group range is created in the lower left corner of the FPGA device and extended in the horizontal and vertical directions until it touches the PR region. All bus-macros are then placed on the boundary.
4.4. Operating System Framework
Hardware tasks denote partially reconfigurable modules with an additional control interface. The Operating System Framework implements a working environment for the development of hardware tasks. It builds an abstraction layer hierarchy for software as well as hardware application development on the ESM platform. Operating system tasks like partial module instantiation, run-time module relocation, reconfiguration scheduling, and inter-module communication management can be implemented on a microprocessor, as these tasks are control oriented and have a low computational density. A hardware implementation of the operating system is possible, but it lacks run-time flexibility, occupies valuable hardware resources and provides no substantial performance benefit.
The management of partial modules on the FPGA, of configuration data and of module execution requests is performed in software. The corresponding functionality of the ESM software framework is depicted in Figure 4.6. The ESM Motherboard and the ESM Babyboard are the basis for this Operating System Framework [49]. On top of the ESM boards, hardware designs or firmware for the reconfiguration manager, crossbar and hardware task framework are implemented. They are loaded only once but can easily be replaced by variants during system start-up.
Figure 4.5.: The SlotComposer application converts modular VHDL designs into partial designs. This window of SlotComposer shows one static module on the left and three partial modules on the right side. Bus-macros are shown as small boxes connecting these modules together. The absolute placement of bus-macros and all modules is represented by the grid position measured in slices.
The reconfiguration manager, as described in Section 3.4.2, is implemented as a custom MicroBlaze architecture containing modular accelerators for memory transfers, bitfile reallocation and configuration management. The crossbar module is responsible for the communication between Motherboard and Babyboard as well as for the control of the FPGA's I/O connections to peripheral devices like video and audio I/O. Partial reconfiguration modules or hardware application tasks are built upon the hardware task framework and represent the run-time adaptable part of an application. The hardware task framework provides a control interface in hardware for each partially reconfigurable module (PRM). Each PRM that implements this control interface is called a hardware task.
[Figure 4.6: layered stack with the following blocks as listed in the figure: Reconfigurable Application (ESM Shell, Custom Scheduler & Placer, Hardware Tasks); Scheduling & Placement Framework; ESM Devices Layer (ESMDL); Hardware Task Framework; Reconfiguration Manager HW-Module; Crossbar HW-Module; ESM Platform: Virtex-II FPGA, Crossbar FPGA, External I/O Devices.]
Figure 4.6.: Firmware stack developed for the Erlangen Slot Machine. A reconfigurable application running on the ESM includes a custom scheduler and placer as well as a pool of hardware tasks. Hardware tasks are partially reconfigurable modules with an additional control interface.
The software layer consists of the Linux kernel drivers at the bottom and is built around a modified version of the DENX ELDK Linux 2.4 kernel [112] and the U-Boot bootloader [113]. The Linux kernel drivers are responsible for low-level hardware communication with the crossbar through memory-mapped I/O. They also store and manage the current status of the crossbar and the reconfiguration manager. The software API is based on these Linux kernel drivers and is a C-based library containing scheduling, placement, hardware reconfiguration and management functions for the ESM platform.
The crossbar provides multiple communication channels for any kind of data transfer between the main FPGA on one side and the PowerPC microprocessor or peripheral devices on the other side.
For testing purposes, a shell-like application has been created. This ESM-Shell encapsulates the C-based API functions into scriptable commands analogous to the Bash shell. This opens up the possibility of shell-script-based testing or the direct manipulation of the ESM system during run-time.
The first group of ESM shell commands is used to set or reset internal crossbar connections. Before a partial module is loaded into the main FPGA, all I/O connections relevant to the new partial module must be set. All functions available in the ESM shell are also implemented in the software API of the middleware layer.
• pin_connect connectionlist : connect all pins in the list (example: pin_connect D1-VGAB1 D2-VGAB2 D3-VGAB3 D4-VGAB4)
• pin_disconnect connectionlist : disconnect all pins in the list (example: pin_disconnect D1-VGAB1 D2-VGAB2 D3-VGAB3 D4-VGAB4)
• get_pin_connections : returns a list of all connected pins with their connection partners
• reset_crossbar : delete all crossbar connections.
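The semantics of these four commands can be mimicked with a small Python model of the connection table. The sketch assumes that each list entry such as D1-VGAB1 names two pins separated by a hyphen; the class name and the dictionary representation are illustrative and are not part of the ESM software.

```python
class CrossbarModel:
    """Toy model of the crossbar connection table (not the ESM driver)."""

    def __init__(self):
        self._links = {}  # source pin -> destination pin

    def pin_connect(self, *connections):
        """Connect every 'SRC-DST' pair in the list."""
        for item in connections:
            src, dst = item.split("-", 1)
            self._links[src] = dst

    def pin_disconnect(self, *connections):
        """Remove every listed pair that is currently connected."""
        for item in connections:
            src, dst = item.split("-", 1)
            if self._links.get(src) == dst:
                del self._links[src]

    def get_pin_connections(self):
        """Return all connected pins with their connection partners."""
        return sorted(f"{s}-{d}" for s, d in self._links.items())

    def reset_crossbar(self):
        """Delete all connections."""
        self._links.clear()


xb = CrossbarModel()
xb.pin_connect("D1-VGAB1", "D2-VGAB2")
xb.pin_disconnect("D1-VGAB1")
remaining = xb.get_pin_connections()
```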
The second group of ESM shell commands implements basic reconfiguration management functions.
• rcm_write_flash address [length] bitfile : write a bitfile to flash memory at address address. The length definition is optional.
• rcm_write_fpga bitfile : write a bitfile directly to the FPGA.
• rcm_write_fpga_from_flash address length : write a bitstream located at flash position address and length bytes long to the FPGA.
• rcm_read_flash address length bitfile : read length bytes of flash memory at address address and write them to the file bitfile.
• rcm_write_flash_reloc address [length] bitfile reloc_offset : write a bitfile to flash memory at address address and relocate it by reloc_offset. The length definition is optional.
• rcm_write_fpga_reloc bitfile reloc_offset : write a bitfile directly to the FPGA and relocate it by reloc_offset.
• rcm_write_fpga_from_flash_reloc address length reloc_offset : write a bitstream located at flash position address and length bytes long to the FPGA and relocate it by reloc_offset.
• rcm_read_flash_reloc address length bitfile reloc_offset : read length bytes of flash memory at address address, relocate them by reloc_offset and write them to the file bitfile.
• rcm_erase_flash_block address length : erase flash block(s) beginning at address for length bytes.
• rcm_reset_modules FLASH|FPGA|FLASH_AND_FPGA : resets either the flash module of the RCM, the FPGA module or both.
• rcm_read_status : print status information of the RCM.
The last group of commands can be used to read and write registers of the crossbar communication interface. They are useful to verify the currently used communication settings between the crossbar and the PowerPC microprocessor.
• hwswcom_setbit bytenumber bitnumber : set bit bitnumber in byte bytenumber in the Hardware-Software-Communication memory (intention: set input to FPGA pins to 1).
• hwswcom_clearbit bytenumber bitnumber : clear bit bitnumber in byte bytenumber in the Hardware-Software-Communication memory (intention: set input to FPGA pins to 0).
• hwswcom_getbit bytenumber bitnumber : check whether bit bitnumber in byte bytenumber in the Hardware-Software-Communication memory is set.
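The three bit-level commands correspond to ordinary mask operations on a byte array. The Python sketch below models them on a local bytearray; the memory size of 16 bytes and the convention that bit 0 is the least significant bit are assumptions, and the class does not access the real memory-mapped registers.

```python
class HwSwComMemory:
    """Toy model of the Hardware-Software-Communication memory."""

    def __init__(self, size_bytes=16):  # size is an assumption
        self._mem = bytearray(size_bytes)

    @staticmethod
    def _check(bitnumber):
        if not 0 <= bitnumber <= 7:
            raise ValueError("bitnumber must be in 0..7")

    def setbit(self, bytenumber, bitnumber):
        """Set one bit (assumed numbering: bit 0 is the LSB)."""
        self._check(bitnumber)
        self._mem[bytenumber] |= 1 << bitnumber

    def clearbit(self, bytenumber, bitnumber):
        """Clear one bit without touching the other seven."""
        self._check(bitnumber)
        self._mem[bytenumber] &= ~(1 << bitnumber) & 0xFF

    def getbit(self, bytenumber, bitnumber):
        """Return True if the bit is set."""
        self._check(bitnumber)
        return bool(self._mem[bytenumber] & (1 << bitnumber))


mem = HwSwComMemory()
mem.setbit(2, 5)
mem.setbit(2, 0)
mem.clearbit(2, 5)
```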
4.5. Real-time Reconfigurable Hardware Task Management
To demonstrate the benefit of partial reconfiguration for real applications, the development of special operating system concepts is necessary in order to address the peculiarities of modularization, inter-module communication, reconfiguration scheduling and time-dependent allocation of resources with respect to different, often also real-time, constraints. Unfortunately, research work so far does not consider or reflect the real underlying hardware and works only on fictitious numbers or abstract mathematical models of hardware task or module parameters.
We present a methodology for generating a synthetic benchmark for dynamically reconfigurable hardware task bitstreams parameterizable in a) module size, b) execution time, c) arrival time, and d) possibly deadline, as well as a methodology for wrapping and modularizing arbitrary user hardware modules into such a flexible callable and loadable module concept for different FPGA platforms. A scripted tool supporting the generation of such modules makes the generation of dynamic hardware tasks easier and research on reconfiguration scheduling and allocation more comparable.
In the real-time machine scheduling and embedded systems research community, Dick et al. achieved a major breakthrough with their seminal 1998 paper TGFF: Task Graphs for Free [121], offering a user-controllable, general-purpose generator of synthetic software task graphs that is heavily used in the real-time scheduling research and system synthesis areas. The benefit of such a benchmark concept is that by sharing parameter settings, researchers may easily reproduce examples and case studies used by others, regardless of the platform.
The intention is to provide a similar concept and methodology for the domain of reconfigurable computing on FPGAs. In particular, we present a generic and customizable concept for the development and prototyping of dynamically reconfigurable hardware tasks that allows us to study and compare different scheduling and allocation techniques on different FPGA-based platforms in a reproducible manner. In this respect, we present
• a methodology for generating synthetic benchmarks for dynamically reconfigurable hardware task bitstreams parameterizable in arrival time, (core) execution time, reconfiguration time (indirectly through specification of module size), and deadline, as well as
• a methodology for wrapping and modularizing arbitrary user hardware modules into such a flexible callable and loadable module concept for different FPGA platforms. The goal is to make dynamic hardware task generation a part of our methodology to support automated generation of such callable and freely relocatable partial bitstreams.
In the following, we present an environment for the generation of partially reconfigurable tasks suitable for benchmarking. We assume that a reconfigurable system consists of a reconfigurable device which is linked via a hardware abstraction layer to an operating system. The operating system layer contains a task scheduler and a placer as well as a reconfiguration manager connected to a task repository. The task scheduler decides on a starting time for each task, based on an application-specific scheduling policy. Figure 4.6 presents an abstracted view of such a reconfigurable system. An off-line schedule is suitable for statically defined applications, whereas on-line schedulers are suitable for problems with dynamic, i.e., event-based, computation requests. The placer keeps track of all run-time assigned resources and initiates loading and unloading of tasks, based on the scheduler output, via the reconfiguration manager.
The reconfiguration manager is responsible for storage and caching of bitstreams which represent different hardware modules. Caching of bitstreams in SRAMs or fast Flash memory devices allows the operating system to decrease the reconfiguration time for each hardware module load process.
The aim is to physically measure the start-to-end execution time of a task and to compare it with simulation results across a wide range of reconfigurable platforms. We propose an automated generation of generic and configurable hardware tasks for this purpose.
Hardware tasks are partially reconfigurable modules with an additional control interface. As shown in Figure 4.6, a reconfigurable application running on the ESM includes a custom scheduler and placer as well as a pool of hardware tasks. In our approach, each hardware task (module) i is parametrized by the following attributes: arrival time A_i, start time S_i, enable signal E_i, reconfiguration time R_i, and core execution time C_i. The arrival time A_i is the time at which the request for the execution of a module becomes known to the reconfiguration manager. If an empty slot is found on the FPGA device for the requested task, then the start time S_i denotes the beginning of the partial reconfiguration process for this task. The reconfiguration time R_i itself depends on the size of the bitstream being loaded, the speed of the reconfiguration interface, and the software overhead in the operating system layer. Figure 4.7 shows the time scale of events.
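The relation between these attributes can be made concrete with a short Python sketch. It assumes that execution starts as soon as the enable signal E_i is raised and that a task runs for exactly C_i afterwards; the field enable_delay, which models a possible gap between the end of reconfiguration and the enable signal, and all method names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    arrival: float             # A_i: request becomes known
    start: float               # S_i: reconfiguration begins
    reconfiguration: float     # R_i: time to load the bitstream
    execution: float           # C_i: core execution time
    enable_delay: float = 0.0  # assumed gap between "loaded" and enable E_i

    def loaded_at(self) -> float:
        """Time at which reconfiguration has finished."""
        return self.start + self.reconfiguration

    def finished_at(self) -> float:
        """Time at which the task signals completion."""
        return self.loaded_at() + self.enable_delay + self.execution

    def waiting_time(self) -> float:
        """Time between the request and the start of reconfiguration."""
        return self.start - self.arrival

    def response_time(self) -> float:
        """Time between the request and completion."""
        return self.finished_at() - self.arrival


# Illustrative values in ms: a task that waits 10 ms for a free slot.
task = TaskTiming(arrival=0, start=10, reconfiguration=127, execution=254)
```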
Figure 4.7.: Time line showing the arrival of a task request, its reconfiguration and execution time. The execution is enabled separately through the enable signal E_i. An example of an active device supporting partial reconfiguration at run-time is shown in Figure 1 c).
The operating system layer implements a basic set of software functions needed for running these benchmarks. These functions must include bitstream loading operations as well as reset, setState and getState functions for each partial module. The setState and getState functions are for control and monitoring of the hardware task state. The life cycle of each task consists of five states: not active, scheduled, loaded, running and idle. The state diagram in Figure 4.8 shows the life cycle of a hardware task.
[Figure 4.8: state diagram with the states not active, scheduled, loaded, running and idle, and the transition labels no request, request A_i, reconfigure S_i, enable E_i, done, not done and remove Z_i.]
Figure 4.8.: State diagram showing the life cycle of a hardware task.
As long as a hardware task is not loaded, it resides in the default not active state. To remove a hardware task, the partial reconfiguration region is reloaded with a blank module. An example time line of events is shown in Figure 4.7.
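The life cycle can be written down as a small transition table. The Python sketch below uses the five states and the transition labels named for Figure 4.8; which label connects which pair of states is an assumption reconstructed from those labels, in particular that an idle task can be re-enabled or removed, so the table should be checked against the figure before it is relied upon.

```python
# (state, event) -> next state; the arcs are an assumed reading of Figure 4.8.
TRANSITIONS = {
    ("not active", "no request"): "not active",
    ("not active", "request"): "scheduled",     # request A_i
    ("scheduled", "reconfigure"): "loaded",     # reconfigure S_i
    ("loaded", "enable"): "running",            # enable E_i
    ("running", "not done"): "running",
    ("running", "done"): "idle",
    ("idle", "enable"): "running",              # enable E_i (re-activation)
    ("idle", "remove"): "not active",           # remove Z_i (blank module)
}


def step(state: str, event: str) -> str:
    """Return the next state or raise if the event is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state!r}") from None


def run(events, state="not active") -> str:
    """Apply a sequence of events, starting in the default state."""
    for event in events:
        state = step(state, event)
    return state


final_state = run(["request", "reconfigure", "enable", "not done", "done"])
```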
4.5.1. Hardware Task Generation
We implemented the generation of partially reconfigurable hardware tasks in the design tool hwtaskgen, which generates partially reconfigurable VHDL tasks. In order to test scheduling and allocation algorithms, all tasks have a common interface, which is a must for partial reconfiguration.
Synthetic hardware task generation is the main mode for the generation of reusable hardware modules. In this mode, a task set of concurrently running hardware modules can be generated. No overlapping in the placement is allowed, as this would generate errors during the place and route phase of the design. If a large task set of modules has to be generated that does not physically fit on the device, then several independent task sets have to be created. This is also necessary if run-time relocation of modules is not implemented in the operating system layer. In this case, different module placements can be implemented for one task set if possible. Otherwise, additional task sets have to be created.
In the second mode, the wrapper mode, existing hardware modules may be wrapped by a hardware interface that includes a communication interface for the operating system to create a partially reconfigurable hardware task. In this case, the hardware module interface is extended by a few signals necessary for benchmarking purposes. The original module interface is not changed. In particular, this concept holds for any hardware module that performs a function evaluation.
Figure 4.9 presents a view of a reconfigurable device populated with a generated synthetic benchmark set with three hardware tasks (HW-T1, HW-T2, and HW-T3). Each hardware task is connected via Xilinx-specific bus-macros to the static part of the design. All hardware tasks are run-time reconfigurable while the static part runs uninterrupted during the reconfiguration process. Each task has its own I/O interface including two additional control signals. The enable signal is set by the operating system layer after the hardware module has been successfully loaded. It enables the execution if set; otherwise execution is disabled. The done output signal is set once the task execution is finished.
Figure 4.9.: Generated hardware task set consisting of three modules (HW-T1, HW-T2, HW-T3) with different module widths. All signals between the static part and modules pass through bus-macros.
4.5.2. Design Flow
Our design flow is based on the Xilinx Early Access Partial Reconfiguration (EAPR) Flow [21] that supports the creation of partial hardware modules. Using EAPR requires the creation of one static module to which all partial modules are connected, as shown in Figure 4.9. This means that all signals of any partial module connected to external pins have to go through the static part of the design. The standard method for communication is using bus-macros, which connect partial modules with the static part of the design and can also be used for inter-module communication.
The design flow for partially reconfigurable real-time tasks is composed of three main parts, namely
• partial hardware module creation,
• an operating system layer running on a soft- or hardcore CPU, including module allocation, reconfiguration, and scheduling, and a
• HW-SW interface between hardware modules and the operating system layer.
This operating system layer is built on top of a hardware abstraction layer which encapsulates the HW-SW interface. In its implemented form, the HW-SW interface contains only a memory-mapped register set. This register set is used by the hardware modules and the PowerPC microprocessor to read and update status information of each hardware task.
Software API
The scheduling environment consists of a software library that includes all platform-specific functions to manage the execution of partially reconfigurable hardware tasks on the FPGA. The device-independent parts, such as scheduling and placement algorithms, are integrated into a Hardware Task Scheduler software library. This library can be reused on other platforms as long as a similar software library is available on the target's microprocessor.
In more detail, the software library includes functions to manage bitstreams in a module cache, if a cache memory is supported on the reconfigurable platform. The addToModuleCache and the removeFromModuleCache functions can be used to add new bitfiles and remove existing files from the cache memory, respectively. The listModuleCache command can be called to receive a list of all bitstreams currently available in the module cache. A call to resetModuleCache clears the content of the whole module cache by removing all modules.
The most important function is loadModule, which loads a given bitstream on the main FPGA at the given position. If the partial module is available in the module cache, then the bitstream is loaded from there into the main FPGA. Otherwise, the bitstream has to be transferred from an external source, for example a network directory, to the main FPGA. This operation can take some time and should be avoided by storing partial module bitstreams in the cache memory up-front. The unloadAllModules function can be called to reset the FPGA and remove all modules placed on the reconfigurable hardware.
// Module Cache Routines
int addToModuleCache(string bitfileName);
int removeFromModuleCache(string bitfileName);
list listModuleCache();
int resetModuleCache();

// Reconfiguration Routines
int loadModule(string bitfile, int position);
int resetAllModules();

// Module State Routines
moduleStates getModuleState(int position, optionalStateToWaitFor);
int setModuleState(int position, moduleStates value);

Listing 4.1: Summary of used software library functions supported by the operating system layer.
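The cache-first policy of loadModule can be illustrated with a Python model of the routines in Listing 4.1. The class below is only a sketch under stated assumptions: bitstreams are represented as byte strings, the external source is a dictionary, and the snake_case method names and the hit counter are introduced for illustration. It does not reproduce the ESM library.

```python
class ModuleLoaderModel:
    """Toy model of the module cache and loadModule policy."""

    def __init__(self, external_store):
        self.external_store = dict(external_store)  # bitfile name -> bytes
        self.cache = {}       # bitfiles held in the (assumed) module cache
        self.slots = {}       # slot position -> loaded bitfile name
        self.cache_hits = 0

    def add_to_module_cache(self, name):
        self.cache[name] = self.external_store[name]

    def remove_from_module_cache(self, name):
        self.cache.pop(name, None)

    def list_module_cache(self):
        return sorted(self.cache)

    def reset_module_cache(self):
        self.cache.clear()

    def load_module(self, name, position):
        """Load from the cache if possible, else from the external source."""
        if name in self.cache:
            self.cache_hits += 1
            data = self.cache[name]
        else:
            data = self.external_store[name]  # slow path, e.g. network directory
        self.slots[position] = name
        return len(data)  # number of bytes written to the device


loader = ModuleLoaderModel({"filter_a.bit": b"\x00" * 64, "filter_b.bit": b"\x01" * 32})
loader.add_to_module_cache("filter_a.bit")
loader.load_module("filter_a.bit", position=0)  # served from the cache
loader.load_module("filter_b.bit", position=1)  # fetched from the external source
```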
Additionally, two more functions are used to control a hardware task. The getModuleState routine can be called to poll whether a loaded module at a specified position has finished its execution or is still running. Optionally, a second parameter can be specified. In this case, the function does not return until the task has entered the corresponding state. The setModuleState function is used to set the task state. In particular, the execution of a running task can be stopped and periodic tasks can be activated or deactivated at specified times. As a side effect, this allows running modules to be time-multiplexed on the main FPGA.
Evaluation of Reconfiguration Overheads
To optimally schedule a set of tasks on a reconfigurable platform, the reconfiguration time overhead must be carefully studied and taken into account. The reconfiguration overhead for a hardware task determines whether dynamic hardware reconfiguration makes sense at all for certain applications. Therefore, we developed measurement software which loads hardware task bitstreams of different module sizes W_i (slot widths in CLBs) and known core execution times C_i, one at a time, onto the main FPGA. The software measures the time from the point when the loadModule command is issued in software to the point where the getModuleState command returns that the module is loaded and ready. This way, the reconfiguration time overheads caused by a) the software layer, b) transferring the bitstream from the module cache or from the memory of the control CPU, and c) the actual loading on the FPGA are quantitatively determined. For the ESM platform, Figure 4.10 and Table 4.1 show these reconfiguration times R_i depending on the size of the module to be loaded. For example, modules with a width of 8 CLBs are reconfigured in 127 ms. In all cases, the software overhead for the reconfiguration was measured as 3 ms. The same measurement strategy can be used on other platforms as long as a similar software library is available.
Hardware task width in CLBs    Reconfiguration time in ms
  4                               97
  8                              127
 12                              168
 24                              332
 36                              484
 48                              673
 60                              837
 72                             1013
Table 4.1.:
Reconfiguration overhead on the ESM platform for differently sized
partial modules. All hardware tasks are loaded from flash memory
directly into the main FPGA. The software overhead is very small
because only one command has to be sent to the Reconfiguration
Manager to load a partial module from flash memory.
[Plot: reconfiguration time in ms (0 to 1000) over reconfigurable module width in CLBs (0 to 72)]
Figure 4.10.:
Measured reconfiguration times for generated hardware tasks with
different module widths. A constant time overhead of 3 ms
resulted from the software layer.
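The values in Table 4.1 grow almost linearly with the module width, so a simple line fit can be used to estimate the reconfiguration time of a width that was not measured. The following is only a sketch under that linearity assumption; the function names fit_line and estimate_reconfiguration_ms are illustrative and not part of the ESM software library.

```python
# Measured values from Table 4.1: (module width in CLB columns, time in ms).
MEASUREMENTS = [(4, 97), (8, 127), (12, 168), (24, 332),
                (36, 484), (48, 673), (60, 837), (72, 1013)]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_reconfiguration_ms(width_clbs, points=MEASUREMENTS):
    """Estimate R_i for an unmeasured width (linear model, an assumption)."""
    slope, intercept = fit_line(points)
    return slope * width_clbs + intercept


if __name__ == "__main__":
    slope, intercept = fit_line(MEASUREMENTS)
    print(f"{slope:.2f} ms per CLB column, offset {intercept:.2f} ms")
```

The slope can be read as the cost per CLB column and the intercept as a width-independent offset; both are properties of this fit only and are not stated in the measurements themselves.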
Task Scheduling
To measure the reconfiguration overhead and to compare it with the theoretical
value, ten independent tasks are generated. In the next step, a schedule
minimizing the makespan of the task set must be found. All ten hardware tasks
have the same width of 8 CLB columns and can be loaded into any of the six
available slots on the FPGA. Each task's core execution time is a multiple of
the measured reconfiguration time of 127 ms, which is the time required for the
partial reconfiguration of an 8 CLB wide hardware task. Either an on-line
scheduling algorithm based on software library functions may be used or an
off-line scheduling algorithm may be run. In our example, we decided on the
latter approach in order to determine the optimal schedule for the example
problem. Furthermore, currently available FPGAs have the limitation that just
one task can be reconfigured at a time. This additional reconfiguration
constraint must be taken into account. Our optimal schedule minimizes the
makespan while respecting the reconfiguration overheads, as shown in Figure
4.11. Here, tasks 3, 5, 1, 6, 8, 9 cannot start simultaneously because only one
task can be reconfigured at a time. According to the theoretical schedule
results, the total execution time of these tasks on our reconfigurable platform
should be 24 ∗ 127 ms = 3.048 s.
[Schedule chart; axes: reconfigurable slot position and time steps]
Figure 4.11.:
Schedule produced for the example problem by our scheduling simulator. The
brightly shaded rectangular areas stand for the reconfiguration times R_i, the
green ones for the core execution times C_i.
The resulting schedule can be represented by a list of pairs in which the first
member specifies the task index and the second the corresponding slot position
on the reconfigurable device. The first pair in the list specifies the task to
be reconfigured first. For the example schedule in Figure 4.11, the schedule
list equals
{(3, 1), (5, 2), (1, 3), (6, 4), (8, 5), (9, 6), (7, 2), (10, 5), (4, 1), (2, 3)}.
A loader function, see Listing 4.2, does the partial bitstream loading
according to the slot position found earlier during scheduling. It first
removes the first item of the schedule list and checks if a module occupying
the destination slot position has finished its execution. Only if the flag Done
is set does the getModuleState function terminate, and the next hardware task
can be loaded.
To generate the above example task set, our tool hwtaskgen was used. The
parameter is a list of triples containing the slot position p_i of the hardware
task on the FPGA, the width W_i and the execution time C_i. An example
parameter list for n tasks looks as follows:
{(p_1, W_1, C_1), (p_2, W_2, C_2), ..., (p_n, W_n, C_n)}.
loader(PairList listSchedule)
{
    while (0 < listSchedule.size()) {
        nextPair = listSchedule.removeFirst();
        nextTaskIndex = nextPair.index();
        nextTaskPos = nextPair.position();
        /* with a second parameter, getModuleState does not return
           until the module in the target slot has reached DONE */
        if (getModuleState(nextTaskPos, DONE)) {
            loadModule(bitfileName[nextTaskIndex], nextTaskPos);
        }
    }
}
Listing 4.2:
Loader function to reconfigure the hardware tasks according to an
off-line schedule.
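The effect of the single-reconfiguration constraint on the makespan can be illustrated with a small simulation of a schedule list. This is a minimal sketch and not the scheduling simulator used for Figure 4.11: the function name simulate, the assumption that loading blocks the loader until reconfiguration has finished, and the execution times in the toy instance are all hypothetical.

```python
def simulate(schedule, exec_time, reconf_time=1):
    """Return (makespan, finish_times) for an ordered list of (task, slot) pairs.

    Only one reconfiguration may be in progress at any time, and a slot is
    reconfigured only after the task previously placed there has finished.
    """
    port_free_at = 0   # time at which the single configuration port is idle
    slot_free_at = {}  # slot -> finish time of the task occupying it
    finish = {}
    for task, slot in schedule:
        start = max(port_free_at, slot_free_at.get(slot, 0))
        port_free_at = start + reconf_time              # reconfiguration R_i
        finish[task] = port_free_at + exec_time[task]   # core execution C_i
        slot_free_at[slot] = finish[task]
    return max(finish.values()), finish


if __name__ == "__main__":
    # Toy instance with hypothetical execution times, in units of R.
    toy_schedule = [(1, 1), (2, 2), (3, 1)]
    toy_exec = {1: 2, 2: 1, 3: 1}
    print(simulate(toy_schedule, toy_exec))
```

In the toy instance, task 2 has to wait for the configuration port and task 3 has to wait for slot 1 to become free, which are the two kinds of delay the schedule must account for.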
For the above example task set, the total execution time on the real physical
hardware was off by only 3.67 %. In particular, the measured time for the
makespan of this task set was 3160 ms versus 3048 ms for the optimal execution
time of all ten tasks. The difference between theoretical and experimental
results may be further reduced by improving the deterministic behavior of our
hardware and software implementations. Our embedded Linux kernel [112] does not
use any real-time extension that would allow more deterministic software IRQs.
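The stated deviation follows directly from the two makespan values; written out as a worked step:

```latex
\[
  \frac{3160\,\mathrm{ms} - 3048\,\mathrm{ms}}{3048\,\mathrm{ms}}
  = \frac{112}{3048} \approx 0.0367 = 3.67\,\%
\]
```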
The operating system overhead for task scheduling approaches can have a
significant impact on the makespan performance of an executed task set. Our
analysis and comparison of scheduling simulation and a physical task set
execution on our platform demonstrated that the operating system impact on the
makespan performance was below 6.04 % or 3 ms. It must be noted that the
results were achieved on the ESM platform. However, with the hwtaskgen tool the
same task set can be generated for different platforms. The main goal of this
tool is to encourage the comparison of reconfigurable platforms in terms of
their practical reconfiguration performance and to reproduce examples used by
other researchers. Moreover, we presented a standard method for rapid
generation of synthetic, partially reconfigurable hardware task sets which
enable the generation of scheduling benchmarks for various reconfigurable
computing platforms.
4.6. Hardware Interfaces for Video Processing
In this section, components of the Erlangen Slot Machine important for video
processing will be highlighted. For the design of the hardware interfaces, the
application requirements placed on the whole system are considered.
4.6.1. Overview
The first step in implementing a video processing application on the ESM
platform is to provide video input and video output to the main FPGA which
hosts the hardware processing modules. Depending on the location of the
reconfigurable regions, the crossbar has to switch the video input and video
output to the correct slot positions on the main FPGA. The resulting images are
then overlaid with visual guides before being displayed. To separate this task
from the video application, an output frame buffer is required. Displaying
additional information inside the video image is done by writing the
information directly into the frame buffer. Modules processing the video stream
on the main FPGA need fast memory buffers to store the images for convolution
and intermediate results. Some of the algorithms can use a hardware-software
co-design approach by using the PowerPC for software computations, as shown in
Figure 4.12. In this case, the hardware-software communication must be
additionally instantiated to send and receive data from the software part of
the application.
As an additional objective, the video processing modules must not occupy the
whole area of the main FPGA to allow other reconfigurable modules to run in
parallel. The aim is to demonstrate the ESM architecture's capability for
parallel computing as well as the support for partial reconfiguration, that is,
reprogramming logic cells while not interrupting the circuits of other parallel
application modules.
Figure 4.12.:
Simplified structure of a video processing application designed for
the ESM platform. In its basic form, the video processing module
is connected to an input and output module. These three modules
reside on the main FPGA and require external memory. The
communication to and from the main FPGA is controlled by the
Crossbar FPGA.
The requirements for running a video processing application on the ESM platform
can be summarized as follows:
• an optimized video I/O system for VGA resolution of 640x480 pixels at 60 Hz,
• fast memory interfaces for buffering,
• hardware-software communication, and
• available resources on the main FPGA for other reconfigurable applications.
4.6.2. HW/SW Communication
The main FPGA and the PowerPC can only communicate via the crossbar. The
PowerPC is connected to the crossbar FPGA with its full 32 bit address and
data bus. The crossbar contains a register bank that is transparently mapped
into the address space of the PowerPC. Access to these registers is implemented
through standard memory read or write instructions. The software drivers are
implemented as Linux character device drivers. They control the access to
the crossbar register bank and are responsible for read and write access to the
memory mapped I/O registers. In order to ease the software interface, a custom
software library is provided. This library implements functions necessary for
sending or receiving whole data buffers to and from the main FPGA.
4.6.3. Video Input
The ESM offers an analog composite video input that is handled by the video
input module located inside the Crossbar FPGA. One question is how to
deinterlace the incoming video frames. For the implementation of the
deinterlacing scheme, two options are possible. Either the SDRAM at the
crossbar FPGA is used to deinterlace the image directly after the RGB
conversion, or the deinterlacing is implemented on the main FPGA. In the latter
case, an SRAM and some logic area of the main FPGA are utilized for
deinterlacing.
As discussed in [10], a method for deinterlacing a video stream is the weave
mode. Here, the lines of both fields are woven into each other. With the
pixel-pair frequency f_pair = 12.5 MHz and the required VGA pixel clock
f_pixel = 25 MHz, as well as the bit widths w_RGB = 24 bit for a pixel after
the RGB conversion and w_pair = 48 bit for a pixel pair, the throughput of the
deinterlacing node is specified as follows:
\[ R_{deinterlacing} = f_{in} \cdot w_{in} + f_{out} \cdot w_{out}
   = f_{pair} \cdot w_{pair} + f_{pixel} \cdot w_{RGB}
   = (12.5\,\mathrm{MHz} \cdot 48\,\mathrm{bit}) + (25\,\mathrm{MHz} \cdot 24\,\mathrm{bit})
   = 600\,\mathrm{Mbps} + 600\,\mathrm{Mbps} = 1.2\,\mathrm{Gbps} \]
This data rate can be reduced to one third by converting the color images to
grayscale, that is, by converting 24 bit RGB pixels to 8 bit grayscale pixels.
The optimized video input then has the following throughput requirements:
\[ R'_{deinterlacing} = f_{in} \cdot w'_{in} + f_{out} \cdot w'_{out}
   = f_{pair} \cdot w_{graypair} + f_{pixel} \cdot w_{gray}
   = (12.5\,\mathrm{MHz} \cdot 16\,\mathrm{bit}) + (25\,\mathrm{MHz} \cdot 8\,\mathrm{bit})
   = 400\,\mathrm{Mbps} = \tfrac{1}{3}\, R_{deinterlacing} \]
The reduced throughput enables the overall realization of a video processing
pipeline, as memory has to be split among the subsystems.
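The two throughput figures can be rechecked with a few lines of arithmetic. The sketch below only restates the formula R = f_in * w_in + f_out * w_out from this section; the function and constant names are illustrative.

```python
def throughput_bps(f_in_hz, w_in_bit, f_out_hz, w_out_bit):
    """R = f_in * w_in + f_out * w_out, in bit/s."""
    return f_in_hz * w_in_bit + f_out_hz * w_out_bit


F_PAIR_HZ = 12_500_000   # pixel pairs per second entering the node
F_PIXEL_HZ = 25_000_000  # pixels per second leaving the node

# 24 bit RGB pixels (48 bit per pair) versus 8 bit grayscale (16 bit per pair).
R_RGB = throughput_bps(F_PAIR_HZ, 48, F_PIXEL_HZ, 24)
R_GRAY = throughput_bps(F_PAIR_HZ, 16, F_PIXEL_HZ, 8)

if __name__ == "__main__":
    print(R_RGB / 1e6, "Mbps for RGB,", R_GRAY / 1e6, "Mbps for grayscale")
```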
4.6.4. Video Output
After passing the processing modules on the main FPGA, the images are sent
to the display frame buffer on the Motherboard. The VGA output mode is set
by the Video-Out FPGA to 640x480 at 60 Hz. For simple video filters, the direct
output of the video stream can be implemented. For this to work, the processed
image buffer must be read 60 times per second and the pixel stream has to be
passed through to the RAMDAC device. However, the start of a new image on the
VGA output is not synchronized with the output process of the video processing
module. This leads to a visible image change in the middle of the displayed
image, mainly during pan and tilt of the camera, because the display of the
next image frame is not synchronized with the vertical sync signal. To avoid
this, the video images must be stored in a frame buffer before being displayed
through the VGA output.
A dedicated output logic is required to control the frame buffer. Visual guides
implemented in hardware on the Video-Out FPGA can be used to draw additional
information on top of the video images. They are merged with the video stream,
resulting in a 24 bit RGB output for the VGA signal generation in the RAMDAC
device. Upon the first pixel of a frame, a frame begin signal must be set. The
frame buffer is implemented on the Video-Out FPGA and uses the two 8 MByte
SDRAMs.
4.6.5. Memory Interfaces
Video image processing requires access to large image memory. When convolution
filters are applied to an image, every pixel inside the sliding window has to
be accessed. The internal BlockRAM of an FPGA is a good choice for storing
images because it supports dual-ported access. However, its size is limited
and a grayscale image of 640x480 pixels does not fit into the distributed RAM
of the complete FPGA. Hence, external memory buffers must be utilized. To
fulfill the real-time conditions without skipping any frame, they must be fast
enough to accept incoming image data and at the same time output data for
the next process. This condition results in the same throughput requirement
found for the deinterlacing node, R_memory = 400 Mbps. External memory
resources accessible from the main FPGA are its six SRAM banks, each two MByte
in size. One SRAM module has a data width of 8 bit and must be clocked at least
at f_SRAM = 50 MHz.
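The minimum SRAM clock follows from dividing the required throughput by the data width. As a worked step, with w_SRAM as an illustrative symbol for the 8 bit data width:

```latex
\[
  f_{SRAM} \ge \frac{R_{memory}}{w_{SRAM}}
  = \frac{400\,\mathrm{Mbps}}{8\,\mathrm{bit}} = 50\,\mathrm{MHz}
\]
```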
5. Application Scenarios and Use Cases
5.1. Introduction
This chapter presents real application scenarios for the developed ESM platform.
Three adaptive video-based applications use run-time partial reconfiguration to
demonstrate adaptive functional behavior on the ESM. The first application
is a video-based car lane detection system, which can be reconfigured
on-the-fly to the second application, a car taillight detection system that is
better suited to recognize other vehicles in front at night or in tunnels. The
third application loads four basic partially reconfigurable video filters:
contrast, grayscale, inversion, and Sobel. Finally, a point-based rendering
application implements an alternative type of 3D graphics rendering system that
is well suited for volume data rendering.
The ESM design fulfills the prerequisites for a modular, pipelined and adaptive
system supporting real-time video applications. Its architecture splits the FPGA
into reconfigurable regions, called slots. Each slot can be updated at run-time
with a new functional logic block, not interfering with already loaded modules.
External SRAMs are provided as local memory for modules requiring large
memory space. An external Crossbar FPGA provides on request all peripheral
signals to any module location on the main FPGA. An idealized modular video
processing system is shown in Figure 5.1. This proposed architecture shows a
pipelined computation in which the computational blocks are the modules that
process the image frames. The first module buffers the image captured from
an image source. This can be a camera or a network module which collects a
video stream through any network channel. External SRAM devices are used to
temporarily store frames between two modules, thus allowing a processed image
to be streamed to the next processing module.
Video algorithms process an uncompressed video data stream on an
image-by-image basis. Each video frame itself is transmitted pixel-by-pixel.
This is also called a video stream or pixel stream. Many image processing
filters require the neighborhood matrix of a pixel or even a complete frame to
compute the resulting pixel. Capturing the neighborhood of a pixel can be done
with a sliding window [122] approach. The sliding window can be implemented
with shift registers and can process a continuous pixel stream. Other
algorithms require random access to each pixel. In these cases a complete frame
must be stored and processed before the next frame can be accessed.
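The sliding window idea can be sketched in software: a bounded buffer plays the role of the shift register chain and holds just enough pixels of a row-major stream to expose one k x k neighborhood at a time. This is an illustrative model only, not the hardware implementation; the function name sliding_windows and the convention that u is the column and v the row are assumptions.

```python
from collections import deque


def sliding_windows(pixels, width, k=3):
    """Yield (u, v, window) for each complete k x k window of a pixel stream.

    pixels is a row-major stream of one frame with the given line width;
    (u, v) is the window center for odd k.
    """
    # Two full lines plus k pixels are enough for k = 3, like a
    # shift register chain with (k - 1) line buffers.
    buf = deque(maxlen=(k - 1) * width + k)
    for n, p in enumerate(pixels):
        buf.append(p)
        v, u = divmod(n, width)  # position of the newest pixel
        if v >= k - 1 and u >= k - 1:
            # buf[0] is the top-left pixel of the current window.
            window = [[buf[r * width + c] for c in range(k)]
                      for r in range(k)]
            yield u - k // 2, v - k // 2, window


if __name__ == "__main__":
    frame = range(16)  # a 4 x 4 test frame with pixel values 0..15
    for u, v, window in sliding_windows(frame, width=4):
        print((u, v), window)
```

Only windows that lie completely inside the frame are produced, so border pixels get no window in this sketch.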
Figure 5.1.:
A modular architecture for video streaming as implemented on the
slot-based structure of the ESM.
An adaptive video processing system is characterized by its ability to optimize
the computation performed on the video stream according to changing
environmental conditions. In most cases, only one specific module is changed at
a time, while the system keeps running without an interrupt. For example, the
video capture module can be changed to optimize the conversion of pixels to
match the current brightness or the current landscape. It is also possible to
change the video source from the camera to a new one with different
characteristics. In an adaptive system, the functionality of a module inside
the data path should be changed very fast to minimize any effects on the rest
of the system. Traditionally, this can be achieved by implementing multiple
algorithms in parallel. Configuration parameters force the module to switch
from one algorithm to the next one. However, structures of even basic
algorithms are not always the same and algorithms have to be implemented in
parallel. A Sobel filter [123], for example, cannot be changed into a Laplace
filter by just changing its parameters. This is also true for a Median
operator, which cannot be replaced by a Gauss operator by just changing
parameters. In these cases, the complete module should be replaced by a
different processing module of the same size, while the rest of the system
keeps running without an interrupt.
For the ESM we developed the concept of partially reconfigurable video filters
that we call Video-Engines. They are analogous to partially reconfigurable
modules or hardware tasks as they include an application specific data I/O
interface on top of a standard control interface.
Altogether, six different partially reconfigurable video processing modules,
also called Video-Engines in the following, are implemented on the ESM
platform. These include four basic video processing modules, an Edge-Engine and
a Taillight-Engine. They can be replaced by each other during run-time through
partial reconfiguration and will be introduced next.
However, before we introduce these Video-Engines in detail, we first describe
the data flow and embedding of reconfigurable modules on the ESM platform.
5.2. Real-Time Video Processing on the ESM
5.2.1. Data Flow
The video processing data flow and the resource bindings are illustrated in
Figure 5.2. The processing starts at the video input processor SAA7113H which
transforms an analog video signal (PAL, NTSC, SECAM) to a digital but
interlaced YCbCr stream.
This stream is then transferred to the main FPGA via the crossbar FPGA
which also converts it into an RGB stream. On the main FPGA, each frame is
first deinterlaced in an external SRAM device and then forwarded to the
hardware Video-Engine on the main FPGA. This hardware processing module can
communicate with the PowerPC over the Hardware-Software Communication
link inside the Crossbar. It allows asynchronous data exchange through FIFOs.
The software part of the video processing application can work in parallel to
the main FPGA. Its results are sent back and further processed by the hardware
module. An output hardware module relays the video stream and control signals
back through the crossbar to the Video-Out FPGA. The Video-Out FPGA
implements a frame buffer with the two SDRAMs. After a complete frame is
received, it is finally displayed via the VGA port.
However, deinterlacing cannot be bound to the crossbar FPGA because the
single 32 MByte SDRAM device connected to the crossbar does not fulfill the
required throughput. The implemented solution with the grayscale optimization
uses only one of the six SRAMs on the main FPGA for deinterlacing.
Figure 5.2.:
The data flow chart of the overall system with resource bindings.
The deinterlacing must be done on the main FPGA as the single SDRAM module at
the Crossbar does not support the required throughput.
5.2.2. Main FPGA Partitioning
To support run-time reconfiguration of Video-Engines, the main FPGA has to
be partitioned for partial reconfiguration. For each partially reconfigurable
module, a rectangular region must be defined on the FPGA. This region will be
surrounded by a static part supplying all signals and clocks. However,
BlockRAMs are distributed equally over the FPGA chip area [25] and only those
inside the reconfigurable region can be used by the reconfigurable module.
Figure 5.3.:
Implementation of partially reconfigurable image processing engines
on the ESM. The video signals occupy more than half of the Crossbar I/Os.
The blue shaded slots are assigned to the static part and the red shaded
region is used by the reconfigurable video module, also called engine.
The seven slots on the right and the two connected SRAMs can be used for
other reconfigurable or static hardware modules.
There are six columns with twenty-four 18 Kbit BlockRAMs in micro-slots A,
F, K, L, Q, and V. However, the size of the reconfigurable region is defined
by the required number of I/Os for each module. Fifteen crossbar I/O groups
are needed to link the video input data to the main FPGA and to output an
RGB video stream. The chosen partitioning is depicted in Figure 5.3. The
blue shaded slots are the static part of the design, also called base. Here,
the deinterlacing is done in micro-slot A, and micro-slots M to O contain
the output logic. Micro-slots B to L contain the reconfigurable region for
the partially reconfigurable video processing module. It has three columns of
BlockRAM and enough logic cells for different video processing engines. The
first SRAM on the left is allocated by the deinterlacing. Although the SRAMs
are located mostly above the reconfigurable area, they have to be connected
through the static part, because the Xilinx Early Access Partial
Reconfiguration design flow does not allow the partial modules to access the
I/O pins directly [21].
5.3. Implemented Video-Engines
5.3.1. Basic Video Filters
Point operations map pixel values without changing the size or geometry of the
image I. Each new pixel value depends solely on the previous value at the same
position. The point operation is independent of the image coordinates and the
original pixel values are mapped to new values by a function f:
\[ I'(u, v) \leftarrow f(I(u, v), u, v) \]    (5.1)
Typical examples of point operations include:
• adjustment of image brightness or contrast,
• color transformations,
• intensity transformations,
• global thresholding.
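Two of the listed point operations, inversion and global thresholding, can be sketched as follows. The sketch assumes, as the text describes, that f depends only on the pixel value and not on the coordinates, and that pixels are 8 bit grayscale values; all function names are illustrative.

```python
def apply_point_operation(image, f):
    """Return a new image with f applied to every pixel value independently."""
    return [[f(p) for p in row] for row in image]


def invert(p, max_value=255):
    """Intensity inversion of a single pixel value."""
    return max_value - p


def threshold(level, max_value=255):
    """Build a global thresholding function for the given level."""
    return lambda p: max_value if p >= level else 0


if __name__ == "__main__":
    image = [[0, 100], [200, 255]]
    print(apply_point_operation(image, invert))
    print(apply_point_operation(image, threshold(128)))
```

Because each output pixel needs only the input pixel at the same position, such operations can run on a pixel stream without any image buffering.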
The capabilities of point operations are limited, as they cannot be used for the
sharpening or the smoothing of an image. This is what filter operations can
do. The result of each filter operation depends on more than one original image
pixel value. For example, a simple smoothing filter could replace every pixel
by the average of itself and its eight neighboring pixels. With the original
pixel I(u, v) = p_0 at the same position and its neighbors p_1 to p_8, the
arithmetic mean is computed with
\[ I'(u, v) \leftarrow \frac{p_0 + p_1 + p_2 + p_3 + p_4 + p_5 + p_6 + p_7 + p_8}{9} \]    (5.2)
which is equivalent to
\[ I'(u, v) \leftarrow \frac{1}{9} \sum_{j=-1}^{1} \sum_{i=-1}^{1} I(u + i, v + j). \]    (5.3)
In this example, all nine pixels in the 3 × 3 support region are added with
the same weight. These coefficients are also called filter matrix H(i, j) or
filter kernel. In this special case the filter matrix is
\[ H(i, j) = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. \]    (5.4)
Incorporating the typical 3 × 3 filter H, all pixels, except the border pixels,
in the new image I'(u, v) are computed by the expression
\[ I'(u, v) \leftarrow \sum_{j=-1}^{1} \sum_{i=-1}^{1} I(u + i, v + j) \cdot H(i, j) \]    (5.5)
which is a modified description of a discrete, two-dimensional convolution
defined as
\[ I'(u, v) = \sum_{j=-\infty}^{\infty} \sum_{i=-\infty}^{\infty} I(u - i, v - j) \cdot H(i, j) \]    (5.6)
which can be written using the convolution operator as
\[ I' = I * H. \]    (5.7)
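Expression (5.5) translates almost literally into two nested loops over the 3 × 3 support region. The following is a plain software sketch, not the streaming hardware implementation; leaving the border pixels unchanged is an assumption made here, since the text only states that they are excluded.

```python
def filter3x3(image, kernel):
    """Apply a 3 x 3 filter matrix H as in expression (5.5).

    image[v][u] is the pixel in row v and column u; kernel[j + 1][i + 1]
    holds H(i, j). Border pixels are copied unchanged (an assumption).
    """
    height, width = len(image), len(image[0])
    out = [list(row) for row in image]
    for v in range(1, height - 1):
        for u in range(1, width - 1):
            acc = 0
            for j in (-1, 0, 1):
                for i in (-1, 0, 1):
                    acc += image[v + j][u + i] * kernel[j + 1][i + 1]
            out[v][u] = acc
    return out


# The smoothing filter matrix of equation (5.4).
BOX = [[1 / 9] * 3 for _ in range(3)]

if __name__ == "__main__":
    image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
    print(filter3x3(image, BOX)[1][1])
```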
The size of the filter matrix is an important parameter of the filter as it
specifies how many pixels contribute to each resulting pixel value. Typical
filter sizes are 3 × 3, 5 × 5, 7 × 7, or even 21 × 21 pixels. Common linear
filter operations are: blur filter, find edges filter, sharpen filter and mean
filter.
On the ESM platform, basic video filters, such as contrast, grayscale,
inversion, Gauss, Laplace and Sobel filter, were implemented on the Virtex-II
6000 FPGA to show the partial reconfiguration capability. Effects of these
filters are shown in Figure 5.4. These partially reconfigurable modules operate
in streaming mode without any image buffering and demonstrate the run-time
reconfiguration of hardware logic on the ESM. At startup, a blank filter just
forwards the input pictures to the video output. This startup module is then
replaced by any of the implemented filters and the result can be seen instantly
on the display output.
Figure 5.4.:
Basic image filters implemented as partially reconfigurable modules
on the ESM.
5.3.2. Edge-Engine
The Edge-Engine is a real-time video processing algorithm [124] designed to
detect lanes and obstacles in a video stream taken by a camera mounted inside
a vehicle. Figure 5.5 shows a processed image with visual guides. The green
lines indicate the lane and the red color is an indicator for possible obstacles.
Figure 5.5.:
The Edge-Engine enhances the camera data by displaying the edges
in the image and marking the lane with green lines. The red pixels
indicate possible obstacles as they will appear as horizontal edges.
The more red pixels are shown over an object the more likely an
obstacle was found.
The functionality of the Edge-Engine is modularized into three processing steps.
First, the incoming images are convolved with a Sobel filter. By calculating the
gradient, edges are extracted. Values above a threshold are marked blue and
are the basis for the following two steps. Next, the lane is detected using the
Hough transformation that looks for predefined shapes in the image. Here, the
predefined shapes are two straight lines that are parametrized according to the
lane. The two straight lines start in the vanishing point and run to the lower
corners of the image. If found in the captured image, the two lines fitting
best the predefined shapes are colored green. In the last step, the Edge-Engine
tries to identify obstacles. For that, horizontal edges above a defined
brightness or sharpness are marked with red.
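The first step, Sobel filtering followed by thresholding of the gradient, can be sketched in software as below. This is not the Edge-Engine implementation: the commonly used Sobel kernels are written out explicitly, and approximating the gradient magnitude by |gx| + |gy| is an assumption chosen here because it avoids a square root.

```python
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # responds to vertical edges
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # responds to horizontal edges


def sobel_edges(image, threshold):
    """Return a boolean edge map; border pixels are never marked."""
    height, width = len(image), len(image[0])
    edges = [[False] * width for _ in range(height)]
    for v in range(1, height - 1):
        for u in range(1, width - 1):
            gx = gy = 0
            for j in (-1, 0, 1):
                for i in (-1, 0, 1):
                    p = image[v + j][u + i]
                    gx += p * SOBEL_X[j + 1][i + 1]
                    gy += p * SOBEL_Y[j + 1][i + 1]
            # Gradient magnitude approximated by |gx| + |gy| (an assumption).
            edges[v][u] = abs(gx) + abs(gy) > threshold
    return edges


if __name__ == "__main__":
    step = [[0, 0, 10, 10] for _ in range(4)]  # vertical brightness step
    for row in sobel_edges(step, threshold=20):
        print(row)
```

Restricting the test to the gy component would single out horizontal edges, which is the kind of edge the obstacle step looks for.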
The three image processing steps are designed to be exchangeable. Under special
conditions, the Sobel filter could be replaced by a different edge detection
filter.
5.3.3. Taillight-Engine
The Taillight-Engine's [125] purpose is to identify cars on dark roads or in
tunnels. When driving in darkness, taillights identify the vehicles driving in
front. However, in tunnels there are also static tunnel lights, lit-up lane
markings, and reflections which have to be filtered out. A resulting image with
visual guides is presented in Figure 5.6. All lights in the image are marked
but only matching light pairs are highlighted with a red box, indicating a
driving vehicle in front.
The Taillight-Engine [126] is divided into two parts, hardware filtering and
software computations [49]. Pattern matching is done in hardware to extract all
light sources, meaning bright and enclosed pixel areas. These are labeled by a
clustering algorithm and define the feature points consisting of position, size,
and brightness. Still in hardware, they are marked with green spots. The
software part compares them with previous feature points and calculates the
motion vectors. Static lights can be filtered out as they are moving away from
the vanishing point at a certain speed. Now, cars can be identified by finding
matching pairs of lights. Besides having the same size and brightness, they
must be at the same height and be visible for several frames. Finally, car
positions are sent back to the hardware for visualization.
Pattern matching is performed by the spotlight matching module to find
taillights of cars that are driving in front. It identifies all light sources
in the captured image. Based on these, the segmentation module performs a
segmentation of the recognized light sources. The resulting list of lights is
cleared of all static light sources. In the next step, light pairs are built
out of the list of non-static lights. Then, the probability of a light pair
being the taillights of a vehicle moving ahead is computed. Finally, a list of
valid taillights is sent to the hardware module in order to highlight all
detected vehicles moving ahead. A complete overview of the data flow of the
Taillight-Engine is shown in Figure 5.8.
Figure 5.6.:
The result of the Taillight-Engine demonstration at CeBIT 2008,
more details in Section 5.2.7. All found lights are highlighted by
a green spot each. If two lights match in size and brightness, are
located at the same height, and have car-like motion vectors, they
are identified as a pair belonging together, and a red car marker is
placed between them.
Spotlight Matching
In the first processing step of the Taillight-Engine, the spotlight module
performs a pattern matching in which individual light sources are identified.
The Spotlight-Engine recognizes taillights in an image by finding bright pixels
within a roughly square shape surrounded by distinctly darker pixels. The
general form of the pattern is illustrated in Figure 5.7 a). P_O denotes the
mask for the brightness range of possible lights, P_U the mask of the dark
environment. This pattern is applied to every pixel in the current image, and
the darkest pixel in P_O is determined. If there is no brighter pixel in P_U,
P_O is marked as a light source. Additionally, a number of pixels inside P_O
must reach a given brightness level.
Figure 5.7.:
From left to right: a) Light pattern matrix in Spotlight-Engine, b)
applied to taillight in image and c) applied to a lane in image.
A demonstration of this approach is illustrated in Figure 5.7 b). Here, a car
taillight pattern is detected. In Figure 5.7 c), a lane is not recognized as a
light since the lane is bright in the area of P_O and exhibits the same
brightness in P_U.
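The pattern test can be sketched for a single candidate position as follows. The geometry and thresholds are hypothetical: the inner mask P_O is taken as a 5 x 5 square and the outer mask P_U as the border of an 11 x 11 square, and a P_U pixel that is merely as bright as the darkest P_O pixel already rejects the candidate, which is an interpretation chosen to match the lane example.

```python
def is_light(image, cu, cv, inner=2, outer=5, min_bright=200, min_count=4):
    """Test the light pattern centered at column cu, row cv.

    All default parameter values are hypothetical illustration values.
    """
    height, width = len(image), len(image[0])
    if not (outer <= cu < width - outer and outer <= cv < height - outer):
        return False  # pattern does not fit into the image
    p_o = [image[v][u]
           for v in range(cv - inner, cv + inner + 1)
           for u in range(cu - inner, cu + inner + 1)]
    p_u = [image[v][u]
           for v in range(cv - outer, cv + outer + 1)
           for u in range(cu - outer, cu + outer + 1)
           if max(abs(u - cu), abs(v - cv)) == outer]  # border ring only
    darkest_inside = min(p_o)
    if max(p_u) >= darkest_inside:
        return False  # environment is not distinctly darker
    return sum(p >= min_bright for p in p_o) >= min_count


def make_image(bright_rows, bright_cols, size=11, dark=10, bright=250):
    """Build a dark test image with one bright rectangular area."""
    return [[bright if v in bright_rows and u in bright_cols else dark
             for u in range(size)] for v in range(size)]


if __name__ == "__main__":
    spot = make_image(range(3, 8), range(3, 8))   # enclosed bright square
    lane = make_image(range(0, 11), range(3, 8))  # bright stripe, not enclosed
    print(is_light(spot, 5, 5), is_light(lane, 5, 5))
```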
Segmentation
In the second step of the processing chain, segmentation is done by the
segmentation module. Light points are grouped together in regions and based on
these a summary list is compiled. Each region consists of the coordinates of
the included pixels and their overall brightness. The regions are recognized in
a pixel-based manner, similar to the spotlight matching. This approach is also
known as "Connected Component Labeling" [127].
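Connected component labeling can be sketched with a breadth-first flood fill. This is a generic software version and not the segmentation module itself: a brightness threshold stands in for the light points delivered by the spotlight matching, 4-connectivity is an assumption, and the region record layout is illustrative.

```python
from collections import deque


def label_regions(image, min_brightness):
    """Group 4-connected bright pixels into regions.

    Returns a list of dicts holding the (u, v) coordinates of the pixels
    of each region and the sum of their brightness values.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for v0 in range(height):
        for u0 in range(width):
            if seen[v0][u0] or image[v0][u0] < min_brightness:
                continue
            seen[v0][u0] = True
            queue = deque([(u0, v0)])
            pixels, brightness = [], 0
            while queue:
                u, v = queue.popleft()
                pixels.append((u, v))
                brightness += image[v][u]
                for du, dv in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nu, nv = u + du, v + dv
                    if (0 <= nu < width and 0 <= nv < height
                            and not seen[nv][nu]
                            and image[nv][nu] >= min_brightness):
                        seen[nv][nu] = True
                        queue.append((nu, nv))
            regions.append({"pixels": pixels, "brightness": brightness})
    return regions


if __name__ == "__main__":
    image = [[200, 200, 0, 0],
             [0, 0, 0, 220],
             [0, 0, 0, 220]]
    for region in label_regions(image, min_brightness=100):
        print(region)
```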
Determining Light Pairs
The processing result of the spotlight matching and segmentation is a list of
light sources. In the next step, static lights, e.g., idle vehicles, roadway
lighting or reflections of roadway restrictions, are filtered out of the list.
Previous images and light source lists are used to determine a motion vector
for each light source. The apparent movement of a static light is not caused by
a motion of the light itself, but by the camera mounted inside a moving
vehicle. Motion vectors of each light can be used to find out whether the light
source is static or not. However, direction changes of the road or fluctuating
lighting conditions can lead to motion vectors which do not exactly point in
the opposite direction of the vanishing point. Therefore, more than just two
subsequent images are used to determine the motion vector.
The second processing step on the list of found light sources examines the
relationship between non-static lights. The objective is to find light pairs
which correspond to the taillights of a moving vehicle. The distance between
two lights and their motion vector are the main criteria for the selection.
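The pairing step can be sketched as a pairwise comparison of the remaining lights. The record fields and every tolerance below are hypothetical illustration values, not the criteria or thresholds of the Taillight-Engine; the sketch only shows how distance, height, size and motion vector similarity could be combined.

```python
from itertools import combinations


def find_light_pairs(lights, max_height_diff=3, max_size_ratio=1.5,
                     min_distance=10, max_distance=120, max_motion_diff=2):
    """Return id pairs of lights that could be the taillights of one vehicle.

    Each light is a dict with keys id, u, v, size (> 0) and motion (du, dv).
    """
    pairs = []
    for a, b in combinations(lights, 2):
        same_height = abs(a["v"] - b["v"]) <= max_height_diff
        big, small = max(a["size"], b["size"]), min(a["size"], b["size"])
        similar_size = big <= max_size_ratio * small
        plausible_distance = min_distance <= abs(a["u"] - b["u"]) <= max_distance
        similar_motion = all(abs(ma - mb) <= max_motion_diff
                             for ma, mb in zip(a["motion"], b["motion"]))
        if same_height and similar_size and plausible_distance and similar_motion:
            pairs.append((a["id"], b["id"]))
    return pairs


if __name__ == "__main__":
    lights = [
        {"id": 1, "u": 100, "v": 50, "size": 20, "motion": (1, 0)},
        {"id": 2, "u": 160, "v": 51, "size": 22, "motion": (1, 0)},
        {"id": 3, "u": 300, "v": 10, "size": 5, "motion": (-4, 2)},
    ]
    print(find_light_pairs(lights))
```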
HW/SW Partitioning
On one side, the taillight recognition system consists of image processing
operations applied to each video image. On the other side, more complex
operations must be applied to find matching taillights. Different criteria are
evaluated to differentiate between static lights and vehicle taillights.
The pixel-level image operations are simple and may be executed in parallel.
Therefore, they should be implemented in hardware. The operations on the light
list are more complex and include control-intensive steps. These operations can
be partially or completely implemented on the ESM in software. Based on the
PowerPC performance for these operations, we decided to completely implement
the taillight search in software.
Implementation
The implementation of the Taillight-Engine on the ESM corresponds to the
hardware-software partitioning shown in Figure 5.8. The data transfer between
the PowerPC and the hardware module on the main FPGA required a specialized
hardware/software communication module. The implementation of the hardware and
software components of the Taillight-Engine is given next.
Pattern matching of the Spotlight-Engine is realized by a hardware module for
the main FPGA. The brightness of each pixel in the current frame is compared
to the border of the 11x11 pattern matrix. One option is to store the current
image in the external single-ported SRAMs. A bandwidth-saving method of loading
the pixel matrix has to be used for fast processing. In each step, the matrix
is shifted vertically by one pixel. In this case only the first line of the
matrix has to be loaded from the SRAM. The pixels in the outer columns and in
the last line are made available by shift registers. However, this method
requires several clock cycles per pixel due to the limited SRAM bandwidth.

Another option is to use the internal BlockRAMs of the Xilinx Virtex-II 6000
FPGA [128]. BlockRAMs offer a higher bandwidth and support dual-port access.
This enables the loading of the first and the last line of pixels in a single
clock cycle. The first, fifth and eleventh column of pixels have to be buffered
in shift registers.
Figure 5.8.: HW/SW partitioning of the Taillight-Engine on the ESM.
Due to the greater performance of the BlockRAMs, a memory controller with a
bandwidth of 128 bit is implemented. Several BlockRAMs are cascaded to store
the pixel data required by the matrix. Thus, the memory interface allows
byte-by-byte addressing and may also be used with other pattern matrices. The
maximum clock speed is 52.1 MHz with 90% of the BlockRAMs and 30% of the
Virtex-II 6000 FPGA logic resources used. The actual implementation uses images
with a size of 384 · 288 · 8 bit = 110592 bit (QVGA resolution). Each BlockRAM
contains 110592/16 bit = 6912 bit.

Based on the new memory interface, the pattern matrix can be evaluated in each
clock cycle. The brightness of the pixel in the center of the matrix is compared
with the brightest pixel at the border. A pipeline architecture is applied to
increase throughput. As 40 pixels have to be compared, the pipeline consists of
$\lceil \mathrm{ld}(40) \rceil = 6$ pipeline stages.
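
The stage count can be reproduced with a small software model. The sketch below
assumes that the brightest border pixel is found by a binary comparator tree in
which each tree level forms one pipeline stage; the thesis only states the
resulting number of stages, so the tree structure and the name `pipelined_max`
are assumptions.

```python
import math

def pipelined_max(values):
    """Reduce values with a binary comparator tree.

    Returns (maximum, number_of_stages). Each loop iteration models one
    pipeline stage in which neighbouring values are compared pairwise.
    """
    level = list(values)
    stages = 0
    while len(level) > 1:
        nxt = [max(level[k], level[k + 1]) for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        stages += 1
    return level[0], stages

border = list(range(4 * 11 - 4))    # 40 border pixels of an 11x11 matrix
maximum, stages = pipelined_max(border)
expected = math.ceil(math.log2(len(border)))   # ceil(ld(40))
```

For 40 inputs the tree shrinks as 40, 20, 10, 5, 3, 2, 1, which gives the six
stages stated above.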
Figure 5.9.: The implementations of the video applications on the Virtex-II 6000
by comparison: Contrast filter (left), Edge-Engine (center), and
Taillight-Engine (right).
An additional BlockRAM is used to store the results of the pattern matching. If
the pixel in the image is darker than those at the border of the matrix, a zero
is written to the corresponding position in this buffer. Otherwise, their
brightness difference is saved.
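
As a software reference for this comparison, the following sketch evaluates the
border of the pattern matrix around every pixel. It is an illustration under
assumptions, not the hardware design: the function name `spotlight_match` is
hypothetical, and pixels closer than half a matrix width to the image edge are
simply left at zero.

```python
def spotlight_match(image, size=11):
    """Software reference of the border comparison on a size x size window.

    image: list of rows of 8 bit brightness values.
    Returns a buffer of the same shape: 0 where the centre pixel is not
    brighter than the brightest border pixel, else the brightness difference.
    Pixels closer than size // 2 to the image edge are left at 0 (assumption).
    """
    half = size // 2
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(half, height - half):
        for x in range(half, width - half):
            top, bottom = y - half, y + half
            left, right = x - half, x + half
            # Brightest pixel on the four edges of the window.
            border = max(
                max(image[top][left:right + 1]),
                max(image[bottom][left:right + 1]),
                max(image[r][left] for r in range(top, bottom + 1)),
                max(image[r][right] for r in range(top, bottom + 1)),
            )
            centre = image[y][x]
            out[y][x] = centre - border if centre > border else 0
    return out
```

The nested loops visit one window per pixel, whereas the hardware evaluates one
window per clock cycle.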
In the next step, the results of the hardware pattern matching are transferred
to the PowerPC. Here, the software Labeling-Engine processes the output buffer
of the Spotlight-Engine. Regions with a similar positive difference value are
grouped together to light regions and a list of light sources is created as a
result.
The hardware-software communication module is used to transfer this list to
the PowerPC. After software processing, the list of detected taillights is sent
back to the main FPGA. Based on the software results, the visualization module
draws green boxes around static light pairs and red boxes around taillights in
the output image, as shown in Figure 5.6.

The implementations of both engines on the Virtex-II 6000 FPGA can be seen
in Figure 5.9, Edge-Engine in the center and Taillight-Engine on the right. By
contrast, the contrast filter on the left mostly consists of wiring only.
Run-time analysis and worst-case tests were performed on video data with a
frame rate of 25 frames per second. The image resolution of each video frame is
384x288 pixels. The hardware and software processing elements have 40 ms to
process one video frame. The hardware run-time can be calculated accurately.
Both the Spotlight-Engine and the Labeling-Engine are working at 50 MHz. The
Spotlight-Engine calculates one pixel per clock cycle. Only four additional
clock cycles are needed at the start of each column. During this time, the
pixel matrix is initialized. The resulting run-time for the Spotlight-Engine
can be calculated by:

\[ t_{\mathrm{Spotlight}} = \frac{w \cdot (h + 4)}{f_{\mathrm{HW}}} \]

Here, $h$ and $w$ are the height and width of the video image and
$f_{\mathrm{HW}}$ is the hardware clock frequency. For an image size of 384x288
and an 11x11 pixel matrix, the processing time for one video image frame is
$t_{\mathrm{Spotlight}} = 2.23$ ms.
The run-time of the Labeling-Engine depends on the amount of white pixels, i.e.
lights recognized by the Spotlight-Engine. There is no additional time needed
for initialization.

\[ t_{\mathrm{Labeling}} = \frac{h \cdot w \cdot (4 \cdot p + 2 \cdot (1 - p))}{f_{\mathrm{HW}}} \]

With $h, w$ representing the image size, $p$ is the probability for each pixel
to be a light pixel extending an existing light region. As evaluation shows,
$p$ is smaller than 0.01 in common video data. The obtained run-time with
$p = 0.01$ is $t_{\mathrm{Labeling}} = 4.46$ ms.

The total hardware run-time is the sum of $t_{\mathrm{Spotlight}}$ and
$t_{\mathrm{Labeling}}$:

\[ t_{\mathrm{HW}} = t_{\mathrm{Spotlight}} + t_{\mathrm{Labeling}} \le 2.23\ \mathrm{ms} + 4.46\ \mathrm{ms} = 6.69\ \mathrm{ms}. \]

The software run-time $t_{\mathrm{SW}}$ includes the communication overhead and
gets up to $t_{\mathrm{SW}} = 19.4$ ms in the worst case. Now the total run-time
can be calculated as $t = t_{\mathrm{HW}} + t_{\mathrm{SW}} = 6.69$ ms $+ 19.4$
ms $= 26.09$ ms $\le 40$ ms, which satisfies the real-time constraint of 40 ms
for each video frame.
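
The run-time budget can be re-evaluated with a few lines of code. The sketch
below only restates the two run-time formulas; the function and constant names
are hypothetical. Evaluated as printed, the formulas yield about 2.24 ms and
4.47 ms, marginally above the quoted 2.23 ms and 4.46 ms, which may be due to
rounding. The conclusion regarding the 40 ms frame period is not affected.

```python
def spotlight_time(width, height, f_hw):
    """Run-time in seconds: one cycle per pixel plus 4 cycles per column."""
    return width * (height + 4) / f_hw

def labeling_time(width, height, p, f_hw):
    """Run-time in seconds: 4 cycles for a light pixel, 2 cycles otherwise."""
    return height * width * (4 * p + 2 * (1 - p)) / f_hw

F_HW = 50e6               # hardware clock frequency in Hz
FRAME_PERIOD = 1 / 25     # 25 frames per second -> 40 ms per frame
T_SW_WORST = 19.4e-3      # worst-case software run-time quoted in the text

t_spot = spotlight_time(384, 288, F_HW)
t_label = labeling_time(384, 288, 0.01, F_HW)
t_total = t_spot + t_label + T_SW_WORST
slack = FRAME_PERIOD - t_total    # remaining time per frame in seconds
```

With these numbers roughly a third of the frame period remains unused, which
leaves room for a larger share of light pixels than $p = 0.01$.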
CeBIT 2008 Demonstration
In 2008 we had the opportunity to demonstrate the Erlangen Slot Machine
together with Bayern Innovativ at the CeBIT fair.
Figure 5.10.: CeBIT 2008 group picture with Prof. Walter Stechele, Rafael
Pohlig, Christopher Claus, Matthias Kovatsch and Mateusz Majer (from left to
right).
In our demonstration a live video signal was streamed to the ESM, processed,
and displayed on an attached VGA panel. The selection of a specific partially
reconfigurable Video-Engine was controlled through our Ethernet ESM-Shell
connection to the PowerPC on the ESM Motherboard. During this event we
successfully demonstrated the run-time reconfiguration capabilities of the ESM
platform.
5.4. A Point-Based Rendering Application
Current graphic cards include advanced graphic processing units to accelerate
the rendering of 3D objects with millions of polygons. As object models grow in
complexity, the rendering approach based on points as primitives is regarded as
superior in terms of scalability and efficiency. Next-generation graphic cards
could contain reconfigurable devices, such as FPGAs, to offer fast
point-rendering units as a new mechanism for custom, run-time exchangeable
accelerators.

We propose a hardware point-rendering architecture tailored specifically for
reconfigurable systems [50, 129]. The presented implementation on the Erlangen
Slot Machine demonstrates on the one hand the computing power of the approach.
On the other hand, it provides valuable insights into possible future
improvements for this application class.
In recent years, two particular factors in the graphics cards sector have
changed dramatically. First, performance and visual quality have leaped into
new areas. Second, graphic cards have been established as computational
accelerators [17]. However, as polygonal models have become increasingly
complex, the size of the projected primitives has decreased accordingly. This
raised the question whether polygons are the right primitives for very detailed
and complex models [130, 131, 132].
The major goal of point-based rendering algorithms is to achieve continuous
interpolation between discrete point samples which are irregularly distributed
on a smooth surface [133, 134, 135, 136].
Rendering large data sets at low
magnication will often cause primitives to be smaller than the output device
pixels. In order to minimize rendering time, it is desirable to control the level
of detail through the use of multi-resolution model objects [136, 137].
Recent approaches such as [138, 139] address high speed point-rendering by
exploiting GPU acceleration and on-board video memory caches.
A state of
the art ASIC chip and a multi-FPGA architecture for point-rendering were
recently presented in [140] and high image quality aspects have been considered
in [141, 142].
A partitioned DSP/FPGA implementation [143] uses the FPGA only for the
Z-buffer test and final screen buffering. All other rendering operations are
performed on the DSP and not in hardware. The design achieves a throughput of
5 million points per second.

Here, we present an efficient Direct Point-Rendering hardware architecture on
an FPGA platform and demonstrate that a high-performance and at the same time
resource-efficient implementation on FPGAs is feasible. Furthermore, the
implementation distinguishes itself from known approaches by a careful HW/SW
partitioning strategy to balance performance and resource utilization
trade-offs.
5.4.1. Background
In point-based rendering, a 3D object is represented by a set of points [130,
131, 134, 142]. Each point $p_i$ consists of its 3D coordinate
$x_i = (x, y, z)^T$, a color value $c_i = (r, g, b)$, and a normal vector $n_i$
that is orthogonal to the surface sampled at the point. The additional
w-coordinate is necessary to obtain the 3D to 2D projection by means of matrix
operations [144], as shown in Figure 5.11.
Figure 5.11.: Direct point rendering is the simplest 3D rendering method. The
points are assumed to be samples of a surface and are transformed to the 2D
screen space. The necessary pipeline is a simplified polygon rendering pipeline.
Data Flow
The rasterization of 3D data points is performed in the rendering pipeline.
The most important standards of this model are OpenGL and Direct3D. A detailed
description of the pipeline is not in the scope of this work, but for reasons
of understandability, a short overview will be given here.

In Figure 5.12, the stages of the rendering pipeline are shown. The model
transformation consists of a translation, a scaling, and a rotation operation
that map 3D points from object space into world space. Furthermore, the view
transformation maps points from world space into the camera space that is
defined by the position and orientation of the virtual camera. Since both
mappings are linear transformations, they can be combined to a single 4×4
matrix, called ModelView matrix $M_{MV}$, that will be explained below.
[Figure content: Point Rendering Pipeline with the blocks Model Memory, Point
Data, Model Transformation, View Transformation, Backface Culling, Lighting,
Projection, Clipping, Persp. Division, View Transformation, Z-Test, Z-Buffer
and Screenbuffer.]
Figure 5.12.: Overview of the main signal flow through the point rendering
pipeline. The ESM implementation of the point-based rendering pipeline is shown
in Figure 5.14.
The subsequent backface culling stage ensures that only points with normal
vectors pointing towards the camera are processed further, i.e., points that
sample surfaces visible to the camera.

After this, the point's actual color value is calculated in the lighting stage.
For this purpose, the color value is weighted by a factor that depends on the
angle between the point's position relative to the defined location of the
light source and its normal vector. This technique is known as Lambert shading.

The purpose of the projection transformation is to map the viewing frustum
defined by the camera parameters (focus, field-of-view, etc.) to a standard
cube with side lengths [−1, 1]. This transformation is also described by a 4×4
matrix, the projection matrix $M_P$.

Based on the unit cube, the clipping stage determines the points that fall
outside the camera frustum and discards them. After that, perspective division
by the w-coordinate occurs. Now, a point's position on the image plane is given
by its x- and y-coordinates. The final viewport transformation determines the
pixel that represents the point according to the current viewport resolution.
Finally, the z-coordinate is used to ensure that only points not occluded by
others are displayed (Z-Test).
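
The last stages of this data flow can be sketched as follows. This is a minimal
illustration and not the hardware pipeline: the function name `to_pixel`, the
flipped y axis (row 0 at the top of the image) and the clamping of the right
and bottom edge are assumptions.

```python
def to_pixel(clip_point, width=640, height=480):
    """Map a homogeneous clip-space point (x, y, z, w) to a pixel.

    Returns (px, py, depth) or None if the point is clipped.
    The y axis is flipped so that row 0 is the top of the image (assumption).
    """
    x, y, z, w = clip_point
    # Clipping: after division the point must lie inside the cube [-1, 1]^3,
    # which is equivalent to -w <= x, y, z <= w for w > 0.
    if w <= 0 or not all(-w <= c <= w for c in (x, y, z)):
        return None
    # Perspective division by the w-coordinate.
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    # Viewport transformation to integer pixel coordinates.
    px = min(int((ndc_x + 1) * 0.5 * width), width - 1)
    py = min(int((1 - ndc_y) * 0.5 * height), height - 1)
    return px, py, ndc_z
```

The returned depth value is the quantity that a subsequent Z-Test would compare
against the stored depth of the same pixel.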
ModelView Transformation
The ModelView matrix is used to transform model coordinates into camera
coordinates. Allowing only affine transformations, we can simplify the last row
of the ModelView matrix, which reduces the number of multiplications from 16
down to 12.

\[
x'_i = M_{MV} \cdot x_i =
\begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & t_x \\
c_{2,1} & c_{2,2} & c_{2,3} & t_y \\
c_{3,1} & c_{3,2} & c_{3,3} & t_z \\
0 & 0 & 0 & 1
\end{pmatrix} x_i
\tag{5.8}
\]

Similarly, we allow only linear transformations for the normal vector. This
reduces the multiplications down to 9 instead of 16.

\[
n'_i =
\begin{pmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & 0 \\
c_{2,1} & c_{2,2} & c_{2,3} & 0 \\
c_{3,1} & c_{3,2} & c_{3,3} & 0 \\
0 & 0 & 0 & 1
\end{pmatrix} n_i
\tag{5.9}
\]

Lighting
A reflection coefficient is produced by the lighting computation and
multiplied with the point color to output the visible screen color. However, an
8 bit per pixel screen buffer is used which is only suitable for gray color
coding.

The normal vector is decoded by the memory controller to Cartesian coordinates
but is not further normalized. Since the normalized vector
$n''_i = n'_i / \|n'_i\|$ is required by the lighting computation to obtain
correct results, the normalization must be performed at this stage.

The reflection coefficient $\varrho_i$ depends on the angle between the point's
surface normal $n'_i$ and the direction to the light source $l$. We use diffuse
reflection for our lighting computations, i.e., the light source is assumed to
be far away. As a result, $l$ is constant for each point. With $l'$ being the
normalized vector of $l$, the coefficient is calculated as

\[
\varrho_i = \cos \angle(l, n'_i) = \langle l', n''_i \rangle
= n''_{i,x} \cdot l'_x + n''_{i,y} \cdot l'_y + n''_{i,z} \cdot l'_z .
\tag{5.10}
\]

Projection
The projection transformation uses the intrinsic matrix values (near plane $n$,
far plane $f$, coordinates of left and right vertical clipping planes $l, r$
and of top and bottom horizontal clipping planes $t, b$). The projection
transformation is shown in Equation 5.11 and can be implemented using 6
multiplications and 3 additions because all denominators are calculated
up-front on the PowerPC.

\[
x''_i = M_P \cdot x'_i =
\begin{pmatrix}
\frac{2n}{r-l} & 0 & -\frac{r+l}{r-l} & 0 \\
0 & \frac{2n}{t-b} & -\frac{t+b}{t-b} & 0 \\
0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\
0 & 0 & -1 & 0
\end{pmatrix} x'_i
\tag{5.11}
\]
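
The transformations of Equations 5.8, 5.10 and 5.11 can be tried out with the
following sketch. It uses plain Python floats instead of the fixed point
arithmetic of the hardware, the helper names are hypothetical, and the matrix
entries and signs are copied from Equation 5.11 as printed rather than from any
graphics library.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) with a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def model_view(rot, trans):
    """Affine ModelView matrix from a 3x3 block and a translation (Eq. 5.8)."""
    return [rot[r] + [trans[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def projection(n, f, l, r, b, t):
    """Projection matrix with the entries and signs of Equation 5.11."""
    return [
        [2 * n / (r - l), 0.0, -(r + l) / (r - l), 0.0],
        [0.0, 2 * n / (t - b), -(t + b) / (t - b), 0.0],
        [0.0, 0.0, -(f + n) / (f - n), -2 * f * n / (f - n)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def lambert(normal, light_dir):
    """Reflection coefficient: dot product of the normalized vectors (Eq. 5.10)."""
    n_len = math.sqrt(sum(c * c for c in normal))
    l_len = math.sqrt(sum(c * c for c in light_dir))
    return sum(a / n_len * b / l_len for a, b in zip(normal, light_dir))

# Example: identity rotation, camera moved back by 5 units along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mv = model_view(identity, [0.0, 0.0, -5.0])
cam = mat_vec(mv, [1.0, 2.0, 0.0, 1.0])
clip = mat_vec(projection(1.0, 10.0, -1.0, 1.0, -1.0, 1.0), cam)
rho = lambert([0.0, 0.0, 2.0], [0.0, 0.0, 1.0])
```

In the projection matrix only six entries require a multiplication at run-time,
since the last row merely copies the negated z-coordinate into w. This matches
the multiplication count given above.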
5.4.2. Rendering Pipeline
The point-rendering implementation is split into the main hardware pipeline
and a software part. The rendering process is controlled through the software
part, as shown in Figure 5.13.
HW/SW Partitioning
The point-rendering pipeline itself is performance-critical and should
therefore be implemented in hardware, as its throughput determines the main
system performance. Consequently, the model memory, the Z-buffer, and the
screen buffer must be hardware controlled. Hence, these parts are implemented
in hardware on the main FPGA.
[Figure content: PowerPC, Crossbar FPGA, MainFPGA with Model Memory Controller,
Point Rendering Pipeline, Protocol FSM and State Memory, and the SRAMs for
Model Memory, Z-Buffer and Screen Buffer, annotated with bit widths and clock
frequencies of 25 MHz and 50 MHz.]
Figure 5.13.: Design overview of the main signal flow on the ESM platform.
Annotated are the signal bit widths and clock frequencies. The implementation
of the point-based rendering pipeline is shown in Figure 5.14.
All matrix computations required by the point-rendering pipeline can be
implemented either in software or hardware. As long as the software execution
time and the communication overhead are not prohibitive, the software solution
saves many hardware resources, in our case 6273 slices and 48 block multipliers
on the Virtex-II 6000 FPGA, and has an inherent flexibility advantage. Adding
a new transformation, e.g., the OpenGL gluLookAt transformation, becomes a
simple software extension. Furthermore, a double precision floating point
number format is used for all arithmetic operations. After computing the matrix
in software, the results are sent to the hardware point rendering pipeline via
the Crossbar, as shown in Figure 5.13.
Software Control Flow
In order to process a point model object, four main steps have to be executed
in software:
1. Model point data must be downloaded onto the Babyboard local memory prior
to any rendering. The model memory stores the point data in coherent point
group objects.
2. Update of the pipeline state, which is fully controlled through software.
Here, only the pipeline state is transferred and, e.g., not the operands for
the ModelView matrix. This means that the software generates the appropriate
pipeline state after computing, e.g., the ModelView matrix $M_{MV}$.
3. Enable execution inside the point-rendering pipeline. Now model point data
is continuously read from the model memory and fed into the point rendering
pipeline. The rendered picture is then written into the output screen buffer,
which implements a double buffering technique. However, the screen buffer and
the Z-buffer must be cleared before a new picture can be rendered.
4. Finally, the rendered picture is read from the screen buffer and transferred
via the crossbar to the VGA output at a resolution of 640 × 480 pixels.
In the following, implementation issues of the hardware pipeline are discussed.
Number Representation
Each point coordinate in our model data is represented by a 24 bit word, which
exactly matches our implemented fixed point Q7.16 number format (7 integer and
16 fractional bits). Additional compression of the coordinates is non-trivial
and was not implemented. However, all normal vectors are compressed. This
allowed us to reduce the bit width from 72 to 15 bit, as proposed in [134]. The
color information is stored in a coded color index. Therefore, one point of our
model data is encoded in 12 byte.
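
The number format can be illustrated by a small conversion routine. The sketch
assumes that the remaining bit of the 24 bit word is a sign bit in two's
complement representation and that out-of-range values saturate; neither
detail is stated in the text, and the function names are hypothetical.

```python
FRACTION_BITS = 16
WORD_BITS = 24
WORD_MASK = (1 << WORD_BITS) - 1

def to_q7_16(value):
    """Encode a real number as a 24 bit two's complement Q7.16 word."""
    raw = round(value * (1 << FRACTION_BITS))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    raw = max(lo, min(hi, raw))      # saturate instead of wrapping (assumption)
    return raw & WORD_MASK

def from_q7_16(word):
    """Decode a 24 bit Q7.16 word back to a float."""
    if word & (1 << (WORD_BITS - 1)):   # sign bit set: negative value
        word -= 1 << WORD_BITS
    return word / (1 << FRACTION_BITS)
```

Under these assumptions the format covers the range from -128 to just below
+128 with a resolution of 2^-16, so model coordinates have to be scaled into
this range before they are downloaded.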
External SRAM Utilization
The ESM platform has 6 SRAM banks with 2 MB capacity each. Our object model
memory occupies two SRAMs and has to deliver 12 byte for every pipeline clock
period. The double screen buffer uses another two SRAMs, see Figure 5.13.
Therefore, only the last two SRAMs can be used to implement the Z-buffer, which
limits our implementation to 16 bit instead of the recommended 24 bit [145].
Pipeline State Vector
Figure 5.14 shows the implementation of the rendering pipeline. The pipeline
state vector holds the current state of the complete pipeline. Table 5.1 lists
all controlled pipeline elements together with the required bit widths. Control
words issued by the protocol state machine have a length of 1306 bit (see the
Data signal outgoing from the protocol FSM in Figure 5.14).
State                     Pipeline Element    Math. Object   Bit Width
ModelView Matrix          ModelView Transf.   4x4 Matrix     384 Bit
Inverse ModelView Matrix  ModelView Transf.   4x4 Matrix     384 Bit
Projection Matrix         Projection          4x4 Matrix     386 Bit
Light Vector              Diffuse Shading     3 Vectors      72 Bit
Scaling and Transl.       Window Transf.      4 Parameters   48 Bit
Background Color          Screen Buffer       1 Parameter    8 Bit
Activation                All                 1 Parameter    24 Bit

Table 5.1.: State vector information of the point-rendering pipeline which
controls the complete rendering process.
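
The bit widths of Table 5.1 add up to the 1306 bit control word. The sketch
below checks this sum and derives one possible layout of the state vector; the
field names, the field order and the back-to-back packing are assumptions made
for illustration only.

```python
STATE_FIELDS = [
    ("modelview_matrix", 384),
    ("inverse_modelview_matrix", 384),
    ("projection_matrix", 386),
    ("light_vector", 72),
    ("scaling_and_translation", 48),
    ("background_color", 8),
    ("activation", 24),
]

def field_offsets(fields):
    """Return ({name: (offset, width)}, total_width) for back-to-back fields."""
    offsets, position = {}, 0
    for name, width in fields:
        offsets[name] = (position, width)
        position += width
    return offsets, position

offsets, total_bits = field_offsets(STATE_FIELDS)
```

Here `total_bits` evaluates to 1306, in agreement with the length of the
control words issued by the protocol state machine.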
Protocol State Machine
The protocol state machine is the main hardware controller. It is responsible
for the hardware control of the point-rendering pipeline as well as the HW/SW
interface. The software part controls the setup phase and the rendering process
by sending 104 bit instruction words to the protocol state machine.
The opcode is encoded in 8 bit and the remaining bits are used for data
transfers. Only four instructions are needed to update the ModelView matrix.

[Figure content: Model Memory Controller, Protocol FSM and the pipeline
elements ModelView, Backface Culling, Lambert Shading, Projection, Clipping,
Perspective Divide, Windowing Transform, Z-Test, Screen Buffer and Color-LUT
with their SRAMs, annotated with bit widths and clock frequencies.]
Figure 5.14.: The complete point-rendering pipeline implemented on the
Virtex-II 6000 FPGA. Data flows from top to bottom and includes point data and
control signals. Each pipeline element can be stalled.
The instruction opcode is grouped into a) model memory operations, b) state
update operations, and c) rendering control operations. Model memory
operations allow the software to alter the model data. The points are stored in
a linear array. This array is segmented into groups with a start and stop
index, as rendering is only performed on complete groups rather than individual
points. State update operations allow the software to update the various
parameters of the pipeline as presented in Table 5.1. Some parameters are set
only once (e.g. the window parameters), others are expected to change quite
often (the ModelView matrix). Finally, rendering control operations enable the
rendering of point groups through the activation of individual pipeline
elements.
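
The relation between the 104 bit instruction words and the 384 bit ModelView
matrix can be made explicit with a short sketch. It assumes that 8 bit of each
word hold the opcode and the remaining 96 bit carry data. The opcode value, the
position of the opcode in the upper bits, the chunk order and the function name
are hypothetical.

```python
OPCODE_BITS = 8
WORD_BITS = 104
PAYLOAD_BITS = WORD_BITS - OPCODE_BITS   # 96 bit of data per instruction

def split_into_instructions(opcode, value, value_bits):
    """Split a wide state value into 104 bit instruction words.

    opcode: hypothetical 8 bit code placed in the upper bits of each word.
    Returns the list of instruction words, least significant chunk first.
    """
    count = -(-value_bits // PAYLOAD_BITS)          # ceiling division
    mask = (1 << PAYLOAD_BITS) - 1
    words = []
    for k in range(count):
        payload = (value >> (k * PAYLOAD_BITS)) & mask
        words.append((opcode << PAYLOAD_BITS) | payload)
    return words

# A 384 bit matrix needs 384 / 96 = 4 instruction words.
matrix_bits = (1 << 384) - 1
words = split_into_instructions(0x21, matrix_bits, 384)
```

With 96 data bits per word, a 384 bit matrix is covered by exactly four
instructions, which is consistent with the instruction count stated for a
ModelView update.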
Pipeline Elements
The implemented point-rendering pipeline has a throughput of one point per
clock cycle. Every rendering transformation and visibility test is mapped to a
corresponding hardware pipeline element, as shown in Figure 5.14. Two signals
are used to control the visibility of the currently processed point.
The latency of a pipeline element is not crucial, as long as its throughput is
high. All control signals and point data are synchronously passed through each
pipeline element.
Z-Buffer
For an optimal Z-buffer implementation, a dual-ported SRAM is needed, which was
not available on the ESM. A pipelined variant of the Z-buffer algorithm
requires dual-port memories because the stored depth value has to be read and
the new one written within one pipeline clock period. Therefore, the SRAM
controller of the implemented Z-Test is clocked at double the pipeline clock
frequency so that the available single-port SRAMs can be used. However, special
care has to be taken because the same point and control data are now sampled
twice.
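
The read and write access per point can be modelled in software as follows.
This is a behavioural sketch only: the class name is hypothetical, smaller
depth values are assumed to be closer to the camera, and the two memory cycles
per point stand for the doubled SRAM clock.

```python
class SinglePortZBuffer:
    """Z-buffer backed by a memory that allows one access per memory cycle."""

    def __init__(self, width, height, far=0xFFFF):
        self.width = width
        self.depth = [far] * (width * height)   # 16 bit depth values
        self.memory_cycles = 0

    def test_and_update(self, x, y, z):
        """Return True if the point is visible; costs two memory cycles."""
        address = y * self.width + x
        stored = self.depth[address]            # cycle 1: read old depth
        self.memory_cycles += 1
        visible = z < stored
        # cycle 2: write slot (the new depth is stored only if visible)
        self.memory_cycles += 1
        if visible:
            self.depth[address] = z
        return visible
```

Since every point occupies the memory for two cycles, a memory clocked at twice
the pipeline frequency sustains one point per pipeline clock.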
5.4.3. Implementation Results
The hardware resource utilization for the implemented point-rendering pipeline
is shown in Table 5.2.

In our implementation, three different variants for the multiplication were
used. The first variant uses only the MULT18x18 blocks found in the Virtex-II,
which results in the use of four of these blocks per multiplication. The second
variant uses a hybrid multiplier generated by the CoreGen utility [128]. It
uses one MULT18x18 block and implements the remaining logic using slices. Due
to pipelining, this implementation has a throughput of one 3D point per clock
cycle. The third variant uses only slices to minimize the resource utilization.
Element          Slices   Mult.   Clock (MHz)   Latency   Impl.
ModelView         6,163      21       193.498         4   Hybrid
Backface Cull.       98       0             -         1   -
Lighting          2,365      24       130.873        58   Hybrid
Projection          391      24       219.106         3   MULT18x18
Clipping            231       0             -         1   -
Persp. Div.       1,965      12        83.942        46   MULT18x18
Window Trans.       326       0       200.120         2   Slices
Z-Test              291       0       127.632         5   -
Screen Buffer       118       0       210.748         1   -
Color Sel.            0       0             -         1   -
Sum              11,948      81             -       122   -

Table 5.2.: Hardware resource utilization for the point-rendering pipeline. No
details on the clock frequency can be given for Backface Culling, Clipping and
Color Selection since these modules are too small. The last column (Impl.)
shows which multiplier variant was used to implement the multiplications of the
respective transformation.
Our final hardware implementation of the point-rendering pipeline, as shown in
Figures 5.13 and 5.14, consumes 13,462 (40.4%) slices and 80 (56%) block
multipliers of the Virtex-II 6000 FPGA, and achieves a clock frequency of 60
MHz. This means that we can render 60 million 3D points per second. Our model
memory can store model objects with up to 262,144 points in the final
implementation. This number is only limited by the size of our external SRAM
memory bank.

In comparison, the proposed GPU system in [138] renders up to 28M mid-quality
or up to 10M high-quality 3D points per second on the latest graphics hardware.
Older software implementations are only able to render up to 2 million points
per second on a high-end graphics workstation like the SGI Onyx2 [134].
Our implemented HW/SW co-design architecture for the point-rendering pipeline
achieves a high performance and a resource-efficient implementation on FPGAs.
In this implementation a careful HW/SW partitioning was used to find a good
performance and resource utilization trade-off. The resulting rendering
pipeline architecture can easily be extended to a parallel architecture with
two or even four rendering pipelines. However, the memory bandwidth will then
become the main performance bottleneck. Still missing features are surface
splatting and level of detail control [134, 142].
Figure 5.15 shows two screenshots of the implemented point-rendering pipeline
running on the ESM platform, shown in Figure 3.5. The model object consists of
45,357 points and is displayed with 34.5 frames per second. Screenshot 5.15 a)
shows the plain point model, whereas screenshot 5.15 b) shows the same model
with Lambert shading activated.

Figure 5.15.: Rendered Venus point model screenshots a) without and b) with
shading (45,357 points). The pictures were directly taken from the VGA output
of the ESM platform shown in Figure 3.5.

Due to the limited external frequency of the SRAMs used on the ESM platform,
the point rendering pipeline has to wait 16 clock cycles for a new point
sample. Therefore, our current rendering throughput drops to 3.75 million
points per second.
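
The throughput figures follow directly from the clock frequency and the number
of wait cycles, as the following sketch shows. The function names are
hypothetical, and the frame rate bound assumes that point throughput is the
only limiting factor, which is a simplification.

```python
PIPELINE_CLOCK_HZ = 60e6      # achieved pipeline clock frequency
WAIT_CYCLES_PER_POINT = 16    # cycles between two point samples from SRAM

def points_per_second(clock_hz, cycles_per_point):
    """Point throughput for a given number of clock cycles per point."""
    return clock_hz / cycles_per_point

def max_frame_rate(points, rate):
    """Upper bound on frames per second if only point throughput limited it."""
    return rate / points

peak = points_per_second(PIPELINE_CLOCK_HZ, 1)
sustained = points_per_second(PIPELINE_CLOCK_HZ, WAIT_CYCLES_PER_POINT)
bound = max_frame_rate(45357, sustained)
```

The resulting bound of roughly 82 frames per second for the Venus model lies
above the measured 34.5 frames per second. The text does not break down this
gap; clearing the buffers and reading out the picture for every frame are
possible contributors, which is an assumption on our part.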
A potential area of future research is the use of partial run-time
reconfiguration of hardware pipeline elements. The three most beneficial
hardware units are the lighting stage, the screen buffer stage, and the model
object memory controller. The run-time reconfiguration of the lighting pipeline
element will enable loading of custom hardware shaders right into the rendering
pipeline. By changing the screen buffer stage during run-time, we can include
custom filters like Sobel or median filters before writing a picture to the
screen buffer. Another very interesting concept is custom memory controllers
which can create procedural model objects [146] based on precomputed parameters
stored in the model memory.
6. Conclusions
6.1. Summary of Contributions
The main contribution of this thesis is the development and the implementation
of an FPGA-based computer supporting partial module development and
instantiation on a standard FPGA development platform [38, 39, 40, 41, 42]
called Erlangen Slot Machine (ESM). The separation of peripheral I/Os from the
main FPGA decouples I/Os from their physical pin locations on the FPGA. The
implications of external memory access, inter-module communication, I/O pin
decoupling, and tool support are also addressed.
Further analysis of inter-module communication schemes shows their impact on
the communication bandwidth and delay. It is found that direct neighbor
communication via bus-macros is the fastest scheme and requires the least
amount of additional resources. If two distant partial modules need to be
linked together, then either a crossbar or a reconfigurable-multiple-bus
communication link can be used [46, 47].
The detailed specification of the Erlangen Slot Machine describes an FPGA-based
architecture responding to the previously identified dilemmas of existing FPGA
platforms, as described in Section 3.4. The Erlangen Slot Machine is the first
FPGA-based platform design to fully integrate partial reconfiguration support
at the printed circuit board level [44]. Unlike other FPGA-based platforms, it
frees the FPGA from run-time I/O pin binding and supports a notion of logical
slots that provide reconfigurable regions with predefined communication
interfaces and local SRAM access.
Based on the analysis of existing FPGA boards, the implementation of the
Erlangen Slot Machine architecture resulted in a two-board platform solution
with a Motherboard and a Babyboard. A crossbar interface on the Motherboard
allows peripheral I/O signals to be switched to any I/O pin of the main FPGA
located on the Babyboard [45]. Additionally, the Motherboard contains video
input and output peripherals and an embedded Linux software framework to
control the run-time reconfiguration of the main FPGA and the peripheral I/O
flow through the crossbar interface. Furthermore, a PowerPC hosting the Linux
kernel is connected to a network interface which is used to update software
components and bitstreams at run-time. The FPGA resources and in fact the whole
platform can be remotely controlled. Several users can share one ESM platform
during development as long as critical library calls are locked for exclusive
use. Software running on the PowerPC microprocessor can communicate with the
partial modules on the main FPGA through memory-mapped registers of the
crossbar FPGA. On the main FPGA, a communication module is needed inside each
partial module that utilizes a software-hardware communication link.
The Babyboard contains a Virtex-II 6000 FPGA as the main partially
reconfigurable device, a reconfiguration management FPGA (RCM), a flash memory
for bitstream caching, six external SRAM banks, and a CPLD device for start-up
configuration [43]. Reconfiguration management is performed under the control
of the Motherboard, but bitstreams are normally loaded from the local flash
memory located on the Babyboard. If a partial module has to be reconfigured,
control commands are sent by the software running on the PowerPC to the RCM.
The relocation of a partial bitstream is set by an offset parameter passed on
to the RCM, which then modifies the partial bitstream on-the-fly during
bitstream loading.
To complement the hardware support built into the ESM platform, two software
tools were developed to ease the development of partially reconfigurable
modules. SlotComposer is a tool for the automated bitstream generation of
partial modules. It converts a VHDL design into a partial design by modifying
the top-level design file and the constraint file. First, SlotComposer inserts
bus-macros and intermediate signals into the top-level VHDL file between
each partial module and the static part of the design. At the same time, it
modifies the constraint file to place all bus-macros at their correct
locations. Then a new project directory tree is created, together with scripts
for batch synthesis and the partial design flow. Together, these steps allow
an automated transformation of a VHDL design and the generation of partial
bitstreams. The generated scripts for synthesis and the partial design flow do not
require any user interaction or GUI use.
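The constraint-file step can be sketched as follows. The code generates one placement line per bus-macro instance and appends the lines to existing constraint text. It is a minimal sketch and not SlotComposer's actual code: the instance names, the slice coordinates, and the helper names are hypothetical, and the exact constraint syntax and site names depend on the tool version and the bus-macro type.

```python
from typing import Dict, List, Tuple


def bus_macro_constraints(placements: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return one UCF-style LOC line per bus-macro instance.

    placements maps a (hypothetical) instance name to an (x, y) slice position.
    """
    lines = []
    for name in sorted(placements):
        x, y = placements[name]
        lines.append(f'INST "{name}" LOC = "SLICE_X{x}Y{y}";')
    return lines


def append_constraints(ucf_text: str, new_lines: List[str]) -> str:
    """Append the generated constraints to existing constraint-file text."""
    body = ucf_text.rstrip("\n")
    header = "# bus-macro placement (generated)"
    return "\n".join([body, header] + new_lines) + "\n"
```

Sorting the instance names keeps the generated file stable between runs, which helps when the constraint file is kept under version control.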
Another software tool generates a set of partially reconfigurable hardware modules, each implementing a reconfigurable hardware task, for benchmarking purposes.
Each generated partial module has a simple communication interface
to the operating system firmware running on the PowerPC. The execution
time and the physical size of each task are specified before its generation and
are therefore fixed at design time. The current state of each task can be monitored and changed through the communication interface embedded in each task.
These features enable the comparison of time overheads and different scheduling
strategies for partial reconfiguration on various FPGA platforms.
To evaluate the fitness of the ESM platform for hardware designs that are close to
commercial requirements, two video processing applications were implemented.
Both applications utilize all features of the ESM platform.
The first video
application implements various run-time reconfigurable video filters [48]. The
second application implements lane and object detection
that could be used in a driver assistance system [49].
Through a software-based interface running on the PowerPC microprocessor, the
type of video filter can be changed on demand or according to a schedule. In
this application the incoming video stream is switched by the crossbar FPGA
to the static deinterlacing module on the main FPGA. After deinterlacing, the
image stream flows to the partially reconfigurable region, where the video filter
circuits can be changed on demand during run-time. The processed output of
the video filter module is passed to a static module that returns the processed
video image stream to the crossbar. From there, the crossbar directs the
processed video stream to the VGA output.
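The data flow described above can be modeled in software as a pipeline with fixed stages and one exchangeable stage. The sketch below is only a software analogy: the class `VideoPipeline`, the toy filters, and the representation of a frame as a list of 8-bit values are assumptions made for illustration and are not part of the ESM firmware.

```python
from typing import Callable, List

Frame = List[int]  # toy stand-in for one line of 8-bit pixel values
Filter = Callable[[Frame], Frame]


class VideoPipeline:
    """Static input and output stages around one exchangeable filter stage."""

    def __init__(self, video_filter: Filter) -> None:
        self._filter = video_filter

    def reconfigure(self, video_filter: Filter) -> None:
        """Swap the filter stage; the static stages are left untouched."""
        self._filter = video_filter

    def process(self, frame: Frame) -> Frame:
        deinterlaced = list(frame)            # placeholder for the static deinterlacer
        filtered = self._filter(deinterlaced)  # exchangeable (partial) stage
        return filtered                        # static output stage back to the crossbar


def invert(frame: Frame) -> Frame:
    return [255 - p for p in frame]


def threshold(frame: Frame) -> Frame:
    return [255 if p >= 128 else 0 for p in frame]
```

Calling `reconfigure(threshold)` between two frames corresponds to loading a different filter module into the partial region while the surrounding static modules keep running.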
During run-time, the partial reconfiguration of the video application is controlled
by the OS framework running on top of the PowerPC's Linux kernel.
The reconfiguration process is controlled through an interactive command-line program
called ESM shell. It is based on the same software framework that provides an
API to monitor and control all aspects of the ESM platform.
The last application implements a point rendering pipeline on the ESM platform.
Point rendering is an alternative 3D rendering scheme based on point clouds instead of traditional triangle meshes [50]. The rendering pipeline is implemented
on the main FPGA and can compute 81 fixed-point multiplications in a single
cycle. However, the coefficients for the 2D view are calculated in floating-point
precision on the PowerPC microprocessor. After calculation, these coefficients
are transformed to a fixed-point representation and sent through the crossbar
to the rendering pipeline on the main FPGA. The software part of the
application thus controls the rendering pipeline in real-time and precomputes
the coefficients in floating-point format.
The pipeline itself reaches a point rendering throughput of
60 million pixels per second, independent of the camera view, but in practice it is limited by
the memory bandwidth required to read pixels from memory. Because each 3D
pixel has a word size of 12 bytes, the resulting rendering throughput is reduced
to 3 million pixels per second.
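The conversion of floating-point coefficients into a fixed-point representation can be sketched as follows. The number of fractional bits (`FRAC_BITS = 16`) is an assumption chosen for illustration; the text does not state which fixed-point format the rendering pipeline uses, and the function names are hypothetical.

```python
# Assumed number of fractional bits; the actual format is not stated in the text.
FRAC_BITS = 16


def to_fixed(value: float, frac_bits: int = FRAC_BITS) -> int:
    """Scale a floating-point value and round it to the nearest integer."""
    return int(round(value * (1 << frac_bits)))


def from_fixed(raw: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point integer back to floating point."""
    return raw / (1 << frac_bits)


def fixed_mul(a: int, b: int, frac_bits: int = FRAC_BITS) -> int:
    """Multiply two fixed-point values and rescale the double-width product."""
    return (a * b) >> frac_bits
```

In this scheme the software side only has to call `to_fixed` once per coefficient, while the hardware performs integer multiplications followed by a constant shift, as in `fixed_mul`.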
The results presented in this thesis indicate considerable promise for the integration of partial design flow support into future FPGA software tools.
If run-time partial reconfiguration is truly to become a familiar feature of mainstream FPGA designs, the FPGA's I/O pin layout and the software tool support
itself will need to be specifically designed to support it in greater
depth. The ESM architecture and its platform tools represent an advance in
this direction. However, current shortcomings, such as the design flow and the debug support for partial modules, may hinder the widespread adoption of partial
reconfiguration in industrial designs. It can be hoped that further research will
continue to address these issues and ultimately clarify whether partial reconfiguration is a good alternative or whether recent developments in stream computing
and massively multicore processor architectures will prove to be the better technology.
6.2. Interdisciplinary Research Platform
Built to make partial hardware reconfiguration a reality, the
Erlangen Slot Machine platform has shown its benefits as a generic interdisciplinary platform [1, 2] that is being used in several quite different application
fields and research projects:
• Reconfigurable Networks (ReCoNets):
In the ReCoNets project, reconfigurable nodes are connected together to form a network of reconfigurable
computers [147].
Novel procedures for self-repair and intelligent partitioning were developed to achieve a higher level of fault tolerance.
In order to guarantee short repair times in case of node defects, the placement of tasks is optimized and replicated nodes are created [66].
The ESM platform has been integrated and used in this network. Applications
taken from automotive networking have been shown to provide sophisticated implementations for hardware and software tasks that may migrate
within the network.
• Reconfigurable Operating Systems (ReCoNos):
The group of Prof. Platzner
developed new aspects of operating systems for reconfigurable hardware
based on the ESM platform. Here, it was shown for the first time that
operating system resources can be shared between software programs
and reconfigurable hardware modules, e.g. for synchronization [148].
• Partial Module Visualization:
The group of Prof. Becker is known for its
research on dynamic 2D routing and placement. Here, the ESM platform provided an ideal experimentation platform due to its large FPGA without integrated processors and its unfragmented resources. The external
PowerPC was used for on-line reconfiguration of the routing calculations. Furthermore, a visualizer of reconfigurable modules was developed
and demonstrated at FPL 2008 [149].
• Reconfigurable Video-Engines (AutoVision):
The ESM was also applied
to develop a reconfigurable driver assistance system. The group of Prof.
Stechele works on reconfigurable video engines which adapt to the current driving situation in order to increase driving comfort and prevent car
accidents. The ESM platform was chosen because of its flexibility and its
sufficient available memory. Results of this joint work have been published
in [150, 151].
Notably, partially reconfigurable video engines applied to
automotive applications were demonstrated jointly at CeBIT 2008, as
shown in Figure 5.10.
• Partitioning Strategies:
The group of Prof. Merker applied the ESM for
the implementation of parallel algorithms, because 1) the FPGA provided
sufficient resources for the implementation, 2) local SRAM allowed the
implementation of tasks which needed a lot of local storage, and 3) the
communication structures of the ESM offered new opportunities for the
exchange of data between tasks. Furthermore, the ESM was used to develop new partitioning strategies [152].
• Task Preemption:
Despite the possibility to execute several hardware tasks
in parallel on an FPGA, partial reconfiguration typically runs sequentially.
On all available platforms there exists only one reconfiguration port, which is used exclusively during
the reconfiguration of a hardware task.
Single processor scheduling algorithms for task reconfiguration with preemption
have been evaluated in a real-time application implemented on the Erlangen Slot Machine. Besides allowing reconfigurable connections of peripherals to pins of the FPGA, the Virtex-II FPGA of the ESM can host
applications requiring quite a large number of slots. This has been used
to study and develop preemption in the reconfiguration phase, see [153].
• Security of ECC implementations:
Finally, the Erlangen Slot Machine was
also used in the project of Prof. Huss on securing ECC implementations against differential power analysis [154].
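The effect of the single reconfiguration port mentioned in the Task Preemption item above can be illustrated with a small timing model. The sketch assumes that every task has its own free slot, that requests are served in order of arrival, and that a task starts executing as soon as it is loaded; it does not model preemption and is not the scheduling simulator used in this thesis. The names `Task` and `schedule` are hypothetical; R_i and C_i follow the notation of Figure 4.11.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    name: str
    arrival: float   # time at which the task request arrives
    reconfig: float  # reconfiguration time R_i
    execute: float   # core execution time C_i


def schedule(tasks: List[Task]) -> List[Tuple[str, float, float, float]]:
    """Return (name, reconfig_start, exec_start, finish) per task.

    Reconfigurations are serialized on one port; executions may overlap.
    """
    port_free = 0.0
    result = []
    for task in sorted(tasks, key=lambda t: t.arrival):
        reconfig_start = max(task.arrival, port_free)
        exec_start = reconfig_start + task.reconfig
        port_free = exec_start  # the port is released once the task is loaded
        result.append((task.name, reconfig_start, exec_start, exec_start + task.execute))
    return result
```

In this model a long reconfiguration delays all later requests even if free slots are available, which is the situation that preemption in the reconfiguration phase addresses.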
6.3. Future Work
There are many possible directions for future research. We will touch on a few
directions that could be explored based on the conclusions of this thesis.
The memory subsystem of any given platform is fixed for its lifetime and can
be a performance bottleneck for a number of applications, especially if partial
hardware applications from different domains are run on the same platform.
In this case the reconfigurable platform must implement a memory subsystem
satisfying the most common use. However, some applications benefiting from
partial reconfiguration will not be able to run at full speed due to a suboptimal
memory architecture. The question is how to increase run-time customization
of the memory subsystem without too much overhead.
One possible solution
could be a reconfigurable multi-port memory controller with adjustable caching
support.
Another aspect of future work is the update of the ESM platform to the newest
FPGA technology such as a Virtex 5 or Virtex 6 architecture [27]. The open
question is whether the flexibility of the external crossbar on the ESM Motherboard can be replaced by an internal structure inside the new FPGA
without increasing the complexity of the design flow for partial modules. Using a newer Xilinx Virtex 5 FX family would also allow implementation of the
operating system directly inside the FPGA and use of the 32 bit ICAP reconfiguration interface instead of the external 8 bit SelectMAP reconfiguration
interface found in Virtex-II FPGAs.
A further motivation for future research is the interesting, but still open, question of whether partial reconfiguration can be used successfully in embedded systems
targeting aerospace applications. One interesting use could be the detection of and
recovery from single event upset faults caused by cosmic radiation in the numerous SRAM cells inside an FPGA. Potentially, with the help of partial reconfiguration only the corrupted region of the FPGA could be reconfigured, while
at the same time the unaffected majority of the FPGA circuit could continue to
operate without any interruption [155]. In this case, partial reconfiguration could
be used to heal hardware regions of the FPGA during run-time.
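The idea of repairing only the corrupted region can be sketched in software. The example assumes that the configuration data of each region can be read back and compared against a stored checksum of a golden copy; whether and how a real system performs this readback is outside the scope of this sketch, and the region names and helper functions are hypothetical.

```python
import zlib
from typing import Dict, List


def find_corrupted_regions(readback: Dict[str, bytes],
                           golden_crc: Dict[str, int]) -> List[str]:
    """Return the regions whose read-back data no longer matches its CRC."""
    return sorted(region for region, data in readback.items()
                  if zlib.crc32(data) != golden_crc[region])


def repair(device: Dict[str, bytes], golden: Dict[str, bytes],
           golden_crc: Dict[str, int]) -> List[str]:
    """Rewrite only the corrupted regions; all other regions stay untouched."""
    corrupted = find_corrupted_regions(device, golden_crc)
    for region in corrupted:
        device[region] = golden[region]  # stands in for a partial reconfiguration
    return corrupted
```

Only the regions returned by `repair` would have to be reloaded, so the time spent on the reconfiguration port grows with the number of upsets rather than with the size of the device.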
A. Glossary
Area Group Constraints
Area group constraints are used to link different design instances for grouped
placement. All grouped elements will be placed in the same region. The size and
shape of that region are defined through an additional area group range
constraint. Each partial design must have at least two area groups. One area
group constrains the base design while the other constrains all instances
included in the partially reconfigurable region.
Area Group Range Constraints
After the definition of area group constraints, the shape, size and position of
each area group must be specified. The area group range constraints define the
slice range and BlockRAM range for each partially reconfigurable region.
Base Design
The base design contains the entire design aside from the partial modules. It
contains the static part of the design and remains in operation during dynamic
reconfiguration.
Bitfile
Synonym for Bitstream.
Bitstream
After a hardware logic design has been synthesized, mapped, placed and routed,
the device-specific configuration data can be generated. This configuration
data is called bitstream or bitfile. The term refers to the file containing the
configuration data. For the hardware logic design to start operation, the
bitstream of this design has to be loaded into the FPGA.
Bus-macros
Bus-macros are pre-placed, pre-routed hard-macro blocks that lock signals
between partial and static modules into defined positions. They are required by
the PR design flow.
Dynamic Reconfiguration
Synonym for Run-time Reconfiguration.
Full Bitstream
A full bitstream contains the configuration data of the base design as well as
the configuration data for the partial reconfiguration module. It is used to
power up a partially reconfigurable design.
Hardware Task
Synonym for Partial Reconfiguration Module (PRM). However, the term Hardware
Task is used to emphasize the dynamic nature, flexibility and analogy to
software tasks. Hardware tasks are partially reconfigurable modules with an
additional control interface.
Modular Design
Modular Design is a development style that is coupled to a vendor-specific
design flow and allows designs to be broken into independent modules. These
modules can then be coded and synthesized separately.
Off-line and On-line Algorithms
An on-line algorithm can process its input information piece by piece, without
having the entire input available from the start. In embedded systems,
real-time constraints on the processing time often have to be considered. This
limits the computational complexity that an on-line algorithm can afford. In
contrast, an off-line algorithm is given the whole problem data from the
beginning and is required to output an answer which solves the problem at hand.
In most cases the memory and computational demand of an off-line algorithm does
not affect an embedded system, as only the solution of the off-line algorithm
is implemented in the embedded system.
Partial Bitstream
During the PR flow a bitstream is generated for each PR module inside the
design. These bitstreams are called partial bitstreams as they contain only the
configuration data of a single module. A partial bitstream can only be loaded
after a full bitstream.
Partial Reconfiguration (PR)
Partial reconfiguration is the process of reprogramming only a subset of the
FPGA device at run-time. Partial reconfiguration is performed while the device
is active. The programming process does not interfere with active logic on the
device.
Partial Reconfiguration Module (PRM)
Design modules that can be swapped in and out of the device on the fly (at
run-time) are referred to as partial reconfiguration modules, or PRMs. Multiple
PRMs can be defined for one region, but a PRM cannot belong to multiple
partially reconfigurable regions.
Partial Reconfigurable Region (PRR)
A specific part of the FPGA reserved for a partial reconfiguration module is
called partial reconfigurable region, or PRR. Area group range constraints are
used to define the size, shape and position of a PRR. Area group constraints
are used to link a PR module with a specific PR region.
Reconfigurable Application
As shown in Figure 4.6, a reconfigurable application running on the ESM
includes a custom scheduler and placer as well as a pool of hardware tasks.
Reconfigurable Computing (RC)
Reconfigurable computing employs a reconfigurable device for the acceleration
of computing-intensive applications. The reconfigurable device typically
supports run-time reconfiguration of partial regions and can be an FPGA, a
microprocessor with a reconfigurable unit, or a coarse-grained device.
Relocation
Relocation enables the placement of partial modules into reconfigurable regions
other than the one that was specified during the bitstream generation process.
Relocation is performed prior to bitstream loading. During this process,
specific offsets inside the bitstream are modified to reflect the new region on
the FPGA.
Run-time Full Reconfiguration
The reconfigurable device is restarted after a new configuration for the whole
device has been programmed at run-time. This is shown in Figure 1 b).
Run-time Reconfiguration (RTR)
The reconfigurable device is reprogrammed at run-time. The term covers both
Run-time Full Reconfiguration and Partial Reconfiguration (PR), as shown in
Figure 1 b) and c).
B. Technical Specification of the ESM
Main FPGA            Xilinx Virtex-II 6000
RCM FPGA             Xilinx Spartan-IIE 400
CPLD                 Xilinx XCR 3128XL
Crossbar link        264 bits at 50 MHz add up to 13.2 Gbps
Memory
  SRAM               6 modules of 2 MByte each,
                     asynchronous SRAM (ISSI IS61LV10248)
  DDR SDRAM          up to 512 MByte (not implemented yet)
  Flash              64 MByte (Samsung K9F1208UOM)
Debugging            22 bits Debug_IO, JTAG
I/Os
  General purpose    117 bits
  EPP                8 bits
External clock       Cypress CY22393FC
Table B.1.: Technical specification of the ESM Babyboard
Crossbar FPGA        Xilinx Spartan-IIE 600
Video-Out FPGA       Xilinx Spartan-IIE 400
Embedded PowerPC     MPC875 (100 MHz)
Crossbar link        264 bits at 50 MHz add up to 13.2 Gbps
Memory
  PowerPC SDRAM      2 x 32 MByte (Samsung K4S561632E-TL75)
  PowerPC flash      4 x 4 MByte (AMD AM29LV320DB90WMC)
  Crossbar SDRAM     32 MByte (Samsung K4S561632E-TL75)
  Graphic SDRAM      2 x 8 MByte (Micron MT48LC4M16A2)
  FPGA flash         8 MByte (Xilinx XCF08PVO48CES)
Debugging            BDM, JTAG
I/Os
  IEEE1394           2 x FireWire (not implemented yet)
  Audio              3.5 mm analog stereo (not implemented yet)
  Video              Composite video in and VGA out
                     S-Video in, composite video out, S-Video out, and DVI out
  Ethernet           100 Mbit connected to the PowerPC
  USB                USB 1.0 connected to the PowerPC
Controller
  Audio              Audio Codec 97 (Cirrus Logic CS4202-JQ)
  Video              24-bit RAMDAC (Analog Devices ADV7125JST330)
                     DVI transmitter (TI TFP410PAP)
                     RGB-NTSC/PAL encoder (Analog Devices AD725AR)
                     9-bit video input processor (Philips SAA7113H)
Configuration        JTAG, Flash, and BDM
Clock PLLs           Cypress CY22393FC
                     Cypress CY2300SC
                     ICS ICS307-02
Table B.2.: Technical specification of the ESM Motherboard
List of Figures
1.1. The architecture of the Xilinx Virtex family of FPGAs allows design modules to be swapped on-the-fly using a Partial Reconfiguration (PR) methodology [20, 21]. Each partial module is placed in a predefined area called PR region. This allows multiple design modules to time-share resources on a single device, while the base design and all external links continue to operate uninterrupted.
1.2. Different reconfiguration modes supported by the ESM platform: a) static full reconfiguration, b) run-time full reconfiguration, and c) run-time partial reconfiguration.
1.3. The feed-through line problem with relocatable modules. Placing a new module B into slot two requires that the new module provides all feed-through lines needed by slot one and three. This fact disables any module relocation and makes it impossible to place modules with different feed-through requirements into the other slots.
1.4. Pin distribution of a VGA module on the RC200 platform. It can be seen that the VGA module occupies pins on the bottom and right FPGA borders. In consequence, only a narrow part on the left side is available for dynamic module reconfiguration.
1.5. Overview of a reconfigurable computing platform. The reconfigurable hardware device is controlled by an operating system which loads partial tasks on request.
2.1. Basic logical structure of an FPGA device.
2.2. Global view of the array structure inside a Xilinx Virtex-II FPGA. Note that the interconnect between the CLBs is not shown but comprises 80% to 90% of the total chip area [65, 56].
2.3. Internal structure of a Configurable Logic Block and a slice element. The left figure shows that a CLB consists of four slices and a switch matrix for long distance connections [25]. The right figure depicts the internal structure of a slice. It can be configured to implement logic functions or used as a memory element. Each slice contains two registers (Flip-Flops).
2.4. Usage of bus-macros inside a Virtex-II FPGA between partially reconfigurable modules (PRMs) and the static base design or other partially reconfigurable modules.
2.5. Example of a coarse-grained reconfigurable architecture WPPA with parameterizable processing elements (WP PEs) [72, 73].
3.1. ESM architecture overview with main FPGA, crossbar and an external PowerPC microprocessor for system control functions. The architecture of the Babyboard is further refined in Figure 3.7. The Motherboard is shown in Figure 3.12.
3.2. Inter-module communication possibilities on the ESM: a) bus-macro, b) shared memory, c) reconfigurable multiple bus (RMB), d) external crossbar. Hardware modules can also communicate with software running on the PowerPC microprocessor via the crossbar.
3.3. ESM slot architecture with six macro-slots (S1, S2, ... S6). In order to allow access to the RMB crosspoints (CP) and SRAM banks, one macro-slot consists of three micro-slots. Physically, one micro-slot occupies exactly four CLB columns.
3.4. Schematic diagram of the ESM showing the implemented two-board solution with an FPGA Babyboard and a supporting Motherboard.
3.5. ESM implementation of the FPGA Babyboard and the supporting Motherboard. On top of the Motherboard sits the Babyboard with the Virtex-II 6000 FPGA. Additional technical data and examples are available at http://www.r-space.de.
3.6. Slot architecture of the main FPGA with macro-slots built from micro-slots.
3.7. The main components of the Babyboard are the main FPGA for user applications, a Reconfiguration Manager (RCM) FPGA for configuration management, and a CPLD for the initialization routines after power-up.
3.8. The ESM Babyboard and its components.
3.9. Simple reconfiguration manager architecture.
3.10. Architecture of the ESM reconfiguration manager with plug-ins such as Flash, ECC, module relocator and other possible plug-ins.
3.11. Four different workload scenarios for the reconfiguration manager.
3.12. The main component of the Motherboard is the Crossbar FPGA which connects all peripherals, PowerPC, and Video-Out FPGA with the main FPGA on the Babyboard.
3.13. The ESM Motherboard and its components.
3.14. Internal data flow structure of the crossbar FPGA with the currently implemented units and associated signals. The PPCcom module can directly access the configuration registers of the Crossbar module which are used to program the requested connection between the main FPGA and the peripheral devices.
4.1. Partially reconfigurable design with a single partially reconfigurable region, PR Region A. Partial reconfiguration modules PRM A1, A2, A3 can be loaded into PR Region A. All PRMs of the same PR Region must have the same communication interface but there are no constraints on what logic is implemented inside the module.
4.2. The Partial Reconfiguration design flow consists of seven steps. HDL design description and synthesis is the first step. The constrain step (2) can be refined after the optional non-PR implementation (place and route) step (3) of the top-level design. Main sources of problems are violations in Area Group (AG) constraints. The implement base design step (5) combines bus-macros, the static part and I/O constraints in a base design. In step six all PR Modules are placed and routed within their Area Group constraints. Merge step (7) creates the bitstreams for the base design and all PR Modules.
4.3. Based on a modular design, SlotComposer automatically inserts and places bus-macros inside the top-level VHDL design. Bus-macros are correctly connected in between static and partial modules. The shape of a partial module can be changed to create valid locations for bus-macros. Then a new project directory structure is created together with the partial design script for partial and base bitstream generation.
4.4. The SlotComposer application converts modular VHDL designs into partial designs. After the selection of the project directory, user constraints file, FPGA device type and bus-macros, the project can be converted to adhere to the PR design flow.
4.5. The SlotComposer application converts modular VHDL designs into partial designs. This window of SlotComposer shows one static module on the left and three partial modules on the right side. Bus-macros are shown as small boxes connecting these modules together. The absolute placement of bus-macros and all modules is represented by the grid position measured in slices.
4.6. Firmware stack developed for the Erlangen Slot Machine. A reconfigurable application running on the ESM includes a custom scheduler and placer as well as a pool of hardware tasks. Hardware tasks are partially reconfigurable modules with an additional control interface.
4.7. Time line showing the arrival of a task request, its reconfiguration and execution time. The execution is enabled separately through the enable signal E_i. An example of an active device supporting partial reconfiguration at run-time is shown in Figure 1 c).
4.8. State diagram showing the life cycle of a hardware task.
4.9. Generated hardware task set consisting of three modules (HW-T1, HW-T2, HW-T3) with different module widths. All signals between the static part and modules pass through bus-macros.
4.10. Measured reconfiguration times for generated hardware tasks with different module widths. A constant time overhead of 3 ms resulted from the software layer.
4.11. Schedule produced for the example problem by our scheduling simulator. The brightly shaded rectangular areas stand for the reconfiguration times R_i, the green ones for the core execution times C_i.
4.12. Simplified structure of a video processing application designed for the ESM platform. In its basic form, the video processing module is connected to an input and output module. These three modules reside on the main FPGA and require external memory. The communication to and from the main FPGA is controlled by the Crossbar FPGA.
5.1. A modular architecture for video streaming as implemented on the slot-based structure of the ESM.
5.2. The data flow chart of the overall system with resource bindings. The deinterlacing must be done on the main FPGA as the single SDRAM module at the Crossbar does not support the required throughput.
5.3. Implementation of partially reconfigurable image processing engines on the ESM. The video signals occupy more than half of the Crossbar I/Os. The blue shaded slots are assigned to the static part and the red shaded region is used by the reconfigurable video module, also called engine. The seven slots on the right and the two connected SRAMs can be used for other reconfigurable or static hardware modules.
5.4. Basic image filters implemented as partially reconfigurable modules on the ESM.
5.5. The Edge-Engine enhances the camera data by displaying the edges in the image and marking the lane with green lines. The red pixels indicate possible obstacles as they will appear as horizontal edges. The more red pixels are shown over an object, the more likely an obstacle was found.
5.6. The result of the Taillight-Engine demonstration at CeBIT 2008, more details in Section 5.2.7. All found lights are highlighted by a green spot each. If two lights match in size and brightness, are located at the same height, and have car-like motion vectors, they are identified as a pair belonging together, and a red car marker is placed between them.
5.7. From left to right: a) Light pattern matrix in Spotlight-Engine, b) applied to taillight in image and c) applied to a lane in image.
5.8. HW/SW partitioning of the Taillight-Engine on the ESM.
5.9. The implementations of the video applications on the Virtex-II 6000 by comparison: Contrast filter (left), EdgeEngine (center), and TaillightEngine (right).
5.10. CeBIT 2008 group picture with Prof. Walter Stechele, Rafael Pohlig, Christopher Claus, Matthias Kovatsch and Mateusz Majer (from left to right).
5.11. Direct point rendering is the simplest 3D rendering method. The points are assumed to be samples of a surface and are transformed to the 2D screen space. The necessary pipeline is a simplified polygon rendering pipeline.
5.12. Overview of the main signal flow through the point rendering pipeline. The ESM implementation of the point-based rendering pipeline is shown in Figure 5.14.
5.13. Design overview of the main signal flow on the ESM platform. Annotated are the signal bit widths and clock frequencies. The implementation of the point-based rendering pipeline is shown in Figure 5.14.
5.14. The complete point-rendering pipeline implemented on the Virtex-II 6000 FPGA. Data flows from top to bottom and includes point data and control signals. Each pipeline element can be stalled.
5.15. Rendered Venus point model screenshots a) without and b) with shading (45,357 points). The pictures were directly taken from the VGA output of the ESM platform shown in Figure 3.5.
List of Tables
2.1. Conceptual differences between reconfigurable hardware and microprocessors depicted with the help of architectural key parameters.
2.2. Technical data of the Virtex-II 6000 FPGA from Xilinx [25].
3.1. Theoretical data bandwidth and signal latency for the four supported communication schemes. Variable CP denotes the number of RMB Cross Points that are traversed.
3.2. Interface of the main FPGA.
3.3. Interface between the PowerPC and the Crossbar FPGA.
3.4. Signal interface of the Crossbar FPGA.
4.1. Reconfiguration overhead on the ESM platform for differently sized partial modules. All hardware tasks are loaded from flash memory directly into the main FPGA. The software overhead is very small because only one command has to be sent to the Reconfiguration Manager to load a partial module from flash memory.
5.1. State vector information of the point-rendering pipeline which controls the complete rendering process.
5.2. Hardware resource utilization for the point-rendering pipeline. No details on the clock frequency can be given for Backface-Culling, Clipping and Color Selection since these modules are too small. The last column (Vars.) shows which multiplier variant was used to implement the multiplications of the transformation.
B.1. Technical specification of the ESM Babyboard.
B.2. Technical specification of the ESM Motherboard.
Bibliography
[1] S. Fekete, T. Kamphans, N. Schweer, C. Tessars, J. van der Veen, A. Ahmadinia, J. Angermeier, D. Koch, M. Majer, and J. Teich, ReCoNodes - Optimization Methods for Module Scheduling and Placement on Reconfigurable Hardware Devices, M. Platzner, J. Teich, and N. Wehn, Eds. Springer, Heidelberg, Feb. 2010.
[2] J. Angermeier, C. Bobda, M. Majer, and J. Teich, Erlangen Slot Machine: An FPGA-Based Dynamically Reconfigurable Computing Platform, M. Platzner, J. Teich, and N. Wehn, Eds. Springer, Heidelberg, Feb. 2010.
[3] "SPP1148 Reconfigurable Computing Priority Program," Online: http://www12.informatik.uni-erlangen.de/spprr, 2008.
[4] M. Majer, J. Teich, and C. Bobda, "ESM - the Erlangen Slot Machine," http://www.r-space.de, 2008.
[5] U. Batzer, "Hardware-Software-Co-Design von Echtzeitbilderkennungsalgorithmen auf die Erlangen Slot Machine (ESM)," Project Thesis, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Apr. 2008.
[6] M. Kovatsch, "Entwurf und Test von Speicherinterfaces für Module für die Bildverarbeitung auf der Erlangen Slot Machine (ESM)," Project Thesis, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, May 2008.
[7] B. Kleinert, "Kernelmodularchitektur für den Rekonfigurationsmanager der Erlangen Slot Machine (ESM)," Studienarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Aug. 2007.
[8] T. Stark, "Entwurf und Implementierung einer Treiberarchitektur und ESM-Shell für die Erlangen Slot Machine (ESM)," Diplomarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Feb. 2007.
[9] P. Shterev, "SlotComposer - Design and Implementation of an Automated Design Flow for Partially Reconfigurable FPGA Modules," Master Thesis, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Sep. 2007.
[10] J. Grembler, "Dynamisch partiell rekonfigurierbare Videomodule auf der Erlangen Slot Machine (ESM)," Diplomarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Sep. 2006.
[11] C. Freiberger, "Reconfiguration Manager for the Erlangen Slot Machine (ESM)," Diplomarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Oct. 2006.
[12] F. Reimann, "Entwurf und Implementierung eines Reconfigurable Multiple Bus für die Erlangen Slot Machine (ESM)," Studienarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Aug. 2006.
[13] P. Asemann, "Reconfiguration Manager for the Erlangen Slot Machine (ESM)," Diplomarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Oct. 2005.
[14] A. Linarth, "Entwurf und Entwicklung eines Motherboards für die Erlangen Slot Machine (ESM)," Project Thesis, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, May 2005.
[15] T. Haller, "Entwurf und Entwicklung einer FPGA Platine für dynamische Rekonfiguration," Studienarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Jun. 2005.
[16] S. L. Steinfadt and J. Baker, "GPU Computing for the SWAMP Sequence Alignment," in Ohio Collaborative Conference on Bioinformatics, 2008, pp. 1-15.
[17] Board Specification - Tesla C1060 Computing Processor Board, NVIDIA Corporation, 2008.
[18] BDTI Communication Benchmark (OFDM) Results, Berkeley Design Technologies Inc., 2008.
[19] T. El-Ghazawi, "High-Level Languages for Reconfigurable Computers: A Comparative View," in ARSC Symposium on Multicore and New Processing Technologies, 2007.
[20] XILINX JTRS/SDR Announcement, Xilinx Inc., 2006.
[21] Early Access Partial Reconfiguration User Guide, UG208, Xilinx Inc., Mar. 2006.
[22] V. Baumgarte, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP - a self-reconfigurable data processing architecture," in ERSA, Las Vegas, Nevada, Jun. 2001, pp. 167-184.
[23] T. Toi, N. Nakamura, Y. Kato, T. Awashima, K. Wakabayashi, and L. Jing, "High-level synthesis challenges and solutions for a dynamically reconfigurable processor," in Proceedings of the International Conference on Computer-Aided Design (ICCAD), 2006.
[24] A. Moonen, C. Bartels, M. Bekooij, R. van den Berg, H. Bhullar, K. Goossens, P. Groeneveld, J. Huiskens, and J. van Meerbergen, "Comparison of an Aethereal network on chip and traditional interconnects - two case studies," in VLSI-SoC: Research Trends in VLSI and Systems on Chip, ser. IFIP International Federation for Information Processing, G. De Micheli, S. Mir, and R. Reis, Eds. Springer, 2007, no. 249.
[25] Virtex-II Platform FPGA User Guide V2.0, Xilinx Inc., 2005.
[26] Virtex-4 User Guide V1.5, Xilinx Inc., 2006.
[27] Virtex-5 User Guide V1.2, Xilinx Inc., 2006.
[28] N. Dorairaj, E. Shiflet, and M. Goosman, "PlanAhead Software as a Platform for Partial Reconfiguration," XCell Journal, vol. 4, pp. 68-71, 2005.
[29] RC2000 Development Board, http://www.celoxica.com/products/boards/rc2000.asp, Celoxica Ltd., 2004.
[30] ADM-XRC-II Xilinx Virtex-II PMC, online, Xilinx Inc., http://www.alpha-data.com/adm-xrc-ii.html, Alpha Data Ltd., 2002.
[31] C. Steiger, H. Walder, M. Platzner, and L. Thiele, "Online scheduling and placement of real-time tasks to partially reconfigurable devices," in Proceedings of the 24th International Real-Time Systems Symposium, Cancun, Mexico, Dec. 2003, pp. 224-235.
[32] H. Walder, S. Nobs, and M. Platzner, "Xf-board: A prototyping platform for reconfigurable hardware operating systems," in Proceedings of the 4th International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA). CSREA, 2004.
[33] H. Walder, C. Steiger, and M. Platzner, "Fast online task placement on FPGAs: Free space partitioning and 2d-hashing," in Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS) / Reconfigurable Architectures Workshop (RAW). IEEE Computer Society, Apr. 2003, pp. 178-186.
[34] H. Kalte, M. Porrmann, and U. Rückert, "A prototyping platform for dynamically reconfigurable system on chip designs," in Proceedings of the IEEE Workshop Heterogeneous reconfigurable Systems on Chip (SoC), Hamburg, Germany, Sep. 2002.
[35] C. Bobda, A. Ahmadinia, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "DyNoC: A Dynamic Infrastructure for Communication in Dynamically Reconfigurable Devices," in Proceedings of the International Conference on Field-Programmable Logic and Applications, Tampere, Finland, Aug. 2005, pp. 153-158.
[36] S. P. Fekete, J. C. van der Veen, J. Angermeier, D. Göhringer, M. Majer, and J. Teich, "Scheduling and communication-aware mapping of HW/SW modules for dynamically and partially reconfigurable SoC architectures," in ARCS '07 - 20th International Conference on Architecture of Computing Systems 2007. VDE-Verlag, Berlin, 2007, pp. 151-160.
[37] S. Fekete, J. van der Veen, A. Ahmadinia, D. Göhringer, M. Majer, and J. Teich, "Offline and Online Aspects of Defragmenting the Module Layout of a Partially Reconfigurable Device," IEEE Transactions on VLSI, vol. 16, no. 9, pp. 1210-1219, 2008.
[38] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, and J. Teich, "The Erlangen Slot Machine (ESM): A Flexible Platform for Dynamic Reconfigurable Computing," in Board Demo at the University Booth at Design, Automation and Test in Europe (DATE 2005), Munich, Germany, Mar. 2005.
[39] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, S. P. Fekete, and J. van der Veen, "The Erlangen Slot Machine: A Highly Flexible FPGA-Based Reconfigurable Platform," in Proceeding 2005 IEEE Symposium on Field-Programmable Custom Computing Machines, Apr. 2005, pp. 319-320.
[40] M. Majer, "An FPGA-Based Dynamically Reconfigurable Platform: from Concept to Realization," in Proceedings of 16th International Conference on Field Programmable Logic and Applications, Madrid, Spain, Aug. 2006, pp. 963-964.
[41] J. Angermeier, D. Göhringer, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "The Erlangen Slot Machine - A Platform for Interdisciplinary Research in Reconfigurable Computing," it - Information Technology, vol. 49, no. 3, pp. 143-148, 2007.
[42] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda, "The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-Based Computer," Journal of VLSI Signal Processing Systems, vol. 47, no. 1, pp. 15-31, Mar. 2007.
[43] M. Majer, A. Ahmadinia, C. Bobda, and J. Teich, "A Flexible Reconfiguration Manager for the Erlangen Slot Machine," in Dynamically Reconfigurable Systems Workshop. Frankfurt (Main), Germany: Springer, Mar. 2006, pp. 183-194.
[44] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, and J. Teich, "Increasing the Flexibility in FPGA-Based Reconfigurable Platforms: The Erlangen Slot Machine," in IEEE 2005 Conference on Field-Programmable Technology (FPT), Singapore, Singapore, Dec. 2005, pp. 37-42.
[45] D. Göhringer, M. Majer, and J. Teich, "Bridging the Gap between Relocatability and Available Technology: The Erlangen Slot Machine," in Dynamically Reconfigurable Architectures, ser. Dagstuhl Seminar Proceedings, P. M. Athanas, J. Becker, G. Brebner, and J. Teich, Eds., no. 06141. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.
[46] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "A Practical Approach for Circuit Routing on Dynamic Reconfigurable Devices," in Proceedings of the 16th IEEE International Workshop on Rapid System Prototyping (RSP), Montreal, Canada, Jun. 2005, pp. 84-90.
[47] S. Fekete, J. van der Veen, M. Majer, and J. Teich, "Minimizing communication cost for reconfigurable slot modules," in Proceedings of 16th International Conference on Field Programmable Logic and Applications (FPL06), Madrid, Spain, Aug. 2006, pp. 535-540.
[48] C. Bobda, A. Ahmadinia, M. Majer, J. Ding, and J. Teich, "Modular Video Streaming on a Reconfigurable Platform," in IFIP VLSI-SOC 2005, Perth, Australia, Oct. 2005, pp. 103-108.
[49] J. Angermeier, U. Batzer, M. Majer, J. Teich, C. Claus, and W. Stechele, "Reconfigurable HW/SW Architecture of a Real-Time Driver Assistance System," in Proceedings of the Fourth International Workshop on Applied Reconfigurable Computing (ARC), ser. Lecture Notes in Computer Science (LNCS). London, United Kingdom: Springer, Mar. 2008, pp. 149-159.
[50] M. Majer, S. Wildermann, J. Angermeier, S. Hanke, and J. Teich, "Co-Design Architecture and Implementation for Point-Based Rendering on FPGAs," in Proc. 19th IEEE/IFIP International Symposium on Rapid System Prototyping (RSP 2008), Monterey, USA, Jun. 2008, pp. 142-148.
[51] C. Bobda, M. Majer, D. Koch, A. Ahmadinia, and J. Teich, "A Dynamic NoC Approach for Communication in Reconfigurable Devices," in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL), ser. Lecture Notes in Computer Science (LNCS), vol. 3203. Antwerp, Belgium: Springer, Aug. 2004, pp. 1032-1036.
[52] A. Ahmadinia, C. Bobda, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "DyNoC: A Dynamic Infrastructure for Communication in Dynamically Reconfigurable Devices," in Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), Tampere, Finland, Aug. 2005, pp. 153-158.
[53] M. Majer, C. Bobda, A. Ahmadinia, and J. Teich, "Packet Routing in Dynamically Changing Networks on Chip," in IPDPS 12th Reconfigurable Architectures Workshop (RAW 2005), Denver, USA, Apr. 2005, pp. 154-160.
[54] C. Bobda, M. Majer, D. Koch, A. Ahmadinia, and J. Teich, "Task Scheduling for Heterogeneous Reconfigurable Computers," in Proceedings of the 17th Symposium on Integrated Circuits and Systems Design (SBCCI). Pernambuco, Brazil: ACM Press, Sep. 2004, pp. 22-27.
[55] J. van der Veen, S. Fekete, M. Majer, A. Ahmadinia, C. Bobda, F. Hannig, and J. Teich, "Defragmenting the Module Layout of a Partially Reconfigurable Device," in Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA 2005), Las Vegas, NV, USA, Jun. 2005, pp. 92-101.
[56] A. DeHon, "DPGA Utilization and Application," in Proceedings of the 1996 ACM fourth International Symposium on Field-Programmable Gate Arrays, 1996, pp. 115-121.
[57] XC2000 Logic Cell Array Families, Xilinx Inc., 1985.
[58] XC6200 Field Programmable Gate Arrays Data Sheet, Xilinx Inc., 1997.
[59] RTOS industry leaders recognize Virtex-II Pro PowerPC and MicroBlaze as leading FPGA processing solutions, Press release, http://www.xilinx.com/prs_rls/partners/03165rtos.htm, Xilinx Inc., 2003.
[60] P. Lysaght, "Dynamic Reconfiguration of FPGAs," in More FPGAs, W. Moore and W. Luk, Eds. Abingdon EE & CS Books, England, 1994.
[61] P. Lysaght and J. Stockwood, "A Framework for Reconfigurable Computing: Task Scheduling and Context Management," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 3, pp. 381-390, Sep. 1996.
[62] "ProASIC3 FPGA," Online: Actel Corp., http://www.actel.com/proasic3/, 2008.
[63] "LatticeXP FPGA," Online: Lattice Semiconductor Corp., http://www.latticesemi.com/products/fpga/xp/, 2008.
[64] "Axcelerator Antifuse FPGA," Online: Actel Corp., http://www.actel.com/products/axcelerator/, 2008.
[65] A. DeHon, "Balancing interconnect and computation in a reconfigurable computing array," in Proceedings of the 1999 ACM/SIGDA seventh International Symposium on Field Programmable Gate Arrays, 1999, pp. 69-78.
[66] D. Koch, C. Beckhoff, and J. Teich, "A Communication Architecture for Complex Runtime Reconfigurable Systems and its Implementation on Spartan-3 FPGAs," in Proceedings of the 17th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). Monterey, California, USA: ACM, Feb. 2009, pp. 233-236.
[67] F. Cancare, M. D. Santambrogio, and D. Sciuto, "A design flow tailored for self dynamic reconfigurable architecture," in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2008, pp. 1-8.
[68] C. Ebeling, D. Cronquist, and P. Franklin, "RaPiD - Reconfigurable Pipelined Datapath," in International Workshop on Field-Programmable Logic and Applications (FPL), Darmstadt, Germany, vol. 1142. Springer Lecture Notes in Computer Science, 1996, pp. 126-135.
[69] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer, "PipeRench: a coprocessor for streaming multimedia acceleration," in Proc. ISCA, 1999.
[70] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins, "ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix," in Proceedings Field-Programmable Logic and Applications, vol. 2778, 2003, pp. 61-70.
[71] V. Baumgarte, G. Ehlers, F. May, A. Nückel, M. Vorbach, and M. Weinhardt, "PACT XPP - A self-reconfigurable data processing architecture," Journal of Supercomputing, vol. 26, no. 2, pp. 167-184, 2003.
[72] D. Kissler, A. Strawetz, F. Hannig, and J. Teich, "Power-efficient Reconfiguration Control in Coarse-Grained Dynamically Reconfigurable Architectures," in Proceedings of the 18th International Workshop on Power and Timing Modeling, Optimization, and Simulation (PATMOS'08), ser. Lecture Notes in Computer Science (LNCS), vol. 5349. Lisbon, Portugal: Springer, Sep. 2008, pp. 307-317.
[73] D. Kissler, A. Strawetz, F. Hannig, and J. Teich, "Power-efficient Reconfiguration Control in Coarse-grained Dynamically Reconfigurable Architectures," Journal of Low Power Electronics, vol. 5, pp. 96-105, 2009. [Online]. Available: http://www.ingentaconnect.com/content/asp/jolpe/2009/
[74] J. M. Arnold, D. A. Buell, D. T. Hoang, D. V. Pryor, N. Shirazi, and M. R. Thistle, "The Splash 2 Processor and Applications," in IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1993, pp. 482-485.
[75] "Celoxica RCHTX Accelerator Card," Online: Celoxica Ltd., http://www.celoxica.com/technology/accelerator.html, 2008.
[76] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, "The Garp Architecture and C Compiler," IEEE Computer, vol. 33, no. 4, pp. 62-69, Apr. 2000.
[77] T. Miyamori and K. Olukotun, "REMARC: Reconfigurable Multimedia Array Coprocessor," in Proceedings ACM International Symposium on Field-Programmable Gate Arrays, 1998, pp. 261-270.
[78] H. Singh, M. H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. C. Filho, "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Transactions on Computers, vol. 49, pp. 465-481, May 2000.
[79] S5500 Data Sheet, Datasheet 5500-0001-000, Rev. 1.1, Stretch Inc., 2005.
[80] R. D. Wittig and P. Chow, "OneChip: An FPGA Processor With Reconfigurable Logic," in Proceedings IEEE Symposium on FPGAs for Custom Computing Machines, 1996, pp. 126-135.
[81] M. J. Wirthlin and B. L. Hutchings, "A Dynamic Instruction Set Computer," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1995, pp. 99-107.
[82] S. Hauck, T. W. Fry, M. M. Hosler, and J. P. Kao, "The Chimaera Reconfigurable Functional Unit," in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 1997, pp. 206-217.
[83] S. Vassiliadis, G. Gaydadjiev, and G. Kuzmanov, "The MOLEN polymorphic processor," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363-1375, 2004.
[84] Virtex-II Pro and Virtex-II Pro X FPGA User Guide V4.0, Xilinx Inc., 2005.
[85] MicroBlaze Processor Reference Guide V5.3, Xilinx Inc., 2006.
[86] "Nios II Processor," Online: Altera Corp., http://www.altera.com/nios/, 2008.
[87] "CoreMP7 soft ARM7 processor," Online: Actel Corp., http://www.actel.com/products/ARMinFusion/, 2008.
[88] J. E. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. H. Touati, and P. Boucard, "Programmable active memories: reconfigurable systems come of age," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, pp. 56-69, Mar. 1996.
[89] D. T. Hoang, "Searching genetic databases on Splash 2," in IEEE Workshop on FPGAs for Custom Computing Machines, 1993, pp. 185-191.
[90] C. Chang, K. Kuusilinna, B. Richards, and R. W. Brodersen, "Implementation of BEE: a real-time large-scale hardware emulation engine," in Proceedings of the 2003 ACM/SIGDA eleventh International Symposium on Field Programmable Gate Arrays, 2003.
[91] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A high-end reconfigurable computing system," IEEE Design and Test, vol. 22, no. 2, pp. 114-125, 2005.
[92] J. Wawrzynek, "Adventures with a reconfigurable research platform," in Proceedings of the 17th International Conference on Field Programmable Logic and Applications, 2007, pp. 3-4.
[93] J. Wawrzynek, M. Oskin, C. Kozyrakis, D. Chiou, D. A. Patterson, S.-L. Lu, J. C. Hoe, and K. Asanovic, "RAMP: A Research Accelerator for Multiple Processors," UCB/EECS-2006-158, EECS Department, University of California, Tech. Rep., 2006.
[94] A. Krasnov, A. Schultz, J. Wawrzynek, G. Gibeling, and P.-Y. Droz, "RAMP Blue: A Message-Passing Manycore System In FPGAs," in Proceedings of the 17th International Conference on Field Programmable Logic and Applications, 2007, pp. 54-61.
[95] Application Notes 151. Virtex Series Configuration Architecture User Guide, Xilinx Inc., 2000.
[96] A. Ahmadinia, C. Bobda, S. Fekete, J. Teich, and J. van der Veen, "Optimal routing-conscious dynamic placement for reconfigurable devices," in Proceedings of International Conference on Field-Programmable Logic and Applications, ser. Lecture Notes in Computer Science (LNCS), vol. 3203. Antwerp, Belgium: Springer, Aug. 2004, pp. 847-851.
[97] A. Ahmadinia, C. Bobda, S. Fekete, J. Teich, and J. van der Veen, "Optimal free-space management and routing-conscious dynamic placement for reconfigurable computing," IEEE Transactions on Computers, vol. 56, no. 3, pp. 673-680, 2007.
[98] K. Bazargan, R. Kastner, and M. Sarrafzadeh, "Fast template placement for reconfigurable computing systems," IEEE Design and Test of Computers, vol. 17, no. 1, pp. 68-83, 2000.
[99] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, S. P. Fekete, and J. van der Veen, "The Erlangen Slot Machine: A Highly Flexible FPGA-Based Reconfigurable Platform," in Proceeding IEEE Symposium on Field-Programmable Custom Computing Machines, 2005, pp. 319-320.
[100] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, and J. Teich, "Increasing the Flexibility in FPGA-Based Reconfigurable Platforms: The Erlangen Slot Machine," in Proceedings of the IEEE Conference on Field-Programmable Technology, Singapore, Singapore, Dec. 2005, pp. 37-42.
[101] J. Angermeier, D. Göhringer, M. Majer, J. Teich, S. P. Fekete, and J. van der Veen, "The Erlangen Slot Machine - A Platform for Interdisciplinary Research in Dynamically Reconfigurable Computing," Information Technology, vol. 49, pp. 143-148, 2007.
[102] Y. Krasteva, A. Jimeno, E. Torre, and T. Riesgo, "Straight method for reallocation of complex cores by dynamic reconfiguration in Virtex II FPGAs," in Proceedings of the 16th IEEE International Workshop on Rapid System Prototyping, Montreal, Canada, Jun. 2005, pp. 77-83.
[103] M. Majer, J. Teich, A. Ahmadinia, and C. Bobda, "The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-Based Computer," Journal of VLSI Signal Processing Systems, vol. 47, pp. 15-31, Mar. 2007.
[104] Xilinx Application Notes 151: Virtex Series Configuration Architecture User Guide, online, http://www.xilinx.com, Xilinx Inc., 2003.
[105] Application Notes 290. Two Flows for Partial Reconfiguration: Module Based or Difference Based, Xilinx Inc., 2004.
[106] P. Lysaght, B. Blodget, J. Mason, J. Young, and B. Bridgeford, "Enhanced architectures, design methodologies and CAD tools for dynamic reconfiguration of Xilinx FPGAs," in Proceedings of 16th International Conference on Field Programmable Logic and Applications (FPL06), Madrid, Spain, Aug. 2006, pp. 1-6.
[107] A. Ahmadinia, J. Ding, C. Bobda, and J. Teich, "Design and implementation of reconfigurable multiple bus on chip (RMBoC)," University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Tech. Rep. 02-2004, Nov. 2004.
[108] S. Fekete, J. van der Veen, M. Majer, and J. Teich, "Minimizing communication cost for reconfigurable slot modules," in Proceedings of 16th International Conference on Field Programmable Logic and Applications (FPL06), Madrid, Spain, Aug. 2006.
[109] H. A. ElGindy, A. K. Somani, H. Schröder, H. Schmeck, and A. Spray, "RMB - a reconfigurable multiple bus network," in Proceedings of the Second International Symposium on High-Performance Computer Architecture (HPCA-2), San Jose, California, USA, Feb. 1996, pp. 108-117.
[110] R. Vaidyanathan and J. L. Trahan, Dynamic Reconfiguration: Architectures and Algorithms. Kluwer Academic Publishers, 2003.
[111] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "A practical approach for circuit routing on dynamic reconfigurable devices," in Proceedings of the 16th IEEE International Workshop on Rapid System Prototyping (RSP), Montreal, Canada, Jun. 2005, pp. 84-90.
[112] "Embedded Linux Development Kit for the PowerPC Architecture," Online: DENX Software Engineering, http://www.denx.de/wiki/DULG/ELDK, 2008.
[113] "The U-Boot Universal Bootloader," Online: http://www.denx.de/wiki/U-Boot, 2008.
[114] SAA7113H 9-bit video input processor, Product data sheet, Rev. 02, Philips Semiconductors, 2005.
[115] ESM Motherboard Schematics V1.0, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, 2006.
[116] ADV7125, Triple 8-Bit High Speed Video DAC, Rev. 01, Analog Devices, 2005.
[117] Partial Reconfiguration Software Users Guide: Partial Reconfiguration of Virtex 4 using PlanAhead 8.1, Xilinx Inc., 2007.
[118] PlanAhead User Guide 8.1, online, http://www.xilinx.com/support/documentation/sw_manuals/PlanAhead_UserGuide.pdf, Xilinx Inc., 2007.
[119] R. Scholz, "Adapting and Automating XILINX's Partial Reconfiguration Flow for Multiple Module Implementations," in Reconfigurable Computing: Architectures, Tools and Applications, ARC Workshop, ser. Lecture Notes in Computer Science, vol. 4419. Springer, 2007, pp. 122-129.
[120] Xilinx University Program Virtex-II Pro Development System, online, Xilinx Inc., http://www.xilinx.com/products/devkits/XUPV2P.htm, 2005.
[121] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," in CODES/CASHE '98: Proceedings of the 6th International Workshop on Hardware/Software Codesign. Washington, DC, USA: IEEE Computer Society, 1998, pp. 97-101.
[122] C. Bobda, A. Ahmadinia, M. Majer, J. Ding, and J. Teich, "Modular video streaming on a reconfigurable platform," in Proceedings of the IFIP International Conference on Very Large Scale Integration, Perth, Australia, Oct. 2005, pp. 103-108.
[123] R. Gonzalez and R. Woods, Digital Image Processing. Prentice Hall, 2002.
[124] R. Polig, "Modularisierung bestehender Videofilter Engines aus dem Autovision Design für die Echtzeitbildverarbeitung auf der Erlangen Slot Machine (ESM)," Studienarbeit, Technische Universität München, Nov. 2007.
[125] N. Alt, "TaillightEngine - Design und Implementierung," Bachelor Thesis, Technische Universität München, Aug. 2006.
[126] N. Alt, C. Claus, and W. Stechele, "Hardware/Software architecture of an algorithm for vision-based real-time vehicle detection in dark environments," in DATE '08: Proceedings of the Conference on Design, Automation and Test in Europe, Munich, Germany, 2008, pp. 176-181.
[127] K. Benkrid, S. Sukhsawas, D. Crookes, and A. Benkrid, "An FPGA-Based Image Connected Component Labeller," in Proceedings of Field Programmable Logic and Application, 13th International Conference (FPL 2003), Lisbon, Portugal, Sep. 2003, pp. 1012-1015.
[128] Virtex-II Platform FPGAs: Complete Data Sheet, Xilinx, Inc., 2005.
[129] S. Hanke, "Entwurf und Implementierung einer Point-Rendering-Pipeline auf einem rekonfigurierbaren FPGA-System," Diplomarbeit, University of Erlangen-Nuremberg, Department of CS 12, Hardware-Software-Co-Design, Aug. 2007.
[130] M. Levoy and T. Whitted, "The Use of Points as a Display Primitive," The University of North Carolina at Chapel Hill, Department of Computer Science, Tech. Rep. TR 85-022, 1985.
[131] J. P. Grossman and W. J. Dally, "Point Sample Rendering," in Proceedings of the Eurographics Rendering Workshop, 1998, pp. 181-192.
[132] The Stanford 3D Scanning Repository, http://graphics.stanford.edu/data/3Dscanrep/, Stanford University, Aug. 2007.
[133] H. Pfister, M. Zwicker, J. van Baar, and M. Gross, "Surfels: Surface Elements as Rendering Primitives," in Proceedings of SIGGRAPH 2000, Computer Graphics, K. Akeley, Ed. ACM Press / ACM SIGGRAPH / Addison Wesley Longman, 2000, pp. 335-342.
[134] S. Rusinkiewicz and M. Levoy, "QSplat: a multiresolution point rendering system for large meshes," in SIGGRAPH '00: Proceedings of the 27th annual conference on Computer graphics and interactive techniques. New York, NY, USA: ACM Press/Addison-Wesley Publishing Co., 2000, pp. 343-352.
[135] M. Zwicker, H. Pfister, J. van Baar, and M. Gross, "Surface splatting," in Proceedings of the 28th annual Conference on Computer Graphics and Interactive Techniques. ACM Press, 2001, pp. 371-378.
[136] S. Rusinkiewicz and M. Levoy, "Streaming QSplat: a viewer for networked visualization of large, dense models," in Proceedings of the 2001 Symposium on Interactive 3D Graphics. New York, NY, USA: ACM Press, 2001, pp. 63-68.
[137] L. Coconu and H.-C. Hege, "Hardware-accelerated point-based rendering of complex scenes," in Proceedings of the 13th Eurographics workshop on Rendering. Pisa, Italy: Eurographics Association, 2002, pp. 43-52.
[138] M. Botsch and L. Kobbelt, "High-quality point-based rendering on modern GPUs," in Proceedings of the 11th Pacific Conference on Computer Graphics and Applications, 2003, pp. 335-343.
[139] C. Dachsbacher, C. Vogelgsang, and M. Stamminger, "Sequential point trees," in Proceedings of the ACM SIGGRAPH 2003. New York, NY, USA: ACM Press, 2003, pp. 657-662.
[140] T. Weyrich, C. Flaig, S. Heinzle, S. Mall, T. Aila, K. Rohrer, D. B. Fasnacht, N. Felber, S. Oetiker, H. Kaeslin, M. Botsch, and M. Gross, "A hardware architecture for surface splatting," in Proceedings of ACM SIGGRAPH. San Diego, California, USA: ACM Press, 2007, pp. 90-100.
[141] M. Botsch, A. Wiratanaya, and L. Kobbelt, "Efficient high quality rendering of point sampled geometry," in EGRW '02: Proceedings of the 13th Eurographics workshop on Rendering. Aire-la-Ville, Switzerland: Eurographics Association, 2002, pp. 53-64.
[142] M. Sainz and R. Pajarola, "Point-based rendering techniques," Proceedings of Computers & Graphics, vol. 28, pp. 869-879, Dec. 2004.
[143] A. Herout and P. Zemcík, "Hardware Pipeline for Rendering Clouds of Circular Points," in Proceedings of WSCG 2005. University of West Bohemia in Pilsen, 2005, pp. 17-22.
[144] I. Carlbom and J. Paciorek, "Planar Geometric Projections and Viewing Transformations," ACM Comput. Surv., vol. 10, pp. 465-502, 1978.
[145] E. Lapidous and G. Jiao, "Optimal depth buffer for low-cost graphics hardware," in HWWS '99: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware. New York, NY, USA: ACM Press, 1999, pp. 67-73.
[146] M. Stamminger and G. Drettakis, "Interactive Sampling and Rendering for Complex and Procedural Geometry," in Proceedings of the 12th Eurographics Workshop on Rendering Techniques. Springer-Verlag, 2001, pp. 151-162.
[147] D. Koch, F. Reimann, T. Streichert, C. Haubelt, and J. Teich, ReCoNets - Design Methodology for Embedded Systems Consisting of Small Networks of Reconfigurable Nodes and Connections, M. Platzner, J. Teich, and N. Wehn, Eds. Springer, Heidelberg, Feb. 2010.
[148] E. Lübbers and M. Platzner, "A portable abstraction layer for hardware threads," in Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2008, pp. 17-22.
[149] J. Angermeier, M. Majer, J. Teich, L. Braun, T. Schwalb, P. Graf, M. Hübner, J. Becker, E. Lübbers, M. Platzner, C. Claus, W. Stechele, A. Herkersdorf, M. Rullmann, and R. Merker, "Fine grain reconfigurable architectures," in Proceedings of International Conference on Field-Programmable Logic and Applications (FPL), 2008, p. 348.
[150] J. Angermeier, U. Batzer, M. Majer, J. Teich, C. Claus, and W. Stechele, "Reconfigurable HW/SW Architecture of a Real-Time Driver Assistance System," in Proceedings of ARC, 2008, pp. 148-158.
[151] C. Claus, W. Stechele, M. Kovatsch, J. Angermeier, and J. Teich, "A comparison of embedded reconfigurable video-processing architectures," in Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2008, pp. 587-590.
[152] M. Rullmann and R. Merker, "A Reconfiguration Aware Circuit Mapper for FPGAs," in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), 2007, pp. 1-8.
[153] F. Dittmann, E. Weber, and N. Montealegre, "Implementation of the reconfiguration port scheduling on the Erlangen Slot Machine," in Proceedings of the 17th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2009, p. 282.
[154] F. Madlener, S. A. Huss, and A. Biedermann, "RecDEVS: A Comprehensive Model of Computation for Dynamically Reconfigurable Hardware Systems," in Proceedings of the 4th IFAC Workshop on Discrete-Event System Design (DESDes'09), Oct. 2009.
[155] C. Bolchini, D. Quarta, and M. D. Santambrogio, "SEU Mitigation for SRAM-based FPGAs through Dynamic Partial Reconfiguration," in Proceedings of the 17th ACM Great Lakes symposium on VLSI, Stresa-Lago Maggiore, Italy, 2007, pp. 55-60.
Curriculum Vitae
Mateusz Majer received his diploma degree (Dipl.-Ing.) in Electrical Engineering and Computer Science from the Technische Universität Darmstadt, Germany, in September 2003. Alongside his studies, he gained industrial research experience during an internship at PACT XPP Technologies in München (2001) and during his diploma thesis at Lucent Technologies in Nürnberg (2003). In October 2003 he joined the Chair of Hardware-Software-Co-Design at the University of Erlangen-Nürnberg, Germany, headed by Professor Jürgen Teich, as a researcher and PhD candidate. His main research interests include the domain of Reconfigurable Computing, the efficient usage of the FPGA structures for intra-module communication, and operating system support for partial reconfiguration. Moreover, Mateusz Majer has been a reviewer for several international conferences and journals, including the IEEE Transactions on Very Large Scale Integration Systems.