Terrain Rendering Engine (TRE):
Cell Broadband Engine Optimized Real-time Ray-caster
Barry Minor, IBM
11501 Burnet Rd.
Austin, TX 78758
[email protected]

Gordon Fossum, IBM
11501 Burnet Rd.
Austin, TX 78758
[email protected]

Van To, IBM
11501 Burnet Rd.
Austin, TX 78758
[email protected]
Abstract
The development of the Cell Broadband Engine
(CBE), together with the advent of commercially
available high resolution satellite images and
digital elevation models (DEM), brings new
possibilities to visualizing the world around us.
We introduce the Terrain Rendering Engine
(TRE), a client/server ray-casting system in
which the client, through a thin layer of network
interfaces, directs the CBE server to render and
deliver 3-dimensional (3D) terrain views at
interactive speed. The TRE server, optimized for
the Cell Broadband Engine Architecture, is a
ray-casting engine that can compute high
resolution, high quality 3D views of terrain in
real-time without the assistance of a Graphics
Processing Unit (GPU).
Moreover, the
flexibility of the client/server model allows quick
bring-up of the client interface on other systems.
Keywords
Cell Broadband Engine (CBE), "Cell", Synergistic Processor Element (SPE), Single Instruction Multiple Data (SIMD), Direct Memory Access (DMA), graphics hardware, ray-casting, Digital Elevation Model (DEM).
1. Introduction
The advent of commercially available high
resolution satellite imagery and USGS digital
elevation models (DEM) has greatly enhanced
our ability to visualize the world around us. In
this paper, we describe an interactive system, the
Terrain Rendering Engine (TRE) for visualizing
terrain in 3D with height data and satellite image
overlay using a software implementation of ray-casting optimized for the Cell Broadband Engine
(CBE). The system implements a client/server
model where the client directs the server to
render, compress, and deliver images at
interactive speeds for remote display. We show that by leveraging all the parallel processing capabilities of the CBE such a system can achieve interactive frame rates at HDTV resolutions.

In our application, we have two sets of input: a digital elevation model (DEM), which consists of a uniform grid containing elevation data, and a satellite image that contains color data for each grid position. Figure 1.1 shows the two inputs we have for Mt. Rainier in Washington State. The DEM can be represented as a grey scale image and the color satellite image can be saved as a regular JPEG.
Figure 1.1
Digital Elevation Model and Color Satellite Image
In order for a traditional graphics processing unit (GPU) to render this data it must be preprocessed into a more digestible form. This typically involves decomposing the height field specified by the DEM into triangle meshes with pre-computed texture and normal information per triangle vertex. This process results in a huge amount of data that is then transferred redundantly (rows of shared vertices are transferred multiple times) over an IO bus to the GPU for rendering. Data relaxation techniques can be used to reduce this volume of data, but they also reduce model detail. Furthermore, many of the triangles communicated to the GPU are not even visible in the finished image. To handle this optimally, the CPU must either remove some of the non-visible triangles via system-side view frustum culling, leaving the rest to GPU back-face and depth-buffer rejection, or communicate the entire scene to the GPU each frame and accept the degradation in performance.
Figure 1.2
TRE Ray-casting output
Rendering the terrain directly with ray-casting [Figure 1.2] reduces the scene's memory footprint by removing the need to decompose the height field into triangles and store pre-computed normals. Furthermore, ray-casting only processes the visible terrain fragments, thereby eliminating the need to cull and depth buffer fragments. However, today's general purpose CPUs can't handle such rendering tasks at HDTV resolutions and interactive frame rates. By leveraging the parallel processing capabilities of the CBE, we can finally ray-cast large 3D terrains in real-time.
This paper is organized as follows: section 2 contains background information on relevant aspects of the Cell Broadband Engine architecture; section 3 presents an overview of the ray-casting algorithm used in this implementation; section 4 gives a description of the TRE system, including a description of the client/server model; section 5 provides a detailed description of our ray-casting implementation and optimization on Cell; section 6 provides detailed information on our image compression technique; section 7 describes a technique called SPE repurposing, which we use to increase the SPEs' utilization; section 8 presents the TRE's performance results; and section 9 contains the conclusion and future work.
2. Cell Broadband Engine Overview

The Cell Broadband Engine (CBE) consists of one dual-threaded Power Processor Element (PPE) and eight SIMD Synergistic Processor Elements (SPE), along with an on-chip memory controller (MIC) and an on-chip controller for a configurable I/O interface (BIC), interconnected by a coherent on-chip Element Interconnect Bus (EIB) [8].

Figure 2.1
Cell Broadband Engine

The key attributes of this concept are:
• A high-frequency design, allowing the processor to operate at a low voltage and low power while maintaining high performance [5].
• Power Architecture® compatibility, to provide a conventional entry point for programmers, for virtualization and multi-OS support, and to be able to leverage IBM's experience in designing and verifying symmetric multiprocessors (SMP).
• SIMD architecture, supported by both the vector (VMX) extensions on the PPE and the SPE, as one of the means to improve game/media and scientific performance at improved power efficiency.
• A power and area efficient PPE that implements the Power Architecture ISA and supports the high design frequency.
• SPEs with local memory, asynchronous coherent DMA, and a large unified register file, in order to improve memory bandwidth and to provide a new level of combined power efficiency and performance [6][7].
• A high-bandwidth on-chip coherent fabric and high bandwidth memory, to deliver performance on memory bandwidth intensive applications and to allow for high-bandwidth on-chip interactions between the processor elements.
• High bandwidth flexible I/O configurable to support a number of system organizations, including a single BE configuration with dual I/O interfaces, and a glue-less coherent dual processor configuration.
• A full custom modular implementation to maximize performance per watt and performance per square mm of silicon and to facilitate the design of derivative products.
These features enable the CBE to excel at many
rendering tasks including ray-casting. The key
to performance on the CBE is leveraging the
SPEs efficiently by providing multiple
independent data parallel tasks that optimize well
in a SIMD processing environment.
3. Ray-casting Overview
The TRE uses a highly optimized height field
ray-caster to render images. Ray-casting, like
ray-tracing, follows mathematical rays of light
from the viewer’s virtual eye, through the view
plane, and into the virtual world. The first
location where the ray is found to intersect the
terrain is considered to be the visible surface.
Unlike ray-tracing, which generates additional secondary rays from the first point of intersection to evaluate reflected and refracted light, ray-casting evaluates lighting with a more conventional approach. At this first point of
intersection a normal is computed to the surface
and a surface shader is evaluated to determine
the illumination of the sample. Ray-casting has
the performance advantage, unlike polygon
rendering, of only evaluating the visible samples.
The challenge of producing a high performance
ray-caster is finding the first intersection as
efficiently as possible. While ray-casting is
inherently a data-parallel computational problem,
the entire scene must be considered during each
ray intersection test.
The TRE is optimized to render height fields, which are 2D arrays of height values that represent the surface to be viewed. A technique called "vertical ray coherence" [1] was used to optimize the ray-casting of such surfaces. This optimization uses "vertical cuts", which are half-planes perpendicular to the plane of the height map, all radiating out from the vertical line containing the eye point. Each such vertical cut slices the view plane in a not-necessarily-vertical line. The big advantage is that all the information necessary to compute all of the samples in that vertical cut is contained in a close neighborhood of the intersection between the vertical cut and the height map. Furthermore, each sample in the vertical cut, when computed from bottom to top, only needs to consider the height data found in the buffer starting at the last known intersection. These two optimizations (see the sketch after this list):
• Greatly reduce the amount of height data that needs to be considered when processing each vertical cut of samples.
• Continue to reduce the data that needs to be considered as the vertical cut is processed.
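To make the second optimization concrete, the following scalar sketch in C shows how each successive ray in a cut resumes its search at the previous ray's intersection. All names here (ray_tangent, shade_sample, step_distance, and the bounds) are hypothetical; the real kernel is SIMD and operates on buffered height data.

    /* Scalar sketch of vertical ray coherence: rays in a cut are
       processed bottom to top, so each search can resume at the
       height-map step where the previous ray hit. */
    int start_step = 0;
    for (int ray = 0; ray < rays_in_cut; ray++) {
        float tan_elev = ray_tangent(cut, ray);  /* hypothetical helper */
        int step;
        for (step = start_step; step < max_steps; step++) {
            float ray_h = eye_height + step_distance(step) * tan_elev;
            if (ray_h <= height_along_cut[step]) {   /* first hit */
                shade_sample(cut, ray, step);        /* hypothetical */
                break;
            }
        }
        start_step = step;  /* a higher ray cannot hit any sooner */
    }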
4. TRE Application

The TRE was implemented using a client/server model. The client, implemented on an Apple G5 system, is connected to the server via gigabit Ethernet. The client specifies the terrain,
viewer position and view vector, and rendering
parameters to the TRE server which in turn
computes the 3D images and streams the
compressed images back to the client for display.
We have designed the interface between the
client and the CBE server to be flexible enough
so that we can replace the Apple client with any
light-weight device that has a display and a
framebuffer (e.g., a wireless PDA or PlayStation Portable).
4.1 TRE Server
The TRE server, implemented on the STIDC CBE bring-up boards, runs in both uni-processor (UP) and symmetric multi-processor (SMP) modes. The server scales across as many SPEs as are available in the system. When the server starts rendering a frame, one SPE is reserved for image compression while the other available SPEs execute ray-casting kernels. Once the frame compression is complete the reserved SPE is repurposed for ray-casting. Three threads execute on the available Multi-Threading (MT) slots of the PPE. The three threads' responsibilities break down as follows:
1. Frame preparation and SPE work communication.
2. Network tasks involving frame delivery.
3. Network tasks involving client communication.
Thread one requires the most processor resources, followed by threads two and three, so threads two and three share the same processor thread affinity in the UP model.
The server implements a three frame deep
pipeline (Figure 4.1.2) to exploit all the
parallelism in the CBE.
Figure 4.1.2
Server Image Pipeline (PPE: Prep Frames 0-3; SPE 0-6: Render Frames 0-2; SPE 7: Encode Frames 0-1; over time)
In stage one, the image preparation phase, PPE
thread one decomposes the view screen into
work blocks based on the vertical cuts dictated
by the view parameters. Stage two, the sample
generation phase, is where the SPEs decompose
the vertical cuts into rays, ray-cast the rays,
shade the samples, and store the samples into the
accumulation buffer. In stage three, the image
compression/network phase, the compression
SPE encodes the finished accumulation buffer
and PPE thread two delivers it to the network
layer. All three stages are simultaneously active on different execution cores of the chip, maximizing image throughput.
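A minimal sketch of that pipelining, with hypothetical dispatch and synchronization helpers (the dispatch calls are assumed to be asynchronous):

    /* Hypothetical sketch of the three-deep pipeline: in steady state,
       frame n is prepared while frame n-1 renders and frame n-2 is
       encoded and delivered, each on different cores. */
    for (int n = 0; n < num_frames; n++) {
        prep_work_blocks(n);                /* stage 1: PPE thread one  */
        dispatch_to_spes(n);                /* stage 2: ray SPEs, async */
        if (n >= 1) start_encode(n - 1);    /* stage 3: encoder SPE     */
        if (n >= 2) deliver_frame(n - 2);   /* PPE thread two, network  */
    }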
4.2 TRE Client

The TRE client directs the rendering of the TRE server. We picked the Apple G5 as the platform on which to develop our client for aesthetic reasons only. The client is very light-weight and could easily be deployed on any device that has a framebuffer and enough computational power to receive and decode our images. The current TRE client provides the following facilities:
• Input data loading and delivery to the server
• Smooth path generation, both user directed and random
• Joystick directed flight simulation
• Rendering parameter modification
  o Sun angle
  o Lighting parameters
  o Fog and haze control
  o Bump mapping control
  o Output image size
  o Number of SPEs to render
• Server connection and selection
• Map cropping
• Streaming image decompression and display
5. Ray-casting on CBE
When implementing ray-casting of height fields we use vertical ray coherence to break the rendering task down into data parallel work blocks, or vertical cuts of screen space samples, using the PPE, and dispatch the blocks to ray kernels running on the SPEs. It is key that the SPEs work as independently as possible so as to not overtask the PPE, given the 8 to 1 ratio, and we don't want the SPEs' streaming data flowing through the PPE's cache hierarchy. These work blocks are therefore mathematical descriptions of the vertical cuts based on the current view
parameters. In Figure 5.1, where the green plane
intersects the gray height map are the locations
of the input height data needed to process the
vertical cut of samples. Where the green plane
intersects the blue view screen are the locations
of the accumulation buffer to be modified.
Figure 5.1
The PPE communicates the range of work blocks
each SPE is responsible for via base address,
count, and stride information once per frame.
Each SPE then uses this information to compute
the address of each work block which is then
read into local store via a DMA read and
processed. This decouples the PPE from the
vertical cut processing allowing it to prep the
next frame in parallel. The SPEs place the
output samples in an accumulation buffer using a
gather DMA read, modify, and a scatter DMA
write. Each SPE is responsible for four regions
of the screen. Vertical cuts are processed in a
round robin fashion, one vertical cut per region,
and left to right within each region. No
synchronization is therefore needed on the
accumulation buffer read/modify/write as no two
SPEs will ever attempt to modify the same
locations in the accumulation buffer even with
two vertical cuts in flight (double buffering) per
SPE. The SPEs are free to run at their own pace
processing each vertical cut without any data
synchronization on the input or output. None of
the SPE’s input or output data is touched by the
PPE, thereby protecting the PPE’s cache
hierarchy from transient streaming data.
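A sketch of the SPE side of this handshake, using the Cell SDK's MFC intrinsics (BLOCK_BYTES, TAG, and process_cut are hypothetical names):

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define BLOCK_BYTES 128   /* hypothetical work block size */
    #define TAG 0

    extern void process_cut(void *block);   /* hypothetical kernel */

    /* Walk the base/count/stride range received once per frame and
       pull each work block into local store with a DMA read. */
    void run_work_range(uint64_t base, uint32_t count, uint32_t stride)
    {
        static char block[BLOCK_BYTES] __attribute__((aligned(128)));
        for (uint32_t i = 0; i < count; i++) {
            uint64_t ea = base + (uint64_t)i * stride;   /* block address */
            mfc_get(block, ea, BLOCK_BYTES, TAG, 0, 0);  /* async read    */
            mfc_write_tag_mask(1 << TAG);
            mfc_read_tag_status_all();   /* wait for the block */
            process_cut(block);
        }
    }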
5.1 Ray Kernel
The SPE ray kernel is responsible for decomposing each vertical cut into rays, computing the ray/terrain intersections, evaluating the surface shader at each intersection, and updating the accumulation buffer with the new samples.
This process begins with a command parser that looks for work in the mailbox and dispatches the proper code fragment. The implemented commands include (a sketch of the parser follows the list):
• Update Rendering Context
• Render Vertical Cuts
• Repurpose to Encoder Kernel
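A sketch of such a parser, assuming the SDK's blocking mailbox read; the command codes and handler names are hypothetical:

    #include <spu_mfcio.h>

    enum { CMD_UPDATE_CONTEXT = 1, CMD_RENDER_CUTS, CMD_REPURPOSE };

    extern void update_rendering_context(void);  /* hypothetical */
    extern void render_vertical_cuts(void);      /* hypothetical */

    /* Block on the inbound mailbox and dispatch each command to the
       matching code fragment; "repurpose" exits to the loader. */
    void command_loop(void)
    {
        for (;;) {
            switch (spu_read_in_mbox()) {   /* blocks until PPE writes */
            case CMD_UPDATE_CONTEXT: update_rendering_context(); break;
            case CMD_RENDER_CUTS:    render_vertical_cuts();     break;
            case CMD_REPURPOSE:      return;  /* back to crt0 loader */
            }
        }
    }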
The “Update Rendering Context” command
signals the SPE to DMA in a new block of
rendering state which includes such information
as shader parameters for the lighting model and
atmospheric effects. This command can be
issued on a per frame basis as the “Render
Vertical Cuts” command describes a frame’s
worth of vertical cuts to each SPE. The SPE
starts the rendering process by issuing a DMA to
bring in the first work block describing a vertical
cut. DMA lists are then computed to load the
required height/color and accumulation data.
Once the data is present in local store the
rendering loop is executed which decomposes
each vertical cut into rays, finds the ray/terrain
intersections, evaluates the shader at each
intersection, and updates the accumulation buffer
with new samples.
The two blocks of code that account for the majority of ray kernel cycles are the intersection test and the shader. The intersection test is broken into two phases: the first searches for the initial ray's intersection and the second more carefully finds each intersection for rays in the vertical cut thereafter.
Figure 5.1.1
Ray Kernel Sample Generation (Ray Intersection → Shader: Texture Filtering, Normal Generation, Bump Mapping, Lighting, Atmosphere → Sample Accumulation → Multi-Sample Adjustment)
The shader provides both terrain illumination
and atmospheric effects. The current shader
contains the following functionality:
• Texture filtering via a two by two color neighborhood
• Surface normal computation via a four by four height neighborhood
• Bump mapping via a normal perturbation function [3]
• Surface illumination via a diffuse reflection and ambient lighting model
• Visible sun plus halo effects
• Non-linear atmospheric haze
• Non-linear ground fog
• Resolution independent clouds computed via multiple octaves of Perlin noise evaluated on the fly [4]
Dynamic multi-sampling is implemented by adjusting the distance between rays along the vertical axis (vertical cut) based on the under- or over-sampling of the prior four rays [2]. The sampling rate along the horizontal axis is a function of the size of the image, and is computed to ensure that no pixel is skipped over, even after accounting for round-off error.
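A sketch of the feedback rule this describes; the target, thresholds, and clamp range are hypothetical (the benchmark settings in section 8 correspond to roughly 2-32 samples per pixel):

    #define MIN_SPACING 0.25f  /* hypothetical clamp range */
    #define MAX_SPACING 4.0f

    /* Adjust the vertical ray spacing from the sampling history of the
       prior four rays: shrink it when pixels were under-sampled, grow
       it when they were over-sampled, clamped to a fixed range. */
    float adjust_spacing(float spacing, float avg_samples_per_pixel)
    {
        const float target = 4.0f;              /* hypothetical target */
        if (avg_samples_per_pixel < target)             /* under-sampled */
            spacing *= 0.5f;
        else if (avg_samples_per_pixel > 2.0f * target) /* over-sampled  */
            spacing *= 2.0f;
        if (spacing < MIN_SPACING) spacing = MIN_SPACING;
        if (spacing > MAX_SPACING) spacing = MAX_SPACING;
        return spacing;
    }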
The compute and data fetch phases of the ray kernel execute in parallel by exploiting the asynchronous execution of the SPE's DMA engine (SMF) and SIMD core (SPU). Multiple input and output buffers are used to decouple the two phases. The ray kernel processes the data using double buffered inputs and outputs to hide the memory latency and data load time. Height/color and accumulation data are brought in by the SMF for vertical cut n+1 while samples are computed by the SPU for vertical cut n. As a result, the SPU spends less than 1% of its execution time waiting for DMA transfers.
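A sketch of the double-buffered loop using the SDK's tag-group primitives; CUT_BYTES, cut_ea, and render_cut are hypothetical names:

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CUT_BYTES 2048   /* hypothetical per-cut transfer size */

    extern uint64_t cut_ea(int n);       /* hypothetical address helper */
    extern void render_cut(void *data);  /* hypothetical compute phase  */

    static char buf[2][CUT_BYTES] __attribute__((aligned(128)));

    /* The SMF fetches cut n+1 on one tag while the SPU computes on
       cut n from the other buffer, hiding the DMA latency. */
    void render_all_cuts(int num_cuts)
    {
        mfc_get(buf[0], cut_ea(0), CUT_BYTES, 0, 0, 0);   /* prime */
        for (int n = 0; n < num_cuts; n++) {
            int cur = n & 1, nxt = cur ^ 1;
            if (n + 1 < num_cuts)   /* prefetch the next cut */
                mfc_get(buf[nxt], cut_ea(n + 1), CUT_BYTES, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();   /* wait on cut n only */
            render_cut(buf[cur]);
        }
    }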
Texture color data is interleaved with height data in system memory: 16 bits of color (5/6/5) and 16 bits of height. The surface shader requires a four by four height neighborhood to compute the surface normal and a two by two neighborhood for texture filtering. The height/color DMA list gathers blocks of four height/color pairs along the intersection of the vertical cut plane and the height/color plane [Figure 5.1.2].
Figure 5.1.2
Height Color Data Memory Layout

These 128 bit (quad word) quantities are word aligned in system memory and therefore can't be directly fetched by the SPE's quad word optimized DMA engine. To handle this situation the height/color DMA list is constructed to fetch 256 bits of data starting at the naturally aligned quad word address of the desired word aligned 128 bit quantity. Once the data has been gathered into local store, pairs are run through the shuffle byte engine to reconstruct the desired 128 bit quantity for the shader [Figure 5.1.3].

Figure 5.1.3
SPE Unaligned Quad Word Load
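A sketch of the reconstruction step with the SPU shuffle-byte intrinsic; the pattern table is indexed by the word offset of the desired data within the first fetched quadword:

    #include <spu_intrinsics.h>

    /* Rebuild a word-aligned 128-bit quantity from the two naturally
       aligned quadwords the DMA list fetched.  In a shuffle pattern,
       byte values 0-15 select from 'lo' and 16-31 select from 'hi'. */
    static const vec_uchar16 patterns[4] = {
        {  0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15 },
        {  4, 5, 6, 7, 8, 9,10,11,12,13,14,15,16,17,18,19 },
        {  8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 },
        { 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27 },
    };

    vec_uchar16 load_unaligned(vec_uchar16 lo, vec_uchar16 hi,
                               unsigned byte_offset /* 0, 4, 8, 12 */)
    {
        return spu_shuffle(lo, hi, patterns[byte_offset >> 2]);
    }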
Samples are computed in a SIMD fashion by searching for four intersections at a time [Figure 5.1.4] and when all four are located they are evaluated in parallel by the surface shader. Rays are packed four at a time into the single precision floating point channels of each vector register. All ancillary information needed for the shader is packaged in the same format using additional vector registers. When an intersection is found for a ray the channel containing the ray is locked for all registers participating in the rendering operations using select instructions.
Figure 5.1.4
SIMD Ray-casting (four rays packed into the channels of a 128-bit vector)
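A sketch of that channel-locking idiom with SPU compare/select intrinsics; the height sampling helper and ray setup are hypothetical, and a real loop also bounds the step count:

    #include <spu_intrinsics.h>

    extern vec_float4 sample_heights(vec_float4 t);   /* hypothetical */

    /* March four rays at once; once a channel's ray intersects, a
       select mask freezes that channel while the others continue. */
    vec_float4 find_hits(vec_float4 t_start, vec_float4 t_step,
                         vec_float4 tan_elev, vec_float4 eye_h)
    {
        vec_uint4  done  = spu_splats(0u);     /* per-channel hit flags */
        vec_float4 hit_t = spu_splats(0.0f);   /* latched hit distances */
        vec_float4 t     = t_start;            /* four ray parameters   */

        while (spu_extract(spu_gather(done), 0) != 0xF) {  /* all hit? */
            vec_float4 ray_h     = spu_madd(t, tan_elev, eye_h);
            vec_float4 terrain_h = sample_heights(t);
            vec_uint4  hit = spu_cmpgt(terrain_h, ray_h);   /* new hits */
            hit   = spu_andc(hit, done);   /* ignore already-hit channels */
            hit_t = spu_sel(hit_t, t, hit);        /* latch t on new hits */
            done  = spu_or(done, hit);
            t     = spu_sel(spu_add(t, t_step), t, done);  /* freeze hits */
        }
        return hit_t;   /* four latched intersection distances */
    }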
The resulting IBM XLC compiled C code
occupies 29KB of local store and uses 204KB of
buffer space plus 4KB of stack [Figure 5.1.5].
Height/color (HC) DMA lists are implemented in
a destructive fashion where the list is constructed
in the return data buffer and is written over, or
destroyed, as the DMA list is executed and data
is returned to the buffer. The accumulation
buffer operations are read/modify/write so the
DMA lists used to gather/read the data are
preserved and re-executed to scatter/write the
modified data back to the accumulation buffer.
Figure 5.1.5
Ray Kernel Local Store Memory Allocation (double buffered: Height Color Data 57 KB each, HC DMA List 13 KB each, Accumulation Data 30 KB each, Accum DMA List 15 KB each; Stack 4 KB; Code 29 KB)
6. Image Compression
In addition to the SPEs running the ray kernels, one SPE is repurposed during the frame time for image compression. The compression kernel operates on the finished accumulation buffer, which is organized as a column major 2D array of single precision floating point (red, green, blue, count) data, one float per channel (128 bits per pixel). The count channel contains the number of samples accumulated in the pixel. The image compression kernel is responsible for the following tasks:
• Normalization of the accumulation buffer
• Compression of the normalized buffer
• Clearing of the accumulation buffer
The compression kernel reads sixteen by sixteen
pixel tiles from the accumulation buffer into
local store using DMA lists. These tiles are then
normalized by dividing the color channels by the
sample count for each pixel. The tile is then
compressed using a multi-stage process
involving color conversion, discrete cosine
transformation (DCT), quantization, and run
length encoding. The resulting compressed data is then written to a holding buffer in local store which, when full, is DMA written back to system memory for network delivery. As each tile is processed a tile of zeros is returned to the accumulation buffer, providing the clear operation.
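A sketch of the normalization pass over one tile, following the pixel layout above; the guard against empty pixels is our assumption:

    #include <spu_intrinsics.h>

    #define TILE 16   /* tiles are sixteen by sixteen pixels */

    /* Each pixel is (red, green, blue, count), one float per channel.
       Divide the color channels by the sample count; a count of zero
       (no samples) leaves the pixel black. */
    void normalize_tile(vec_float4 tile[TILE * TILE])
    {
        for (int i = 0; i < TILE * TILE; i++) {
            float count = spu_extract(tile[i], 3);
            float inv   = (count > 0.0f) ? 1.0f / count : 0.0f;
            tile[i] = spu_mul(tile[i], spu_splats(inv));
        }
    }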
7. SPE Repurposing
Repurposing is a way to transfer unused cycles from an idle SPE kernel type to an active kernel type during a given frame. Each TRE SPE kernel includes the capability to repurpose itself to the other kernel type. For example, if a "repurpose" command is issued to an SPE running the ray kernel it will finish its current ray kernel work, load the encoder kernel code, and begin encoder execution. Any commands behind the "repurpose" command in the mailbox will be interpreted as encoder kernel commands. The repurpose SPE thread command is implemented as a list of programs passed to the thread upon creation. When a "repurpose" command is issued the SPE thread returns execution to a special SPE crt0 that DMAs in the next program's code and executes the new main. This provides a very light-weight, application controlled SPE context switch that, in our case, can be issued multiple times per frame, giving us the flexibility to use any SPE as our encoder. Since the TRE server dynamically distributes ray-casting work across all the SPEs based on which SPE finishes first and which SPE finishes last, the repurposed SPE gets less work when the encoding load is larger (high frequency image) and more work when the encoding load is smaller (low frequency image). Work rebalancing takes place after each frame so the work given to the repurposed SPE is rarely out of balance.
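A sketch of what such a crt0-style loader might do; the addresses and sizes are hypothetical, and a real loader must also split transfers at the 16 KB DMA limit:

    #include <stdint.h>
    #include <spu_mfcio.h>

    typedef void (*entry_fn)(void);

    /* Overlay the next kernel: DMA its image from system memory over
       the local store code region and jump to its entry point. */
    void repurpose(uint64_t image_ea, uint32_t image_bytes,
                   void *code_region)
    {
        mfc_get(code_region, image_ea, image_bytes, 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();       /* wait for the new image */
        ((entry_fn)code_region)();       /* execute the new main   */
    }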
8. TRE Performance
The TRE code base was initially implemented on
Apple G5 systems using SIMD VMX engines.
This provided a clean porting path to the CBE as
both the SPE and the G5’s VMX engine have
similar instruction sets. The SIMD optimization
effort on the G5 resulted in a 3.25x speedup over
the scalar code base showing that we were in fact
computationally bound on the G5.
The CBE PPE is used only as a control processor in the TRE and therefore has little impact on the TRE's overall performance. The overall programming goal for the TRE PPE threads was to not have them touch any of the SPEs' streaming data and to provide little if any of the computational load. Only a basic tuning effort was therefore expended on this code.
The SPE threads, on the other hand, were highly optimized for computational efficiency, data throughput, and code/data footprint. All of our performance gains are the direct result of the heavy lifting done by the simple SPE processors. The SPE ray kernel was originally written for the G5 using only a subset of the VMX intrinsics that mapped directly to the SPE. This allowed us to verify the algorithms at full speed and then directly port them to the SPE with minimal effort. Performance analysis was then performed on the CBE implementation using our system simulator. Hot spots were identified and aggressively optimized using loop unrolling, branch removal via select operations, immediate field instruction encodings, code interleaving, and SPE instruction pipeline rebalancing. Functional blocks found not to be performance critical were squeezed in the opposite direction for reduced code footprint. Since the SPE is a dual issue machine the theoretical optimum cycles per instruction (CPI) is 0.5. The optimized ray kernel executes with a CPI range of 0.7 to 0.8 and occupies only 29KB of local store space.
The volume of scattered data gathered into SPE local stores created a problem for the SPE translation lookaside buffer (TLB). We found that with 4KB pages DMAs were bound by TLB misses as a result of ray kernel DMA lists accessing very large numbers of pages. All streaming data was placed in 16MB pages to alleviate this problem.
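For illustration, on a current Linux kernel the same effect can be had with a huge-page mapping (MAP_HUGETLB postdates this work, which would have used a hugetlbfs mount; the flags and fallback are assumptions):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Back the streaming height/color data with large pages so that
       scattered DMA-list gathers touch far fewer TLB entries. */
    void *alloc_streaming_buffer(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;   /* needs reserved pages */
    }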
Since we wrote an optimized VMX version of
our server for the Apple G5 as part of our
development process we thought it would be
interesting to compare the performance
differences of this rendering task. We didn’t
implement an encoder for the G5 renderer since
it ran as part of our client and therefore had
direct access to the framebuffer.
In our
benchmark the Apple has a distinct advantage
given it only has to render and display the
uncompressed frames. The CBE server, on the
other hand, has to render, compress, and deliver
each frame. Even with this handicap Cell
outperforms the local G5 implementation by a
factor of 50x.
The server has many rendering parameters that
affect performance including:
• Output image size
• Map size
• Visibility to full fog/haze
• Multi-sampling rate
For benchmarking purposes the following values
were selected:
• 1280x720 (720p) output image size
• 7455x8005 map size
• 2048 map steps to full haze
• 1.33 x (2 - 8 dynamic), or ~2-32 samples per pixel
With these settings the following rendered frame
rates were measured:
System                                              Frames per second
2.0 GHz Apple G5, 1GB memory (no image encode)      0.6
3.2 GHz UP Cell, 512MB memory                       30
3.2 GHz 2-way SMP Cell, 1GB memory                  58
9. Conclusion and Future Work
In this paper, through the TRE system, we have
outlined how a compute intensive graphics
technique, ray-casting, can be decomposed into
data parallel work blocks and executed
efficiently on multiple SIMD cores within one or
multiple Cell Broadband Engines. We have also
shown that by pipelining the work block
preparation, execution, and the delivery of
results, additional parallelism within the CBE
can be exploited. One order of magnitude
performance improvements over similarly
clocked single threaded processors can be
achieved when computational problems are
decomposed and executed in this manner on the
CBE. Graphics techniques once considered
offline tasks can now be executed at interactive
speeds using a commodity processor.
The TRE framework and the CBE present many avenues for future work:
• Embed the computed depth information for each pixel in the returned framebuffer. By preloading this depth information into the z-buffer, we can combine the ray-cast scene with GPU rasterized objects.
• Level of Detail rendering.
• Interactive modification of the terrain. This will enable us to simulate shock waves, add fluid simulation to the lakes, avalanches to the snow, and other simulations that require real-time changes in terrain geometry.
References

[1] Cheol-Hi Lee, "A Terrain Rendering Method Using Vertical Ray Coherence", The Journal of Visualization and Computer Animation.

[2] Cohen-Or, D., Rich, E., Lerner, U., and Shenkar, V., "A real-time photo-realistic visual flythrough," IEEE® Trans. Vis. Comp. Graph., Vol. 2, No. 3, September 1996.

[3] Blinn, "Simulation of Wrinkled Surfaces", Computer Graphics (Proc. Siggraph), Vol. 12, No. 3, August 1978, pp. 286-292.

[4] Perlin, Ken, "Making noise," http://www.noisemachine.com/talk1/

[5] H. Peter Hofstee, "Power Efficient Processor Architecture and the Cell Processor", to appear, Proceedings 11th Conference on High Performance Computing Architectures, Feb. 2005.

[6] B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H-J. Oh, S. M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, N. Yano, "The Microarchitecture of the Streaming Processor for a CELL Processor," IEEE International Solid-State Circuits Symposium, Feb. 2005.

[7] T. Asano, T. Nakazato, S. Dhong, A. Kawasumi, J. Silberman, O. Takahashi, M. White, H. Yohsihara, "A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor", IEEE International Solid-State Circuits Symposium, Feb. 2005.

[8] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, K. Yazawa, "The Design and Implementation of a First-Generation CELL Processor," IEEE International Solid-State Circuits Symposium, Feb. 2005.
© IBM Corporation 2005
IBM Corporation
Systems and Technology Group
Route 100
Somers, New York 10589
Produced in the United States of America
May 2005
All Rights Reserved
This document was developed for products and/or services
offered in the United States. IBM may not offer the products,
features, or services discussed in this document in other
countries.
The information may be subject to change without notice.
Consult your local IBM business contact for information on
the products, features and services available in your area.
All statements regarding IBM future directions and intent are
subject to change or withdrawal without notice and represent
goals and objectives only.
IBM, the IBM logo, and Power Architecture are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.
A full list of U.S. trademarks owned by IBM may be found
at:
http://www.ibm.com/legal/copytrade.shtml.
IEEE and IEEE 802 are registered trademarks in the United
States, owned by the Institute of Electrical and Electronics
Engineers. Other company, product, and service names may
be trademarks or service marks of others.
Photographs show engineering and design models. Changes may be incorporated in production models. Copying or downloading the images contained in this document is expressly prohibited without the written consent of IBM.

All performance information was determined in a controlled environment. Actual results may vary. Performance information is provided "AS IS" and no warranties or guarantees are expressed or implied by IBM.
THE INFORMATION CONTAINED IN THIS
DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no
event will IBM be liable for damages arising directly or
indirectly from any use of the information contained in this
document.