Virtex Architecture NDA Presentation

Redefining the FPGA
The first fully programmable system solution
designed specifically for intellectual property.
Agenda
 Technology Roadmap
 Redefining the FPGA
 Architecture Overview

The CLB Tile, Vector Based Interconnect, Internal Bus
Support, SelectRAM+, Clocking & DLLs, SelectI/O,
Thermal Management & The SelectMAP Interface
 Software & Cores Support
 Summary - A System Level Solution
Technology Roadmap
Density/Performance by family (process; device), timeline axis 1995 to 1999:
XC4000E: 2LM - 0.5µm (XC4025E)
XC4000EX: 2LM - 0.5µm (XC4036EX)
XC4000XL: 3LM - 0.35µm (XC4085XL)
XC4000XV: 3LM - 0.25µm (XC40250XV)
Virtex: 5LM - 0.25µm (7LM - 0.18µm), 1 Million+ System Gates with High Performance System Solution
Redefining the FPGA
Chip 1
Chip 2
133MHz SDRAM
3
1x CLK
SRAM Cache (Mbytes)
2x CLK
LVCMOS
SSTL3
4
LVTTL
GTL+
1
2
Low Voltage
CPU
High Speed System Backplane
"Virtex moves FPGAs from
glue to system component"
Redefining the FPGA
2
System
Integration
1
4
System Memory
3
System
Timing
System Interfaces
Value Extends Beyond the Socket
Redefining the FPGA
Advanced Process Technology Allows for
Almost 10x the Density of Today’s FPGAs
System
Integration
Extremely Dense
2ns
1
2ns
1,728 to 27,648 Logic Cells
Predictable Routing Delays Produce
a Core Friendly Architecture
With Fast Place & Route Times
Redefining the FPGA
2
System Memory
200 MHz Distributed SelectRAM
200 MHz Block SelectRAM
RAMB4_S4_S16
200 MHz Access to External Memory
WEA
ENA
RSTA
CLKA
ADDRA[9:0]
DIA[3:0]
DOA[3:0]
WEB
ENB
RSTB
CLKB
ADDRB[7:0]
DIB[15:0]
DOB[15:0]
Redefining the FPGA
CLKDLL
CLKIN
CLKFB
CLK0
CLK90
CLK180
CLK270
90 MHz
DLL
CLK2X
CLKDV
LOCKED
RST
3
CLK
DLL
DLL
Virtex
Route to Other Devices
45 MHz
(Divide by 2)
180 MHz
(Multiply by 2)
System
Timing
Redefining the FPGA
5.0V
1.8V
PCI
3.3V
2.5V
SelectI/O Allows Connection
Directly to External Signals of
Varied Voltages & Thresholds
SSTL HSTL
Future Standards Can be
Supported Without Having
to Make Silicon Changes
4
GTL
System Interfaces
GTL+ AGP
Redefining the FPGA
1 System Integration

Intellectual Property is Critical for High Density Design &
Must Drop in Easily Without Penalty Across an Entire Family
2 System Memory


Memory Bandwidth is Always Key
Size & Depth Requirements Vary Depending on the
Application
3 System Timing


Chip to Chip Performance Typically Limits System Speeds
Clock Skew is an Important Factor in High Performance
Systems
4 System Interfaces


Process Technology Leads to Mixed Voltage Systems
High performance, Lower Power Signal Standards Have
Emerged
Redefining the FPGA
New Modules
IP Modules
AllianceCore
133MHz
SDRAM
VHDL Design
Environment
Verilog Design
Environment
Designer #1
Designer #2
CoreGen
DSP
FIFO
Design
Reuse
Giga-bit
Ethernet
CPU
LogiCore
66MHz
PCI
Virtex
160 MHz I/O
133 MHz Memory
1 Million+
System Gates
Redefining the FPGA
 Extremely Dense


50,000 to 1,000,000 System
Gates
1,728 to 27,648 Logic Cells
 System Performance & Features



160 MHz+ System Performance
Multiple DLLs & Block
SelectRAM
Supports Multiple I/O Standards
IP
Software
 Internal Performance & Features
System
Building Blocks
100 MHz+ at 3 to 4 Logic Levels
TBUFs & Distributed SelectRAM
Fast, Flexible I/Os


 Superior Intellectual Property
Infrastructure - CoreGen & Web
Segmented Routing
4-Input LUT Architecture
Leading Edge Process Technology
 Proven Software Flows for High
Density & Performance - M1.5
The World’s First Fully Programmable System-Level Architecture
Architecture Overview
2ns
RAMB4_S4_S16
2ns
WEA
ENA
RSTA
CLKA
ADDRA[9:0]
DIA[3:0]
DOA[3:0]
WEB
ENB
RSTB
CLKB
ADDRB[7:0]
DIB[15:0]
DOB[15:0]
2
1
Block SelectRAM
The CLB Tile
Thermal Management
SelectMAP
Configuration
Distributed SelectRAM
CLKDLL
GTL
CLK0
GTL+
AGP
CLK90
CLKIN
CLKFB
CLK180
CLKDV
LOCKED
3
DLL
1.8V
3.3V
2.5V
CLK270
CLK2X
RST
5.0V
PCI
SSTL
4
SelectI/O
HSTL
The CLB Tile
Advanced Process Technology Allows for
Almost 10x the Density of Today’s FPGAs
System
Integration
Extremely Dense
2ns
1
2ns
1,728 to 27,648 Logic Cells
Predictable Routing Delays Produce a
Core Friendly Architecture With
Much Faster Place & Route Times
 All CLB Inputs Have Access
to Interconnect on All 4
Sides
INTERNAL BUSSES
CARRY
CARRY
SINGLE
HEX
 CLB Tile is Composed of a
Switch Matrix, Configurable
Logic Block, and Associated
General Routing Resources
LONG
The CLB Tile
TRISTATE BUSSES
LONG
LONG
HEX
HEX
SWITCH
MATRIX
SINGLE
SINGLE
SLICE
DIRECT
CONNECT
Local
Feedback
 Slices Have a Bit Pitch of 2
CLB
CARRY
 Fast Local Feedback Within
the CLB & Direct Connects
to Adjacent Horizontal
Neighbors
SLICE
CARRY
SINGLE
DIRECT
CONNECT
HEX
 Wide Single CLB Functions
LONG
 CLB is Divided into Two
Identical Slices
Simplified CLB Structure
CLB
Slice
LUT
Slice
Carry
PRE
D
Q
CE
LUT
Carry
CLR
LUT
Carry
PRE
D
Q
CE
CLR
PRE
D
Q
CE
CLR
LUT
Carry
PRE
D
Q
CE
CLR
2 Slices in Each CLB


Virtex Slice is Similar in Contents to the Current XC4000 CLB
2 BUFTs Associated with Each CLB, Accessible by All 8 CLB Outputs
Detailed Slice Structure
COUT
G1
G2
G3
G4
A1
A2
A3
A4
O
WS
DI
YB
1
LUT/RAM/ROM/SHIFT
0 1
Y
*
0
1
D
BY
S
Q
YQ
CE
CLK
R
Write
Strobe
Logic
Data In
Multiplex
Logic
CE
SR
GSR
F5 from
other slice
XB
Position of
F5 tap on
other slice
WS
A1
A2
A3
A4
F1
F2
F3
F4
DI
1
0 1
X
O
LUT/RAM/ROM/SHIFT
*
D
0
1
S
Q
XQ
CE
R
* Controlled by the same pair of memory cells
** Implemented as extra inputs on the BX input mux
*** CLK and SR inputs are common to both slices
BX
1 0
CIN
Wide Single CLB Functions
2.5ns
CLB
Slice
Slice
0.3ns
1.1ns
1.1ns
LUT
LUT
Implement 13-Input Functions in a Single CLB


Builds on XC4000 Architecture 9-Input Function
2 Logic Levels and 1 Local Interconnect Yield a 2.5ns Max Delay
Slice Features
 Two 4-Input LUTs in Each Slice
 Includes 2 Highly Flexible Sequential Elements
 Dedicated Logic for 4x1 & 8x1 Muxes
 Fast Look Ahead Carry Logic
 Dedicated Multiplier Fabric
 New SelectShift Feature

Create Shift Registers up to 16 Cycles Deep in a Single 4-Input LUT
 4-Input LUTs can be used as Distributed SelectRAM

Same as XC4000 Synchronous Modes - Single & Dual Port
Flexible Sequential Elements
 Sequential Elements Can be
Flip-flops or Latches
FDRSE
D
S
CE
 2 in Each Slice, 4 in Each CLB
 Can be Sourced from LUTs or
an Independent CLB Input
 Separate Set & Reset Controls
 Controls Can be
Synchronous or
Asynchronous
 GSR Can be Used for
Power On Set/Reset
 All Controls Can be Inverted
 Controls are Shared Within
Each Slice
Q
R
FDCPE
D PRE
Q
CE
CLR
LDCPE
D PRE
CE
G
CLR
Q
Fast Efficient Muxes
 Primary Use of XC4000 HMAP
was to Implement a 2x1 Mux
 Dedicated Muxes are Faster &
More Space Efficient
 Space Freed Up is Used
for Muxes & Other Special
Logic
 MUXF5 Can be Used to
Combine the Two LUTs in a
Slice to Create a 4x1 Mux or
Any Function of 5 Inputs
CLB
Slice
LUT
MUXF6
LUT
MUXF5
Slice
LUT
LUT
 MUXF6 Can be Used to
Combine the Two Slices in a
CLB to Create an 8x1 Mux or
Any Function of 6 Inputs
MUXF5
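The mux structure above can be sketched behaviourally in Python. This is only an illustration, assuming each 4-input LUT is modelled as a 16-entry truth table; the helper names (make_lut, lut_read, mux2, mux4_slice, mux8_clb) are hypothetical and are not Xilinx library primitives.

```python
from itertools import product

def make_lut(func):
    """Build the 16-entry truth table of a 4-input LUT from a Python function."""
    return [func(a1, a2, a3, a4) & 1
            for a4, a3, a2, a1 in product((0, 1), repeat=4)]

def lut_read(table, a1, a2, a3, a4):
    """Look up the LUT output; a1 is the least significant address bit."""
    return table[(a4 << 3) | (a3 << 2) | (a2 << 1) | a1]

# Each LUT is programmed as a 2:1 mux of (d0, d1) selected by s0.
# The fourth LUT input is unused in this sketch.
MUX2_LUT = make_lut(lambda d0, d1, s0, unused: d1 if s0 else d0)

def mux2(sel, in0, in1):
    """Dedicated 2:1 mux, standing in for MUXF5 or MUXF6."""
    return in1 if sel else in0

def mux4_slice(d, s):
    """4x1 mux in one slice: two LUTs combined by MUXF5."""
    low = lut_read(MUX2_LUT, d[0], d[1], s[0], 0)
    high = lut_read(MUX2_LUT, d[2], d[3], s[0], 0)
    return mux2(s[1], low, high)

def mux8_clb(d, s):
    """8x1 mux in one CLB: two slices combined by MUXF6."""
    return mux2(s[2], mux4_slice(d[0:4], s), mux4_slice(d[4:8], s))
```

Here d is a list of data bits and s is the select bits, least significant first, so mux8_clb(d, [1, 0, 1]) returns d[5].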
Fast Look Ahead Carry Logic
0
1
LUT
0
1
LUT
0
1
LUT
0
1
LUT
Simple, Fast & Complete Arithmetic Logic




Vertical, Up Only Carry Direction
Look Ahead Carry Implementation Yields 32-Bit Counters &
Arithmetic Functions that Perform at 100MHz+
Discrete XOR Component for Single Level Sum Completion
2 Separate Carry Chains in CLB Allow for 3 Operand Functions
Dedicated Multiplier Fabric
LUT
A
CY_MUX
CO
S
DI
CI
CY_XOR
MULT_AND
AxB
LUT
B
LUT
Highly Efficient ‘Shift & Add’ Implementation


Logic Added for Implementation of Binary Tree Style Multipliers
30% Reduction in Area for a 16x16 Multiply & 1 Less Logic Level
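A minimal Python sketch of the shift & add idea follows. It assumes unsigned operands and only mirrors the arithmetic: the AND of each bit of A with one bit of B plays the MULT_AND role, and the running sum plays the role of the carry logic. It does not model the binary tree arrangement, area, or logic levels; the function name is illustrative.

```python
def shift_add_multiply(a, b, width=16):
    """Unsigned multiply of two width-bit values by AND partial products and shifted adds."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    result = 0
    for i in range(width):
        b_i = (b >> i) & 1
        # Partial product row: every bit of a ANDed with bit i of b.
        row = a if b_i else 0
        # Add the row at its bit weight (done by the carry chain in hardware).
        result += row << i
    return result
```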
SelectShift
Dynamically Addressable Shift
Registers - DASRs





LUT
Ultra-Efficient Programmable Clock
Cycle Delay
Serial In, Serial Out, Clock, Clock
Enable, and Shift Depth Address
Single LUT Maximum Cycle Delay
of 16
Cascade DASRs for Cycle Delays
Greater than 16
CLB Flip-Flops Can be Used for
Other Functions or to Add to DASR
Depth
IN
CE
CLK
D
Q
CE
D
Q
CE
D
Q
CE
CLB
Slice
Slice
LUT
LUT
LUT
LUT
D
Q
CE
DEPTH[3:0]
OUT
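The DASR behaviour described above can be sketched in Python. This is a behavioural model under stated assumptions: one LUT holds 16 bits, and depth address n selects the bit captured n + 1 clock edges ago (counting the capturing edge). The class and method names are illustrative only.

```python
class SelectShiftModel:
    """Behavioural model of one LUT used as a dynamically addressable shift register."""

    MAX_DEPTH = 16

    def __init__(self):
        # bits[0] is the most recently shifted-in value.
        self.bits = [0] * self.MAX_DEPTH

    def clock(self, data_in, ce=1):
        """Rising clock edge: shift in one bit when the clock enable is high."""
        if ce:
            self.bits = [data_in & 1] + self.bits[:-1]

    def out(self, depth_addr):
        """Read the tap selected by the 4-bit DEPTH address (assumed mapping, see above)."""
        return self.bits[depth_addr & 0xF]
```

Changing depth_addr between clock edges changes the delay without disturbing the stored bits, which is the property the dynamic balancing slides rely on.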
SelectShift
12 Cycles
64
Operation A
Operation B
4 Cycles
8 Cycles
64
Operation C
3 Cycles
9-Cycle Imbalance
3 Cycles
 Register Rich FPGAs Allow for the Addition of
Pipeline Stages to Increase Throughput
 Data Paths Must be Balanced to Maintain Desired
Functionality
SelectShift
12 Cycles
64
Operation A
Operation B
4 Cycles
8 Cycles
Operation C
Operation D - NOP
3 Cycles
9 Cycles
64
Paths Statically
Balanced
12 Cycles
 SelectShift Feature of the 4-Input LUT Can be
Used to Create NOPs

Above Example Uses 64 LUTs to Replace 576 Flip-flops (64*9)
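The arithmetic behind the static balancing example can be checked with a short Python sketch. It assumes one LUT per bit covers up to 16 cycles of delay, as stated for SelectShift; the function name and parameters are illustrative.

```python
import math

def balance_cost(long_path_cycles, short_path_cycles, bus_width, lut_depth=16):
    """Return (imbalance, flip-flops needed, LUTs needed) to balance two data paths."""
    imbalance = long_path_cycles - short_path_cycles
    # One flip-flop per bit per cycle of delay.
    flip_flops = bus_width * imbalance
    # One LUT per bit for every lut_depth cycles of delay (or part thereof).
    luts = bus_width * math.ceil(imbalance / lut_depth)
    return imbalance, flip_flops, luts

# 12-cycle path (Operations A + B) against a 3-cycle path (Operation C), 64 bits wide.
print(balance_cost(12, 3, 64))  # (9, 576, 64)
```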
SelectShift
(continued)
12 Cycles
64
Operation A
Operation B
4 Cycles
8 Cycles
Operation C
3 Cycles
3 Cycles
# NOP Cycles
64
1/10 Cycles
Operation D - NOP
Paths Dynamically
Balanced
SelectShift Depth Can be Dynamically Changed

Above uses 64 LUTs to Replace 704 Flip-flops & 64 2x1 Muxes
Paths Statically
Balanced
Internal Bus Support
 One Pair of BUFTs Associated with Each CLB



Same ‘Pitch’ as Slice Carry Logic - 2 Bits/Slice
Each BUFT has an Independent Control Input
All CLB Outputs can Source Either BUFT Data Input
 Combine BUFTs to Create Wide Muxes

Replace LUT Based Mux Logic to Increase Density
 Much Faster than Previous Architectures


Approximately 10ns to Span Entire XCV1000 - 96
Columns
Ties Groups of 4 BUFTs with Bi-directional Look Ahead
Scheme Similar to Slice Carry Logic
Internal Bus Support
 And-Or Implementation Replaces Three-State Drivers



Simultaneously Driving BUFTs will not Cause Contention
Capacitance of Entire Load Reduced Dramatically
Slow, Power Hungry Pullups & Weak Keepers Unnecessary
 Output Flexibility


Removal of Pullups Allows for Outputs to Span Rows
Segments of 4 Columns Allow for Many Outputs Per Row
High Performance Routing
 General Purpose Routing
2ns
 Routing Delay Depends on
Radial Distance
 Routing Structure
Designed to Handle High
Fanout Nets


2ns
1000+ Loads - Sub 10ns
Much More Predictable
 Predictability is Critical for
Core Integration & Reuse
 Optimized for 5 Layer Metal
CLB Array
High Performance Routing

Significant Compile Time
Reduction Without Performance
Penalty
CARRY
CARRY
SINGLE
HEX
HEX
HEX
SWITCH
MATRIX
SINGLE
DIRECT
CONNECT
SINGLE
SLICE
SLICE
DIRECT
CONNECT
Local
Feedback
CLB
CARRY
 Algorithmically Friendly Structure
LONG
CARRY

LONG
SINGLE

TRISTATE BUSSES
INTERNAL BUSSES
HEX

Allows For Optimal Connection
Delay, Power, Capacitance &
Resource Utilization
Combined With Timing Driven
Place & Route Yields Superior
Path Delays
Increasing Device Utilization Does
Not Decrease Design Performance
Resource Mix Optimized for Large
Devices - Optimized for 5 LM
LONG

LONG
 Segmented Routing Architecture
High Performance Routing
 Advanced Local CLB Routing
 Massive Hierarchical General Routing Resources
Designed For Speed

24 Singles, 72 Hexes, 12 Longs per Tile
(4KXL: 8 Singles, 4 Doubles, 12 Quads, 12 Longs per Tile)



Selective Connectivity Between Resource Types to Limit
Loading
Longs and Hexes Can be Used as Secondary Global
Resources for Clocks and Controls With Sub 10ns
Delays
Special Backbone Routing in Top and Bottom I/O Edges
to Connect Vertical Longs to Create Low Skew
Resources
 Increased Switch Matrix Connectivity

Higher Connectivity Eliminates Congestion
Advanced Local CLB Routing
 Each LUT Output Can Connect to
the Three Other LUTs



100ps to 300ps Maximum Delay
Create 13-Input Functions Within
the Same CLB - 2.5ns Total Delay
Synthesis Tools Use FastConnects
on Critical Paths
 IMUX Receives 96 Connections
from General Routing Matrix (GRM)

Highly Exhaustive Connection
Matrix
 OMUX Equivalent to 8-bit 13x1 Mux



All 8 Outputs Connect to the GRM
2 Outputs Can be Used to Connect
Directly to the Horizontal
Neighbors
All Outputs Can Feed the 2 BUFTs
CLB
Slice
LUT
LUT
Slice
LUT
LUT
Massive Hierarchical Resources
 Routing Needs Based On XCV1000
 Loading of Resources Minimized
While Connectivity Increased
 Both Long Lines & Hexes are
Buffered To Reduce RC Delays


Longs Have Access Every 6 Tiles
Hexes Have Access at Ends &
Middle
 Special Hexes Added to Top and
Bottom to Create High Fanout
Resources with Vertical Long
Lines
 Horizontal Singles Connect
Directly to Vertical Long Lines for
Fast Control Signal Distribution
Increased Matrix Connectivity
 Previous Families Use Planar Pipulation


Allows for Routing Along Same Channel
Restricts Connectivity of Dissimilar
Resources
Planar pipulation
 Virtex Devices Use Non-Planar
Pipulation



Allows for Routing Across Resource
Types
Longs Drive Hexes, Hexes Drive Hexes
and Singles, Singles drive Singles and
CLB IMUXs - Vertical Hexes Drive CLB
Control Inputs As Well
CLB OMUXs Drive All Types
 Switch Matrix Connectivity Determines
Design Routability

Increased Switch Matrix Connectivity
Alleviates Congestion
Non-Planar pipulation
SelectRAM+
SelectRAM+ Hierarchy
 Distributed SelectRAM




Proven Synchronous RAM of the XC4000 Families
16x1 Implemented in a LUT - 4 in Each CLB
32x1 Implemented in a Slice - 2 in Each CLB
Ideal for DSP Applications
 Block SelectRAM



True Dual Port, Fully Synchronous RAM
4096-Bit Block Configurable in Widths From 1 to 16
Ideal for Data Buffers & FIFOs
 Fast Access to External RAM

133MHz Direct Interface to SSTL3, 3.3V Synchronous DRAM
Distributed SelectRAM
 Builds on XC4000 Tradition



Synchronous Write
Asynchronous Read
No Asynchronous Write
LUT
 Use a Single LUT to Create a
RAM16X1S
 Use a Pair of LUTs to Create a
RAM32X1S or RAM16X1D
 RAM16X1D Comes With One
R/W Address & One Read Only
Address
 Accompanying Flip-Flops Can
Be Used to Register Read
Slice
LUT
LUT
RAM16X1S
D
WE
WCLK
A0
O
A1
A2
A3
RAM32X1S
D
WE
WCLK
A0
O
A1
A2
A3
A4
RAM16X1D
D
WE
WCLK
A0
SPO
A1
A2
A3
DPRA0 DPO
DPRA1
DPRA2
DPRA3
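A behavioural Python sketch of the RAM16X1D described above, assuming the pin names shown on the slide (D, WE, A, DPRA, SPO, DPO). It models a synchronous write through the read/write address and asynchronous reads on both addresses; the class is an illustration, not a simulation model supplied by the vendor.

```python
class Ram16x1D:
    """16x1 dual-port distributed RAM: one read/write address, one read-only address."""

    def __init__(self, init=0):
        # init is a 16-bit value; bit i is the initial content of location i.
        self.mem = [(init >> i) & 1 for i in range(16)]

    def write_clock(self, d, we, a):
        """Rising WCLK edge: synchronous write at address A[3:0] when WE is high."""
        if we:
            self.mem[a & 0xF] = d & 1

    def spo(self, a):
        """Asynchronous read at the read/write address A[3:0]."""
        return self.mem[a & 0xF]

    def dpo(self, dpra):
        """Asynchronous read at the read-only address DPRA[3:0]."""
        return self.mem[dpra & 0xF]
```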
Block SelectRAM
 True Dual Port Synchronous RAM


2 R/W Ports with Independent
Controls
Synchronous Read & Write
RAMB4_S#_S#
WEA
ENA
RSTA
CLKA
ADDRA[#:0]
DIA[#:0]
 Block Count Increases With FPGA
Size



 Flexible 4096-Bit Block


Variable Aspect Ratio
Each Port can be a Different Width
 Synchronous Reset & INIT Values

WEB
ENB
RSTB
CLKB
ADDRB[#:0]
DIB[#:0]
8 Blocks in the XCV50 - 32Kb
32 Blocks in the XCV1000 - 128Kb
Located on Left & Right Sides with 1
Block Every 4 Rows
State Machines, Decodes, Etc
 Sub-10ns Cycle Time For All Widths
DOA[#:0]
DOB[#:0]
Allowed Widths
Width   Depth   ADDR     DATA
1       4096    (11:0)   (0:0)
2       2048    (10:0)   (1:0)
4       1024    (9:0)    (3:0)
8       512     (8:0)    (7:0)
16      256     (7:0)    (15:0)
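The relationship between port width, depth and address bus size can be sketched in Python. The sketch assumes only what the slides state: a 4096-bit block, widths of 1, 2, 4, 8 or 16, and the RAMB4_S#_S# naming pattern. The function names are illustrative.

```python
BLOCK_BITS = 4096
ALLOWED_WIDTHS = (1, 2, 4, 8, 16)

def port_geometry(width):
    """Return (depth, address bits) for one port of a 4096-bit block."""
    if width not in ALLOWED_WIDTHS:
        raise ValueError(f"unsupported port width: {width}")
    depth = BLOCK_BITS // width
    addr_bits = depth.bit_length() - 1  # depth is a power of two
    return depth, addr_bits

def primitive_name(width_a, width_b):
    """Build the library name for a dual-port block with the given port widths."""
    port_geometry(width_a)  # validates the width
    port_geometry(width_b)
    return f"RAMB4_S{width_a}_S{width_b}"

print(port_geometry(4))       # (1024, 10), i.e. ADDRA[9:0]
print(primitive_name(4, 16))  # RAMB4_S4_S16
```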
Block SelectRAM
Library Name Specifies Port Configuration
RAMB4_S4_S16
WEA
ENA
Port A In
1K-Bit Depth
RSTA
DOA[3:0]
Port A Out
4-Bit Width
DOB[15:0]
Port B Out
16-Bit Width
CLKA
ADDRA[9:0]
DIA[3:0]
WEB
ENB
Port B In
256-Bit Depth
RSTB
CLKB
ADDRB[7:0]
DIB[15:0]
Each Dual Port can be configured with a different width
Block SelectRAM
 The Dual Ports Access the Same 4096
Bits

4096-Bit Storage When Viewed
by a Port Configured as 1kx4
Nibble
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Combine Blocks For Additional Depth &
Width
 The Depth/Width Ratio Determines How
the Bits are Accessed
 For Example:
A RAMB4_S4_S16 Has a 1kx4 Port & a
256x16 Port
 Provides Easy Data Width Conversion
Without Any Additional Logic

Bit layout seen by the 1kx4 port, DOA[0:3] (figure, first 16 nibbles):
Nibble n (n = 1 to 16) holds Bit(4n-4), Bit(4n-3), Bit(4n-2), Bit(4n-1) on DOA[0], DOA[1], DOA[2], DOA[3]
Nibble 1 = Bit0 to Bit3, Nibble 2 = Bit4 to Bit7, and so on up to Nibble 16 = Bit60 to Bit63
4096-Bit Storage When Viewed by a Port Configured as 256x16
Bit layout seen by the 256x16 port, DOB[0:15] (figure, first 4 words):
Word 1 = Bit0 to Bit15
Word 2 = Bit16 to Bit31
Word 3 = Bit32 to Bit47
Word 4 = Bit48 to Bit63
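The way two ports of different widths see the same storage can be sketched in Python. The sketch assumes the flat bit ordering shown in the figure (address times width plus data bit) and uses 0-based addresses, whereas the figure numbers nibbles and words from 1. The function names are illustrative.

```python
def storage_bit(width, address, data_bit):
    """Flat index of the storage bit seen at (address, data_bit) by a port of this width."""
    return address * width + data_bit

def translate(width_from, address, data_bit, width_to):
    """Map a (address, data_bit) pair on one port to the matching pair on the other port."""
    flat = storage_bit(width_from, address, data_bit)
    return divmod(flat, width_to)

# Port A (1kx4) address 5, DOA[2] is storage Bit22,
# which port B (256x16) sees at address 1, DOB[6].
print(translate(4, 5, 2, 16))  # (1, 6)
```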
Block SelectRAM
RAMB4_S1
0
WE
1
EN
0
RST
Clock
A[31:20]
N/C
CLK
DO
4095
FFFXXXXX
4094
FFEXXXXX
4093
FFDXXXXX
Subdivide 32-Bit
Address Space into
4096 1MB Blocks
Enable
ADDR[11:0]
DI[7:0]
Using a DLL, the Enable is Available Only 5.1ns
After the Rising Edge of the External System Clock
0002
002XXXXX
0001
001XXXXX
0000
000XXXXX
Build State Machines & PROM Based Address Decodes
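The PROM-based address decode above can be sketched in Python, assuming the arrangement on the slide: the upper 12 address bits A[31:20] drive ADDR[11:0] of a 4096x1 block, and the single data output is the enable. The function names and the example address ranges are hypothetical.

```python
BLOCK_SHIFT = 20  # 1 MB blocks: the low 20 address bits do not reach the RAM

def build_decode_rom(enabled_ranges):
    """Return the 4096-entry init table; enabled_ranges holds inclusive (start, end) addresses."""
    rom = [0] * 4096
    for start, end in enabled_ranges:
        for block in range(start >> BLOCK_SHIFT, (end >> BLOCK_SHIFT) + 1):
            rom[block] = 1
    return rom

def chip_enable(rom, address):
    """Look up the enable bit for a 32-bit address, as the block RAM read would."""
    return rom[(address >> BLOCK_SHIFT) & 0xFFF]

# Hypothetical map: one 1 MB window at 002XXXXX and the top 2 MB of the space.
rom = build_decode_rom([(0x00200000, 0x002FFFFF), (0xFFE00000, 0xFFFFFFFF)])
```

Changing the memory map then only means changing the initial contents of the block, not the surrounding logic.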
Clocking & DLLs
General Clock Support
 4 Dedicated Global Low Skew Buffers


Dedicated Input Pin - Intended to Distribute Clocks Only
66 MHz PCI Performance With 500ps Maximum Skew
–
–
3ns TSetup /0ns THold - Input IOB Flip-flop with No Data Delay
6ns TClock2Out - Output IOB Flip-flop
 24 Additional Shared Resources


Intended to Distribute Low Skew/High Fanout Signals
Distribute Control Signals Across the Device under 10ns
–
additional clocks, clock enables, three-state controls & resets
 4 Delay Lock Loops on Each Device


100% Digital Implementation
2 Global Buffers Associated with Each DLL Pair
DLLs Versus PLLs
 Both types are used to remove clock delay & provide
additional clocking functionality


Frequency synthesis, Phase adjustment & clock conditioning
Both can be implemented using either analog or digital logic
CLKIN
Programmable
Delay Line
Control
Logic
CLKOUT
Programmable
Oscillator
Clock
Distribution
CLKIN
CLKFB
DLLs use Programmable Delay Line
in Conjunction with Control Logic
that Selects the Delay to Match the
Distribution
Control
Logic
CLKOUT
Clock
Distribution
CLKFB
PLLs use Programmable Oscillators in
Conjunction with Phase Detectors &
Filters to Phase Adjust the Clock
DLLs Versus PLLs
 The Oscillator Used in a PLL Inherently
Introduces Instability & Phase Error
 The DLL Architecture is Unconditionally Stable
and Does Not Accumulate Phase Error
 It is Generally Accepted that DLLs are Better for
Delay Compensation and Clock Conditioning
 PLLs Typically Have an Advantage When
Performing Frequency Synthesis and Can
Operate Over a Larger Input Clock Frequency
DLL Functions
Virtex
Speedup Tc2o
Zero-Delay Internal Clock Buffer
Clock Phase Synthesis
For Use Internally Or
Externally
Virtex
Clock Multiplication &
Division
For Use Internally Or
Externally
Clock Mirror
Zero-Delay Board Clock
Buffer
DLL Functions
 Speedup Tc2o by Eliminating Clock Distribution Delay
 Generate Phase Shifted Clocks
 Perform Clock Multiplication & Division
 Cleanup Clocks with 50/50 Duty Cycle Correction
 Generate Clock Lock for Internal & External Use

Can Require Configuration to Synchronize with DLL Lock
 DLL Feedback can be Connected Internally or
Externally
 Can be Used to Create Clock Mirrors & Perform
System Synchronization
DLL Tc2o Speedup
Tclock = 0ns
DLL
CLKext
D Q
>
OUT
Tc2q + Tout = Tc2o
CLKint
 Nullify Clock Delay - Fast Tc2o on XCV1000
External CLKext pin and Internal CLKint pin are Aligned
 2.5ns Setup/0.0ns Hold & 3.5ns Tc2o on All Devices

 Optional Duty Cycle Correction

50/50 Duty Cycle Correction Applied when Specified
 Not sensitive to clock input noise - use standard cans
DLL Phase Shift
 Coarse Phase Shifts
Available



0°, 90°, 180°, and 270°
Available for Internal &
External Use
50/50 Duty Cycle
Correction Available
100MHz - 180° Phase Shift
DLL
100 MHz
(0 Phase)
100 MHz
(180° Shift)
DLL Multiplication
16
16
32
Data
Buffer
IO
Internal
Logic
2x
DLL
CLK
x
 Generate 2x & 4x Clocks

Reduce Board EMI and Trace Concerns by Routing Low
Frequency Clocks Externally and Multiplying Internally
 Cross Clock Domains Without Worry


Multiplied & Divided Clocks Have Synchronized Edges
No External Clock Drift & Minimal External Clock Skew Eliminates Metastable Events
DLL Multiplication
 2 DLLs on Top & Bottom


Use 1 DLL on an Edge for
2x Multiplication or Both
for 4x Multiplication
180 MHz Maximum Output
Frequency
66MHz - 2x Clock Multiplication
DLL
66 MHz
132 MHz
(Multiply by 2)
DLL Division
 Selectable Division Values



1.5, 2, 2.5, 3, 4, 5, 8, or 16
50/50 Duty Cycle
Correction Available
Use DLL Pair to Combine
Functions
Input
180
2X
30 MHz - 180° Phase Shift
DV2
DLL
30 MHz
(180° Shift)
30 MHz
30 MHz
Used for FB
30 MHz
(180° Shift)
DLL
15 MHz
(Divide by 2)
60 MHz
(Multiply by 2)
30 MHz 180° Phase Shift - Clock Multiply & Clock Divide
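The output frequencies of a single DLL can be tabulated with a small Python sketch. It assumes the output names from the CLKDLL symbol and the division values listed above; it ignores frequency limits and models frequency only, not phase. The function name is illustrative.

```python
from fractions import Fraction

# Division values listed on the slide: 1.5, 2, 2.5, 3, 4, 5, 8, or 16.
CLKDV_DIVIDE_VALUES = (Fraction(3, 2), 2, Fraction(5, 2), 3, 4, 5, 8, 16)

def dll_outputs(clkin_mhz, clkdv_divide=2):
    """Return the frequency in MHz of each CLKDLL output for a given input clock."""
    if clkdv_divide not in CLKDV_DIVIDE_VALUES:
        raise ValueError(f"unsupported CLKDV divide value: {clkdv_divide}")
    f = Fraction(clkin_mhz)
    return {
        "CLK0": f,
        "CLK90": f,
        "CLK180": f,
        "CLK270": f,
        "CLK2X": 2 * f,
        "CLKDV": f / Fraction(clkdv_divide),
    }

out = dll_outputs(30)
print(out["CLK2X"], out["CLKDV"])  # 60 15
```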
Clock Mirrors
 Generate Clock Mirrors for
Cascaded & Other Devices
 Extremely Low Output
Skew


Rising Edge Skew -20ps*
Falling Edge Skew +40ps*
*Actual Device Measurements
100MHz - 100MHz Clock Mirror
DLL
100 MHz
LVTTL
100 MHz
LVTTL
Feedback from
External Trace
Input
Output
System Synchronization
 Synchronize All Devices

CLK
DLL
DLL
FPGA 1
DLL
FPGA 2



DLL
FPGA 3
Eliminate Clock Skew
Nullify Clock Input & Board
Delay in Addition to Internal
Distribution Delay
Chip to Chip Race
Conditions Removed
Increase Chip to Chip
Interface Speed - 160MHz
DLL
FPGA N
DLL Modes
 Low Frequency



Input Frequency Range - 25 MHz to 100 MHz
Minimum High/Low Time - 2.2 ns
All 6 Outputs Available for use Internally & Externally
–
CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV
 High Frequency



Input Frequency Range - 60 MHz to 200 MHz
Minimum High/Low Time - 2.2 ns
3 Outputs Available for use Internally & Externally
–
CLK0, CLK180 & CLKDV
 Both Modes Supported with Simple Design
Primitives

VHDL & Verilog Simulation Support Available
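The two DLL modes can be captured as a small lookup in Python. The ranges and output lists are taken from the slide above; treating the range limits as inclusive is an assumption, and the function name is illustrative.

```python
DLL_MODES = {
    "low": {
        "range_mhz": (25, 100),
        "outputs": ("CLK0", "CLK90", "CLK180", "CLK270", "CLK2X", "CLKDV"),
    },
    "high": {
        "range_mhz": (60, 200),
        "outputs": ("CLK0", "CLK180", "CLKDV"),
    },
}

def usable_modes(clkin_mhz, needed_outputs=()):
    """List the DLL modes that accept this input frequency and provide the needed outputs."""
    result = []
    for name, mode in DLL_MODES.items():
        low, high = mode["range_mhz"]  # limits assumed inclusive
        if low <= clkin_mhz <= high and set(needed_outputs) <= set(mode["outputs"]):
            result.append(name)
    return result
```

For example, an 80 MHz input fits both modes, but only the low frequency mode provides CLK2X.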
DLL Software Support
 Use BUFGDLL Macro for
Common Clock Usage
BUFGDLL
0ns
 Build Complex Structures
Using CLKDLL Primitive
CLKDLL
CLKIN
CLKFB
RST
Equivalent Structure
CLK0
CLK90
CLK180
CLK270
CLK2X
CLKDV
LOCKED
PAD
BUFG
IBUFG
DLL
FB
To distributed
clock network
SelectI/O
Supply Voltage Migration
Lower cost
Faster speed
Higher density
Lower power
[Chart: Feature Size (µm), axis 0 to 1.2, and supply Voltage stepping 5.0, 3.3, 2.5, 1.8, 1.3, plotted over the years 1990 to 2002, with a "Virtex FPGAs Ship" marker]
Process Technology Migration Leads to Mixed Voltage Systems
Supply Voltage Migration
5V
3.3 V
2.5 V
I/O
Supply
Accepts
5 V levels
Any
5V
device
(XC4000E)
5V
3.3 V
Logic
Supply
Virtex
&
XC4000XV
2.5 V logic
3.3 V I/O
3.3 V
3.3 V
Meets TTL
Levels
 Supply Voltage Sequencing Independent
 Virtex Supports Additional I/O Standards
Any
3.3 V
device
(XC4000XL)
SelectI/O
 Allows Connection & Use of a Wide Variety of
Devices




Processors, Memory, Bus Specific Standards, Mixed Signal...
Provides Industry Standard IEEE/JEDEC I/O Standards
Maximizes Speed/Noise Tradeoff - Use Only What is Needed
Can Connect to or Create High Performance Backplanes
– PCI, GTL+, HSTL
– DIY - Virtex Based Backplane Design in Progress
 Define I/O by Simply Placing Desired Input And/Or
Output Buffers Into the Design


Special IBUF and OBUF Components Provided in Schematic
Based and HDL Based Design Flows
For Example: SSTL3, Class I Output Buffer - OBUF_SSTL3_I
Simplified IOB Structure
 Fast I/O Drivers
 Separate Registers for
Input, Output & Three-State Control


Asynchronous Set or
Reset Available on Each
Flip-flop
Common Clock, Separate
Clock Enables
 Programmable Slew Rate,
Pullup, Input Delay, Etc
 Selectable I/O Standard
Support
 Supported Standards List
can be Updated After
Testing
DFF/LATCH
D
Q
CE
S/R
DFF/LATCH
D
Q
CE
S/R
DFF/LATCH
D
Q
CE
S/R
PAD
How It Works
SelectI/O Output
SelectI/O Input
Configuration Bits
OBUF_SSTL3_I
IBUF_SSTL3_I
SSTL3 Class1
Output Driver
SSTL3 Class1
Input Receiver
How It Works
 Separate I/O & Core Supply
Rails
 Programmable Driver Strength
P & N Drivers Individually
Controlled
 16 Different Settings for Each

 Variable I/O & Vref Voltages
8 Banks on Each Device
 Specific I/Os are Used as
Reference Inputs

 Differential Inputs Supported
nMOS for High Vref
 pMOS for Low Vref

VCCO
Currently Supported Standards
Standard          VCCO   Vref
LVTTL             3.3    na
LVCMOS2           2.5    na
PCI 33MHz 3.3V    3.3    na
PCI 33MHz 5.0V    3.3    na
PCI 66MHz 3.3V    3.3    na
GTL               na     0.80
GTL+              na     1.00
HSTL-I            1.5    0.75
HSTL-III          1.5    0.90
SSTL3-I           3.3    0.90
SSTL3-II          3.3    1.50
SSTL2-I           2.5    1.10
CTT               3.3    1.50
AGP               3.3    1.32
Application (in slide order): General Purpose, PCI, Back-Plane, Hitachi SRAM, SDRAM, Memory, Graphics
I/O Performance
Virtex Chip-Chip I/O Performance
[Bar chart: maximum chip to chip I/O frequency, axis 0 to 200, by I/O Standard: SSTL3, AGP, HSTL IV, PCI-3.3V, LVCMOS2.5V, TTL-Fast 24mA, TTL-Fast 12mA, TTL-Slow 12mA, TTL-Slow 2mA]
Maximum Chip to Chip I/O Frequency = 1/(Tsetup + Tc2o)*
*DLLs Used to Eliminate Clock Distribution Delay
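The formula above can be evaluated with a short Python sketch. The example figures are the ones quoted elsewhere in this presentation (2.5ns setup and 3.5ns Tc2o with a DLL; 3ns setup and 6ns clock-to-out for the dedicated global buffers). Board trace delay is not part of the slide's formula and is ignored here; the function name is illustrative.

```python
def max_chip_to_chip_mhz(t_setup_ns, t_c2o_ns):
    """Maximum chip to chip I/O frequency in MHz = 1 / (Tsetup + Tc2o)."""
    return 1000.0 / (t_setup_ns + t_c2o_ns)

# With DLL: 2.5ns + 3.5ns = 6ns, about 166.7 MHz.
print(round(max_chip_to_chip_mhz(2.5, 3.5), 1))
# Global buffer without DLL: 3ns + 6ns = 9ns, about 111.1 MHz.
print(round(max_chip_to_chip_mhz(3.0, 6.0), 1))
```

The first result is consistent with the 160 MHz chip to chip figure quoted in the presentation.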
SelectI/O Banks
BANK 1
BANK 5
BANK 4
BANK 3
BANK 6
BANK 2
BANK 7
BANK 0
SelectI/O Banks
 Each Device is Broken into 8 Banks Regardless of Size

2 Banks on Each Side of the Device
 Each Bank has Voltage Sources Shared Among
Associated I/Os in that Bank

All I/O Requiring a Voltage Source Must be of the Same Type
 Input Banking - Vref
I/O Standards Which use a Differential Amplifier Require a
Voltage Reference Input
 All Fixed Location/Dual Purpose Vref Inputs in a Bank Must be
Used When Supplying a Voltage Reference

 Output Banking - Vcco

Dedicated Pins provide drive source voltage for output pins
SelectI/O Input Banks
 1 Voltage Reference can be Supplied in a Bank
 Any input not requiring a Vref can be placed in Bank
 Flexible Use of Voltage Reference Inputs
Pins Can be Used as General Purpose I/O If a Voltage
Reference is Not Needed - All Must be Used to Supply a
Voltage Reference
 Locations are Fixed for Each Device/Package Combination

 Any Single Output Buffer Type Can be Placed in the
Bank

Multiple Output Buffer Types Must Adhere to Output Bank
Rules
 OBUFTs with Keeper Circuits Requiring a Voltage
Reference are Treated as IOBUFs
SelectI/O Output Banks
 Only One Vcc Output is Supplied to Each Bank
 Any Output Not Requiring Use of the Vcc Output
can be Placed in the Bank
 Any Single Input Buffer Type Can be Placed in the
Bank

Multiple Input Buffer Types Must Adhere to Input Bank
Rules
 Special Consideration Must be Given to
Configuration I/O



Configuration I/O is Located on the Right Side of the
Device
Serial PROM Downloads Require Vcco Set to 3.3V In
Banks 2 & 3
Non-PROM Serial Downloads will generate warning
(Even though Vcco Connection dependent on data source)
Thermal Management
Thermal Challenge
 Today’s FPGA Density is
Absorbing Large
Percentages of Board
Designs
Ambient
Temp
Data
 Because of its Highly
Demands
Dynamic Nature, Power
Can Only be Estimated
Before Design Completion
 Even as Voltages
Decrease, Power
Consumption is a Major
Concern
 How do I Know My Die
Temp is Within Spec?
Heat
Sinking
Vcc
Tolerance
Virtex XCV1000
75M Transistors*
100+ MHz
Advanced Signal
Processing Apps
20W+ Power
Dissipation
* Pentium II = 7.5 Million Transistors
Thermal Solution
Maxim MAX1617
2-Pin SMBUS
Serial Interface
Interrupt
SMBCLK
SMBDATA
DXP
DXN
Virtex
DXP
DXN
ALERT*
 Remote Die Sensor



Specially Designed to be Used With the Maxim
MAX1617
Simple 2-Pin Interface with no Calibration Required
Provides Two Channels
–
–


FPGA Die Temp Reported from -40°C to +125°C at +/- 3°C
Maxim Die Temp also at +/- 3°C
Programmable Over-Temp & Under-Temp Alarms
Same Technology as Pentium II
 System Management is Now Possible
SelectMAP
Advanced Configuration
Master/Slave Serial
JTAG
SelectMAP
Simple Serial Interface
System Integrated Serial
Virtex
High Performance Parallel
 Simplified Configuration Mode Set
 50 Megabyte/Second Download Rate Using
SelectMAP
 Dedicated JTAG Port - No Contention Issues
 No Master Parallel Support
 Direct, JTAG & SelectMAP Device Readback
Software & Cores Support
HDL Design Entry Focus
 Synthesis Support is Critical for Large Designs






Architecture Decisions Made Based on Synthesis Tool Tendencies
Xilinx Relationships With Synthesis Vendors Initiated Direct 4-Input
LUT & Carry Chain Synthesis - The Building Blocks of XL & Virtex
Xilinx Will Continue to Drive Synthesis Vendors to Support Virtex
Specific Features - Block SelectRAM, SelectShift & CLKDLLs
Virtex Architecture Adds Additional Resources That Synthesis
Vendors Easily Synthesize To Today
Implementation Software Written With Synthesis Tool Flow Focus
All Three Major Synthesis Vendors Supported Virtex for Beta
 Large Designs Also Require Team Based Design

Must be able to Support Multiple Designers on the Same Device as
Well as Core Integration
Implementation Software
 Virtex Software is built on proven M1 technology




Builds on Robust Integration with Third Party Design Entry Tools
Emphasizes Constraint Driven Design Philosophy
Vector Based Interconnect Yields More Predictable Routing
Results
Predictable Results Allow the Placement Algorithms to Make
Better Routing Estimations in Much Less Time
 Architecture fully software tested before 1st silicon



Virtex Implementation Software Was Available 18 Months
Before Actual Silicon was Produced
Used Proven Place & Route Software as a Gauge of the
Architecture’s Ability to Meet Density & Performance
Needs
Early Software Allowed for Changes to be Made in the
Finalization of the Architecture - Necessary Routing Mix,
Special Features, etc
A System Level Solution
2
System
Integration
1
4
System Memory
3
System
Timing
System Interfaces
Virtex is a True System Level Solution
A System Level Solution
 Virtex Opens New System Level Applications to
FPGAs
1


Extremely Dense - 50,000 to 1,000,000 System Gates
Flexible Architecture
–
–

Vector Based Interconnect
–
–

Efficient for Random Logic, Memory, DSP & Data Path Circuits
Automatically Implemented by Today’s Leading Synthesis Vendors
Much More Predictable Before Place & Route
Enhances Synthesis Based Flows
Excellent Platform for Core Integration
–
Software Based on Proven M1 Timing Driven Place & Route
 Hierarchical Memory Support
2

SelectRAM+ Can be Used to Create Bytes or KBytes of
Internal Storage and Access MBytes of Fast External
Memory
A System Level Solution
 System Speedup & Synchronization
3



Nullify Clock Distribution Delays - 160 MHz System
Performance
Synthesize Clocks for Internal and External Use
Synchronize Systems - Create Clock Mirrors & Nullify
Board Delay
 Flexible System Interface
4



Controllable Current, Input Vref and Vcco
Characteristics
Connect Directly to Existing & Emerging I/O Standards
SelectMAP Protocol Allows for Easy Interfacing to
µControllers and µProcessors
–
–
–
400+ Mb/sec Configuration, Verify & Debug Using a Simple 8-Bit
Interface
SelectMAP Port Can Remain on After Configuration
JTAG Can Also be Used to Configure
