System Integration Workshop

Introduction to C6000
Chapter 1
C6000 Integration Workshop
T TO
Technical Training
Organization
Copyright © 2005 Texas Instruments. All rights reserved.
What Problem Are We Trying To Solve?
Digital sampling of an analog signal:

x -> ADC -> DSP -> DAC -> Y

Most DSP algorithms can be expressed with MAC:

$$Y = \sum_{i=1}^{\text{count}} \text{coeff}_i \cdot x_i$$
for (i = 0; i < count; i++) {   /* all count terms; C arrays are 0-based */
    Y += coeff[i] * x[i];
}
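The loop above can be wrapped into a complete function. The following is a minimal sketch, not taken from the workshop: the name sum_of_products, the const qualifiers, and the 32-bit accumulator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: Y = coeff[0]*x[0] + ... + coeff[count-1]*x[count-1].
 * A wider accumulator than the 16-bit inputs is an illustrative choice. */
int sum_of_products(const short *coeff, const short *x, int count)
{
    int i;
    int y = 0;

    assert(coeff != NULL && x != NULL && count >= 0);

    for (i = 0; i < count; i++) {
        y += coeff[i] * x[i];   /* one multiply-accumulate (MAC) per element */
    }
    return y;
}
```

For example, with coeff = {1, 2, 3, 4} and x = {5, 6, 7, 8} the function returns 70.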
'C6000 CPU Architecture
[Diagram: Memory; register files A0 .. A15 .. A31 and B0 .. B15 .. B31; functional units .D1, .D2, .S1, .S2, .M1, .M2 (Dual MACs), .L1, .L2; Controller/Decoder]

- 'C6000 Compiler excels at Natural C
- While the dual MACs speed up math-intensive algorithms, the flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
- All 'C6000 instructions are conditional, allowing efficient hardware pipelining
- 'C6000 CPU can dispatch up to eight parallel instructions each cycle
Given this simple loop …
$$y = \sum_{n=1}^{40} a_n \cdot x_n$$

short mac(short *m, short *n, int count) {
  for (i=0; i < count; i++) {
    sum += m[i] * n[i]; } …

[Diagram: registers a, x, cnt, prod, y and pointers *ap, *xp, *yp; functional units .S1, .M1, .L1, .D1]

        MVK   .S1   40, cnt
loop:   LDH   .D1   *ap++, a
        LDH   .D1   *xp++, x
        MPY   .M1   a, x, prod
        ADD   .L1   y, prod, y
        SUB   .L1   cnt, 1, cnt
 [cnt]  B     .S1   loop
        STW   .D    y, *yp
How many of these instructions can we get in parallel?
C62x Intense Parallelism
short mac(short *m, short *n, int count) {
  for (i=0; i < count; i++) {
    sum += m[i] * n[i]; } …

Given this C code, the C62x compiler can achieve two Sum-of-Products per cycle.

L2:     ; PIPED LOOP PROLOG
           LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

           LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

   [B0]    B     .S1   L3
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

   [B0]    B     .S1   L3
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

   [B0]    B     .S1   L3
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

           MPY   .M2   B7,A3,B4
||         MPYH  .M1   B7,A3,A5
|| [B0]    B     .S1   L3
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7

           MPY   .M2   B7,A3,B4
||         MPYH  .M1   B7,A3,A5
|| [B0]    B     .S1   L3
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7
;** -----------------------*
L3:     ; PIPED LOOP KERNEL
           ADD   .L2   B4,B5,B5
||         ADD   .L1   A5,A0,A0
||         MPY   .M2   B7,A3,B4
||         MPYH  .M1   B7,A3,A5
|| [B0]    B     .S1   L3
|| [B0]    SUB   .S2   B0,1,B0
||         LDW   .D1   *A4++,A3
||         LDW   .D2   *B6++,B7
;** -----------------------*
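The kernel gets two products per cycle because each LDW fetches two 16-bit elements at once, MPY multiplies the low halves, MPYH multiplies the high halves, and two separate ADDs keep two running sums. The following portable C sketch mimics that data flow. It is an illustration only: the function name mac_two_per_step is hypothetical, count is assumed to be even, and the narrowing casts assume the usual two's-complement wraparound.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the LDW + MPY/MPYH pattern in portable C. */
int32_t mac_two_per_step(const int16_t *m, const int16_t *n, int count)
{
    int32_t sum_lo = 0;   /* running sum fed by the low-half products (MPY)   */
    int32_t sum_hi = 0;   /* running sum fed by the high-half products (MPYH) */
    int i;

    assert(count >= 0 && count % 2 == 0);

    for (i = 0; i < count; i += 2) {
        uint32_t mw, nw;

        /* Like LDW: one 32-bit load brings in two 16-bit elements. */
        memcpy(&mw, &m[i], sizeof mw);
        memcpy(&nw, &n[i], sizeof nw);

        /* Low halves multiplied together, high halves multiplied together. */
        sum_lo += (int16_t)(mw & 0xFFFFu) * (int16_t)(nw & 0xFFFFu);
        sum_hi += (int16_t)(mw >> 16) * (int16_t)(nw >> 16);
    }
    return sum_lo + sum_hi;   /* combine the two partial sums once at the end */
}
```

Which element lands in the low half depends on byte order, but both words are split the same way, so the element pairs still match and the total is unchanged.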
What about the ‘C67x?
C67x MAC using Natural C
The C67x compiler gets two 32-bit floating-point Sum-of-Products per iteration.

[Diagram: Memory; register files A0 .. A15 and B0 .. B15; functional units .D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2; Controller/Decoder]

float mac(float *m, float *n, int count)
{ int i; float sum = 0;
  for (i=0; i < count; i++) {
    sum += m[i] * n[i]; } …

;** --------------------------------------------------*
LOOP:   ; PIPED LOOP KERNEL
           LDDW   .D1    *A4++,A7:A6
||         LDDW   .D2    *B4++,B7:B6
||         MPYSP  .M1X   A6,B6,A5
||         MPYSP  .M2X   A7,B7,B5
||         ADDSP  .L1    A5,A8,A8
||         ADDSP  .L2    B5,B8,B8
|| [A1]    B      .S2    LOOP
|| [A1]    SUB    .S1    A1,1,A1
;** --------------------------------------------------*
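In this kernel each LDDW brings in two floats, the two MPYSP instructions multiply one pair each, and the two ADDSP instructions accumulate into separate registers (A8 and B8). A rough C equivalent of that structure is sketched below. The function name mac_float_pairs is hypothetical and count is assumed to be even.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: two independent float accumulators, one per element
 * of each loaded pair, combined after the loop. */
float mac_float_pairs(const float *m, const float *n, int count)
{
    float sum_even = 0.0f;   /* accumulates m[0]*n[0], m[2]*n[2], ... */
    float sum_odd  = 0.0f;   /* accumulates m[1]*n[1], m[3]*n[3], ... */
    int i;

    assert(m != NULL && n != NULL && count >= 0 && count % 2 == 0);

    for (i = 0; i < count; i += 2) {
        sum_even += m[i]     * n[i];
        sum_odd  += m[i + 1] * n[i + 1];
    }
    return sum_even + sum_odd;
}
```

Splitting the sum changes the order of the floating-point additions, so the result can differ in the last bits from a strictly sequential loop.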
Can the 'C64x do better?
C64x gets four MACs using DOTP2

short mac(short *m, short *n, int count)
{ int i; short sum = 0;
  for (i=0; i < count; i++) {
    sum += m[i] * n[i]; } …

DOTP2:

    A5:  m1 | m0
  x B5:  n1 | n0
  = A6:  m1*n1 + m0*n0
  + A7:  running sum

;** --------------------------------------------------*
LOOP:   ; PIPED LOOP KERNEL
           ADD    .L2    B8,B6,B6
||         ADD    .L1    A6,A7,A7
||         DOTP2  .M2X   B4,A4,B8
||         DOTP2  .M1X   B5,A5,A6
|| [B0]    B      .S1    LOOP
|| [B0]    SUB    .S2    B0,1,B0
||         LDDW   .D2T2  *B7++,B5:B4
||         LDDW   .D1T1  *A3++,A5:A4
;** --------------------------------------------------*
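DOTP2 multiplies the two 16-bit halves of one register by the matching halves of another and adds the two products, so two DOTP2 instructions per cycle give four MACs. The sketch below emulates that behavior in portable C to show the data flow. The names pack2, dotp2_emul, and mac_dotp2 are hypothetical stand-ins, not TI library functions; count is assumed to be a multiple of 4, and overflow handling is not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit values into one 32-bit word. */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Hypothetical stand-in for DOTP2: hi(a)*hi(b) + lo(a)*lo(b), signed halves. */
int32_t dotp2_emul(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);

    return (int32_t)a_hi * b_hi + (int32_t)a_lo * b_lo;
}

/* Four MACs per loop step: two emulated DOTP2 operations, two running sums. */
int32_t mac_dotp2(const int16_t *m, const int16_t *n, int count)
{
    int32_t sum_a = 0;
    int32_t sum_b = 0;
    int i;

    assert(count >= 0 && count % 4 == 0);

    for (i = 0; i < count; i += 4) {
        sum_a += dotp2_emul(pack2(m[i + 1], m[i]),     pack2(n[i + 1], n[i]));
        sum_b += dotp2_emul(pack2(m[i + 3], m[i + 2]), pack2(n[i + 3], n[i + 2]));
    }
    return sum_a + sum_b;
}
```

On the real device the packing comes for free from the LDDW loads; here pack2 only makes the register layout explicit.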
How many multiplies can the ‘C6x perform?
MMACs

How many 16-bit MMACs (millions of MACs per second) can the 'C6201 perform?
400 MMACs (two .M units x 200 MHz)

How about 16x16 MMACs on the 'C64x devices?
  2 .M units
x 2 16-bit MACs (per .M unit / per cycle)
x 1 GHz
---------------
4000 MMACs

How many 8-bit MMACs on the 'C64x?
8000 MMACs (on 8-bit data)
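The figures above all follow the same arithmetic: multiplier units times MACs per unit per cycle times clock rate in MHz. A small sketch of that calculation follows. The function name peak_mmacs is hypothetical, and the value of four 8-bit MACs per .M unit per cycle is inferred from the 8000 MMACs figure rather than stated on the slide.

```c
#include <assert.h>

/* Hypothetical helper: peak MMACs = .M units x MACs per unit per cycle x MHz. */
long peak_mmacs(int m_units, int macs_per_unit_per_cycle, int clock_mhz)
{
    assert(m_units > 0 && macs_per_unit_per_cycle > 0 && clock_mhz > 0);
    return (long)m_units * macs_per_unit_per_cycle * clock_mhz;
}
```

With these assumptions, peak_mmacs(2, 1, 200) gives 400 for the 'C6201, peak_mmacs(2, 2, 1000) gives 4000 for 16-bit data on a 1 GHz 'C64x, and peak_mmacs(2, 4, 1000) gives 8000 for 8-bit data.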
C6415 DSP (1 GHz)
[Block diagram: C64x CPU core (5760 MIPS) with L1P cache, L1D cache, L2 memory, Enhanced DMA Controller (64 channels), EMIF 64, EMIF 16, McBSP 0, McBSP 1 or Utopia 2, McBSP 2, HPI32, Timer 0, Timer 1, Timer 2, JTAG RTDX, PLL, and Power Down Logic. Bandwidth labels in the figure: 12.5 MB/s (three times), 100 MB/s, 133 MB/s, 266 MB/s, 1064 MB/s, 2.9 GB/s, 16 GB/s (twice), 32 GB/s (twice).]
How does the DSP fit into a system?
Example C6000 System
[Block diagram: blocks around the C6000 CPU: Timer/Counters; GPIO (switches, lamps, latches, FPGA, etc.); Reset, NMI, and Ext Interrupts (HWI); PLL (Clockin, Clockout, Clockoutx); VCP, TCP; Utopia 2 (ATM); McASP (audio codec); McBSP (serial codec); PCI or HPI, 16 or 32 bits (host µP); EDMA; Boot Loader; EMIF, 16, 32, or 64 bits (EPROM, SDRAM, Sync SRAM); EMAC (Ethernet, TCP/IP stack avail); Video Ports (DM64x)]
Note: Not all ‘C6000 devices have all the various peripherals shown above.
Please refer to the C6000 Product Update for a device-by-device listing.
C6416T DSK
Diagnostic Utility included with DSK ...
DSK’s Diagnostic Utility
- Test/Diagnose DSK hardware
- Verify USB emulation link
- Use Advanced tests to facilitate debugging
- Reset DSK hardware
DSK Contents ...
DSK Contents (i.e. what you get…)
Documentation
- DSK Technical Reference
- eXpressDSP for Dummies

Software
- Code Composer Studio
- SD Diagnostic Utility
- Example Programs

Hardware
- 1 GHz C6416T DSP or 225 MHz C6713 DSP
- TI 24-bit A/D Converter (AIC23)

External Memory
- 8 or 16MB SDRAM
- Flash ROM: 512KB (C6416) or 256KB (C6713)

MISC Hardware
- LEDs and DIPs
- Daughter card expansion
- 1 or 2 additional expansions
- Power Supply & USB Cable
Lab 1
Hardware
1. Hook up the DSK
2. Supply power and observe POST

Software
1. Run Diagnostic Utility
2. Run CCS Setup
3. Start CCS
4. Configure CCS Options
5. Close CCS

Time: 20 minutes