AverageASM_Part4

Averaging Filter
Comparing performance of
C++ and ‘our’ ASM
Example of program development
on SHARC using C++ and assembly
Planned for Tuesday 7th October Afternoon
Practical examples handled in Lab 1
1
Demo (uTTCOS) and Test (E-UNIT) configurations
• Demo (uTTCOS) configuration
– True A/D → DMA CHANNEL → true leftChannel_In → YOUR SOFTWARE (Audio ISR with Filter) → true leftChannel_Out → DMA CHANNEL → True D/A
• Test (E-UNIT) configuration
– Test InAudio array → mock leftChannel_In → YOUR SOFTWARE (Filter) → mock leftChannel_Out → Test OutAudio array
– Mock TransmitA2D and MOCK ReceiveD2A stand in for the audio devices
• Test
– Set up InAudio[ ]
– Set up Expected[ ]
– In Loop { Call Filter to produce OutAudio[ ] }
– Compare Expected[ ] and OutAudio[ ]
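The Test configuration above can be sketched in C++ roughly as follows. This is a minimal illustration, not the course's E-UNIT code: the channel variables are plain globals, `Filter()` is a placeholder copy, and `RunFilterTest` is a hypothetical helper name.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical mock channel variables. In the Demo configuration the true
// versions would be filled and drained by the DMA channels.
static float leftChannel_In = 0.0f;
static float leftChannel_Out = 0.0f;

// Placeholder for YOUR SOFTWARE: a straight copy, present only so that the
// harness has something to call.
void Filter() {
    leftChannel_Out = leftChannel_In;
}

// Push InAudio[] through Filter() one sample at a time, collect OutAudio[],
// then compare OutAudio[] against Expected[] within a tolerance.
bool RunFilterTest(const float* InAudio, const float* Expected,
                   float* OutAudio, std::size_t count, float tolerance) {
    for (std::size_t i = 0; i < count; ++i) {
        leftChannel_In = InAudio[i];    // stands in for the A/D side
        Filter();
        OutAudio[i] = leftChannel_Out;  // stands in for the D/A side
    }
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fabs(OutAudio[i] - Expected[i]) > tolerance) {
            return false;
        }
    }
    return true;
}
```

Swapping the body of `Filter()` for the real algorithm leaves the harness unchanged, which is the point of keeping the mock channels separate from the code under test.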
2
Mock Device Registers “satisfy linker”
CCES says “inconsistent” definition
• Poor mock – we move values in Audio Device registers by hand
• Can we “MOCK” Receive_ADC_Samples?
– Typical industrial testing approach needed when hardware is “NOT-YET-DEVELOPED”
3
Better Simulation
• What if the algorithm is “by mistake” still doing Left_Out = Left_In (copy)? Then we would get the same answer
• Currently “LeftChannel_In1” is a fixed constant, making it difficult for us to check whether our algorithm would work for more complex signals
• So we could start testing the algorithm's validity (not its speed) by changing LeftChannel_In1 through mocking the “ReceiveA2D( )” and “TransmitD2A( )” audio devices
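A small sketch of why a constant input is a weak test signal: with every sample equal, a copy and an average return the same value, so the defect is invisible. The function names and N = 4 are chosen here for illustration only.

```cpp
#include <cstddef>

const std::size_t N = 4;  // small filter length, chosen for illustration

// Defective "filter": just copies the newest sample.
float CopyFilter(const float* samples) {
    return samples[N - 1];
}

// Intended filter: average of the last N samples.
float AverageFilter(const float* samples) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        sum += samples[i];
    }
    return sum / static_cast<float>(N);
}
```

With the samples {2, 2, 2, 2} both functions return 2; with the ramp {1, 2, 3, 4} the copy returns 4 while the average returns 2.5, so only the varying signal separates the two.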
4
Using ‘MockDevice.c’ loads (RECAP)
What do we do about ‘Receive_ADC_Samples( )’?
• These ‘mock’ routines satisfy a linker requirement for a function we don’t use. When they need to become more detailed, worry about it then (WAIL).
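A minimal sketch of such a link-satisfying mock follows. The `void (void)` signature is an assumption, since the real prototype in the course library is not shown here, and the call counter is an optional extra added for illustration.

```cpp
// Counts calls so that a test can notice if the "unused" routine is used.
int Receive_ADC_Samples_calls = 0;

// Hypothetical signature; the real prototype in the course library may differ.
// The body touches no device registers: it exists so the linker can resolve
// the symbol.
extern "C" void Receive_ADC_Samples(void) {
    ++Receive_ADC_Samples_calls;
}
```

If the counter is ever non-zero after a test run, the function is no longer unused and the mock probably needs more detail.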
5
Mocked device inside Assign1Library
Can be used during Lab 1 -- 4
MADE PRIVATE (FIXED) – GOOD OR BAD IDEA?
VARIETY OF ALGORITHMS TESTED
6
Use GUI to add new test group for
Averaging code – 3 styles of tests (RECAP)
7
Testing
• Test that it works
• Test that it meets real time performance
– Measure ms / Sample for 1 channel = Time-1CH
– Require 20 ms > 8 * Time-1CH
• Move code onto Resource chart
– Determine theoretical best time if all optimizations could be found
• Test to determine real cycle count: Cycles / Tap / Sample
• Examine CPP .lst file (.i or .is) or your ASM file to determine expected cycle count
– Work out why there is a difference between theory and real
– Looking for accuracy of better than 1 cycle in 1000
– Assume 1 cycle per instruction except jumps, memory accesses and movement of I registers to memory, or any other delay we find common
• Be able to move the theoretical calculation to other processor architectures (timings) for MidTerm 1 on Thursday 23rd Oct
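The real-time requirement in the list above (budget greater than 8 times the one-channel time) can be written as a one-line check. `MeetsRealTime` is a hypothetical helper; the budget and the per-channel time must be given in the same unit.

```cpp
// Returns true when processing `channels` channels, each taking
// `timePerChannel`, fits strictly inside `budget` (same time unit for both).
bool MeetsRealTime(double budget, int channels, double timePerChannel) {
    return budget > channels * timePerChannel;
}
```

With a budget of 20 and 8 channels, a per-channel time of 2.0 passes (16 is less than 20), while 2.5 fails because the requirement is a strict inequality.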
8
Theoretical Analysis
• We expect our theoretical analysis to be as fast as or faster than what the C++ optimized code takes
• We are not using any C++ DSP extensions, so expect efficient rather than optimized code
• Is 816 cycles per sample processed by Average
Filter the speed we would expect based on our
understanding of the processor architecture?
9
Expectations
• First instruction after a jump takes 3 cycles to
finish executing
• After that 1 instruction, all things being equal,
takes 1 cycle
• 1 cycle for a read, write, add, multiply
• D? cycles for a division
10
Averaging Filter with Loop
Theoretical Analysis
• Fetch N values from memory -- N cycles
• Perform N add operations -- N cycles
• Go round the sum for-loop -- N * FLC cycles
– Where FLC is # instructions to handle For-Loop-Control – includes all overheads of jumping during for-loop
• Exit for loop (done once) -- EFL cycles
• Do division -- D cycles
• Return a value from function -- RV cycles
• Enter and exit Average routine -- EER cycles
AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER cycles
VERY BIG DEFECT IN ANALYSIS FOUND LATER
ACTUAL THEORETICAL TIME IS TWICE AS LARGE AS THIS
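The formula as stated on this slide (that is, before the correction noted above) can be evaluated with a small helper. The struct and function names are hypothetical, and any cost values passed in are placeholders to be replaced by numbers read from the listing file.

```cpp
// Per-event cycle costs, named as on the slide.
struct LoopCosts {
    int FLC;  // for-loop control, per pass
    int EFL;  // exit for-loop, once
    int D;    // division
    int RV;   // return a value from the function
    int EER;  // enter and exit the Average routine
};

// AVERAGE_FILTER_TIME = N(1 + 1 + FLC) + EFL + D + RV + EER
// One cycle per fetch and one per add, as the slide assumes.
int AverageFilterTime(int N, const LoopCosts& c) {
    return N * (1 + 1 + c.FLC) + c.EFL + c.D + c.RV + c.EER;
}
```

For N = 64 the loop term alone is 448 cycles when FLC = 5 and 704 cycles when FLC = 9, which shows how strongly the estimate depends on the loop-control cost.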
11
Modify tests so can handle both CPP
and ASM versions (Cut-and-paste)
• Not the timing that’s the problem at this moment
• It’s ‘does the ASM and CPP code work’ at all!
12
Check what function needs developing
• Fix compiler error with prototype in ‘Assign1.h’
• Linker error message says ‘wrong prototype’ (NM)
13
Check to see if can run the Tests that
call ASM code without crashing
C++ prototype
extern "C" void Function(void);
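A sketch of how the prototype above is used from C++. The definition shown is a host-side stand-in added only so the example links without the SHARC assembler; the counter is likewise illustrative.

```cpp
// Number of times the stand-in below has been called (illustration only).
int functionCalls = 0;

// C linkage: the C++ compiler refers to Function by its C-style symbol
// rather than a mangled C++ name, so a hand-written assembly label can
// match it.
extern "C" void Function(void);

// Host-side stand-in so this sketch links without the SHARC assembler.
// On the target, the body would live in the .asm file instead.
extern "C" void Function(void) {
    ++functionCalls;
}
```

A first test that only calls `Function()` and returns is enough to show the link and the call/return path work before any real assembly is written.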
14
Getting the same constants in an include
file working in both CPP and ASM
• Use this type of syntax in ‘Assign1.h’
– Conditional code generation
• And in assembly code files
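One common way to share constants is sketched below, assuming the assembler's preprocessor understands `#define` and `#if defined(...)` and does not define `__cplusplus`. The include-guard name, `FILTER_LENGTH`, and both function names are hypothetical, not taken from the actual ‘Assign1.h’.

```cpp
// Assign1.h (sketch)
#ifndef ASSIGN1_H
#define ASSIGN1_H

// Plain #define constants can be read by both the C++ preprocessor and the
// assembler's preprocessor, so both languages see the same value.
#define FILTER_LENGTH 4

#if defined(__cplusplus)
// C++-only section: skipped when the file is included from assembly,
// because __cplusplus is not defined there.
extern "C" float AverageFilterASM(float newValue);
float AverageFilterCPP(float newValue);
#endif

#endif  // ASSIGN1_H
```

Keeping the constant as a `#define` rather than a `const int` is what lets the assembly file use it, since the assembler cannot parse C++ declarations.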
15
Initial testing done with small N
N = 4 (as can work out expected result)
• Write the test
– C++ code expected to pass
– 3.3 is EXACTLY (N – 1) / N of 4.4 when N is 4
16
Look for ‘one out error’ in loops
Common DSP mistake
• Remember to fix error in ASM ‘pseudo code’
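A sketch of the kind of loop-bound defect meant here, with hypothetical function names. The defective version stops one element early; it is written that way deliberately so the difference can be tested.

```cpp
// Correct: visits indices 0 .. N-1, that is, all N elements.
float SumAll(const float* data, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += data[i];
    }
    return sum;
}

// Defective on purpose: the bound N - 1 skips the last element.
float SumOneShort(const float* data, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N - 1; ++i) {
        sum += data[i];
    }
    return sum;
}
```

A test signal whose last element is non-zero exposes the defect: for {1, 2, 3, 4} the correct sum is 10 and the short loop gives 6.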
17
Initial testing done with small N
N = 8 (as can work out expected result)
• Write the test
– C++ code expected to pass
– Asm code MUST fail test – otherwise test is poor
– Must fail as there is no ASM code to allow pass to
occur. This is the TEST of the TEST
Now have 4 tests passing rather than 3, including ASM test
INDICATES BAD TEST – WHY?
18
Improved test. Don’t allow ‘old
correct value’ in output from C++ test
Defect might have been identified by reversing test order
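The defect can be reproduced in a small sketch, assuming both versions write to a shared output location and the assembly routine is still an empty stub. All names here are hypothetical.

```cpp
#include <cmath>

static float outputValue = 0.0f;  // hypothetical shared output location

// Working C++ version: writes the average to the shared output.
void AverageCPP(const float* data, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += data[i];
    }
    outputValue = sum / N;
}

// Stand-in for the not-yet-written assembly routine: does nothing.
void AverageASM_Stub(const float*, int) {}

bool OutputMatches(float expected) {
    return std::fabs(outputValue - expected) < 1e-5f;
}

// Poor test: the earlier C++ call leaves the correct answer behind, so the
// empty stub appears to pass.
bool PoorAsmTest(const float* data, int N, float expected) {
    AverageCPP(data, N);
    AverageASM_Stub(data, N);
    return OutputMatches(expected);
}

// Improved test: overwrite the old correct value before calling the stub.
bool ImprovedAsmTest(const float* data, int N, float expected) {
    AverageCPP(data, N);
    outputValue = -12345.0f;  // known-wrong value
    AverageASM_Stub(data, N);
    return OutputMatches(expected);
}
```

With the stub in place the poor test reports a pass and the improved test reports a failure, which is the behaviour a ‘test of the test’ should show.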
19
What registers can we use in assembly?
• Don’t use without performing save immediately and later recover operations
– Otherwise C and C++ will crash
• These okay to use in assembly
20
Here’s the full software loop structure
Note the formatting for easy code review (Required)
Each time around loop – 9 cycles for control, not the 5 we thought
21
dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register (volatile)
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
• ADD IS NOT performed in parallel with other operations?
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2
• DO I4 = I4 + 2 AFTER USING I4 (ADD in parallel with other operations?)
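The two addressing modes have close C++ pointer analogues, sketched below. This is an analogy only: the function names are hypothetical, the pointer plays the role of I4, and offsets are counted in array elements.

```cpp
// Analogue of dm(2, I4): read at (I4 + offset), I4 itself is unchanged.
float PreModifyRead(const float* I4, int offset) {
    return *(I4 + offset);
}

// Analogue of dm(I4, 2): read at I4, then advance I4 by `modify`.
// The pointer is passed by reference so that the caller sees the update.
float PostModifyRead(const float*& I4, int modify) {
    float value = *I4;
    I4 += modify;
    return value;
}
```

For memory holding {10, 11, 12, 13} and a pointer at the start, the pre-modify read with offset 2 returns 12 and leaves the pointer alone; the post-modify read with modify 2 returns 10 and leaves the pointer on the 12.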
22
23
Other bits of code needed
24
Add assembly language ‘externs’ to
‘Assign1.h’
• Still have not coded
the division – fake it by hard-coding * 1/4
• Must be an easier way to code memory
– Yes – use post increment operation using pointers
and not using array indexing
25
Code fails -- Most likely place to look for defects is in loop operations
Forgot to set loopCounter = 0 and loopMax to N when we added code for the new loops
26
Try persuading the “assembler” to pre-calculate
F3 = (1.0 / N) at ‘compile time’, not ‘run-time’
Code should now work for N = 64, so can compare timing with C code
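The same idea in C++ terms, as a sketch: `constexpr` asks the compiler to evaluate 1.0 / N at compile time, so the run-time code multiplies instead of divides. The names are hypothetical, and this shows only the C++ side, not the assembler syntax.

```cpp
constexpr int N = 64;
constexpr float ONE_OVER_N = 1.0f / N;  // evaluated by the compiler

// 1/64 is a power of two, so it is exactly representable as a float.
static_assert(ONE_OVER_N == 0.015625f, "reciprocal computed at compile time");

// Average using a multiply by the precomputed reciprocal.
float AverageWithReciprocal(const float* data) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += data[i];
    }
    return sum * ONE_OVER_N;
}
```

For values of N that are not powers of two, multiplying by a rounded reciprocal can differ slightly from a true division, which is one reason to compare results within a tolerance.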
27
If we believe tests then calculation
accuracy is lower (5E-06 for larger N)
Despite lousy ASM code we are already beating the compiler in ‘debug’ mode (around 2N)
28
Before optimizing, we need to add a
few more tests to check code valid
Uses sum of N integers: N (N + 1) / 2
Accuracy now set to 1E-5
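A sketch of such a test, with hypothetical names: fill the input with 1, 2, ..., N, so the sum is N (N + 1) / 2 and the expected average is (N + 1) / 2.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain average of N values.
float Average(const float* data, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += data[i];
    }
    return sum / N;
}

// Feed the ramp 1..N and compare with the closed-form answer (N + 1) / 2.
bool RampTest(int N, float tolerance) {
    std::vector<float> data(static_cast<std::size_t>(N));
    for (int i = 0; i < N; ++i) {
        data[static_cast<std::size_t>(i)] = static_cast<float>(i + 1);
    }
    const float expected = (N + 1) / 2.0f;
    return std::fabs(Average(data.data(), N) - expected) <= tolerance;
}
```

Because every input value differs, this test also catches a routine that skips, repeats, or copies a sample, which a constant input would not reveal.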
29
Use post-modify address mode
sum = sum + *pt++; ( N = 64)
2 cycle stall till M4 ready to use?
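In C++ terms the change is from array indexing to a pointer that is advanced after each read. Both versions below are sketches with hypothetical names; they compute the same sum and differ only in how the address is formed.

```cpp
// Array indexing: the address is rebuilt from base + i on every pass.
float SumIndexed(const float* data, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum = sum + data[i];
    }
    return sum;
}

// Post-increment pointer: read through pt, then step pt to the next element.
float SumPostIncrement(const float* data, int N) {
    const float* pt = data;
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum = sum + *pt++;
    }
    return sum;
}
```

The second form maps naturally onto a post-modify access, since the read and the pointer update are expressed as one operation.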
• ASM was 2400 cycles (N = 64), is now 2208
– Expect improvement of N = 64 cycles (2 instead of 3 instructions)
– Get (2400 – 2208) = 192, which is exactly 3 * N = 192 cycles faster
30
dm(2, I4) versus dm(I4, 2)
dm(M4, I4) versus dm(I4, M4)
• Both instructions use the ‘eye’ 4 index register
• dm(2, I4) – is a pre-modify memory operation
– The 2 is before the I4 – hence pre something
– I4 points to a memory location
– dm(2, I4) means access the memory location at (I4 + 2)
– LEAVE value in index register I4 unchanged
– Used in array addressing
• dm(I4, 2) – is a post-modify memory operation
– The 2 is after the I4 – hence post something
– I4 points to a memory location
– dm(I4, 2) means access the memory location at (I4)
– MODIFY value in index register by 2 (I4 = I4 + 2 AFTER USE)
• POST MODIFY OFFERS OPPORTUNITY FOR PROCESSOR ARCHITECTURE TO DO ADD IN PARALLEL WITH OTHER PIPELINE STAGES
31
Using pre-modify and post-modify
addressing – replace 6 instructions by 2
Expect 4 * N faster (256)
Was 2208, is 1704 = 504 cycles faster
Close to N * 8 faster!
32
Need to force “C++” to optimize
CONCLUSION: We have a lot more to learn about using the processor architecture correctly in order to get HIGH SPEED DSP CODE
NOTE: COMPILER ASSUMES GENERAL DSP CODE CHARACTERISTICS. WE KNOW MORE, so should be able to write faster code (if we need to)
• Our asm code 1704 cycles
• Optimized “C” 205 cycles
– Optimized “C” is about 1500 cycles faster, or roughly N * 23.5 cycles faster
• FIFO Loop (63 reads / 63 writes) + sum loop (64 reads + 64 adds) = 254
• Loop control = 2 * 64 * 9 + Into / out of subroutine 20 + other 10 = 1182
– Our ASM = 1436 accounted for + 268 unaccounted for (N * 4.2, or roughly N * 4)
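The cycle accounting can be re-derived from its components with a short helper. The struct and function names are hypothetical; the inputs are the figures quoted on the slide (N = 64, 9 cycles of loop control per pass over two loops, 20 cycles into and out of the subroutine, 10 other, 1704 measured).

```cpp
struct CycleAccount {
    int dataOps;      // FIFO reads and writes plus sum-loop reads and adds
    int overhead;     // loop control, call overhead, and other
    int accounted;    // dataOps + overhead
    int unaccounted;  // measured - accounted
};

CycleAccount AccountForCycles(int N, int loopControlPerPass,
                              int callOverhead, int other, int measured) {
    CycleAccount a;
    // FIFO loop: (N - 1) reads and (N - 1) writes; sum loop: N reads, N adds.
    a.dataOps = (N - 1) + (N - 1) + N + N;
    // Two loops, each going round N times.
    a.overhead = 2 * N * loopControlPerPass + callOverhead + other;
    a.accounted = a.dataOps + a.overhead;
    a.unaccounted = measured - a.accounted;
    return a;
}
```

Dividing the unaccounted cycles by N gives the per-sample cost still to be explained, which is the quantity to chase when looking for stalls in the listing.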
33
