CR18: Advanced Compilers L03: Scheduling

CR18: Advanced Compilers
L03: Transformations
Tomofumi Yuki
1
Today’s Agenda
 Transformations in the polyhedral world
 Unimodular framework
 Tiling
2
Transforming Polyhedral Objects
 Recall: Set/Relation/Function
 What are the math operators over these?
 intersection/union
 image/pre-image
 join/compose
 What does it mean to the code?
3
Simplifying Assumption
 The order of operations is the lex. order
 changing the shape of the domain = changing the order of execution
 We will revisit this later
4
Loop Skewing
 “Shift” the iteration space
for (i=1; i<=N; i++)
  for (j=1; j<=M; j++)
    S1;
for (i=1; i<=N; i++)
  for (j=1+i; j<=M+i; j++)
    S1;
5
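A minimal Python sketch (not from the slides) of the skew above: the transformation (i,j) -> (i,i+j) maps the original nest onto the skewed one, visiting exactly the same statement instances.

```python
# Skew (i,j) -> (i, i+j): enumerate both loop nests from the slide.
N, M = 4, 3

original = [(i, j) for i in range(1, N + 1) for j in range(1, M + 1)]
skewed   = [(i, j) for i in range(1, N + 1) for j in range(1 + i, M + i + 1)]

# The skewed nest enumerates exactly the images of the original points.
assert skewed == [(i, i + j) for (i, j) in original]
```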
Why do you want to skew?
 What happens to the dependences?
for (i=1; i<=N; i++)
  for (j=1; j<=M; j++)
    A[i,j] = A[i-1,j+1];
for (i=1; i<=N; i++)
  for (j=1+i; j<=M+i; j++)
    ???;
6
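A small sketch (in Python, for illustration) of how the skew transforms the dependence above: A[i,j] reads A[i-1,j+1], so the distance vector is d = (1,-1), and the skew matrix maps it to T.d.

```python
# Dependence distance under skewing: new distance = T . d
d = (1, -1)                 # A[i,j] depends on A[i-1,j+1]
T = [[1, 0],                # skew (i,j) -> (i, i+j)
     [1, 1]]

new_d = tuple(T[r][0] * d[0] + T[r][1] * d[1] for r in range(2))

# After skewing, the inner dimension carries no dependence distance,
# which is what makes the inner loop parallel.
assert new_d == (1, 0)
```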
Change of Basis
 Building block of transformations for poly. IR
 how to keep the IR consistent
 CoB: apply a transform to a statement domain
 transform = affine function
 what are the effects to the full PRDG/Alpha?
7
CoB: Skewing Example
 Apply (i,j -> i,j+i) to the following
S0: [N] -> { [i,j] : 0<=i,j<=N }
S1 : [N] -> { [i,j] : 0<=i,j<=N }
//ignore boundary cases
S0[i,j] = {j>0} : S0[i,j-1] + S1[i,j];
S1[i,j] = {i>0} : S0[i-1,j];
 Is this correct?
S0 : [N] -> { [i,j] : 0<=i<=N and i<=j<N+i}
S1 : [N] -> { [i,j] : 0<=i,j<=N }
S0[i,j] = {j>0} : S0[i,j-1] + S1[i,j];
S1[i,j] = {i>0} : S0[i-1,j];
8
CoB: Main Idea
 Apply function f to statement S
 4 cases of dependences
[diagram: f maps dom(S) to dom(S’); dependences run between dom(S) and dom(T)]
CoB: Main Idea
 What happens to the dependences?
[diagram: a dependence I between S and T becomes f-1@I, I@f, or f-1@I@f, depending on whether the source, the target, or both endpoints lie in the transformed statement S’]
How to skew in ISCC
 Apply the transformation and generate code
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    A[i,j] = A[i-1,j+1];
D := [N]->{[i,j] : 1<=i,j<=N};
F := [N]->{[i,j] -> [i,i+j]};
E := range(F*D);
codegen E;
for (int c0 = 1; c0 <= N; c0 += 1)
  for (int c1 = c0 + 1; c1 <= N + c0; c1 += 1)
    (c0, c1);
11
How to skew in ISCC
 You get different code if you use a relation
for (i=1; i<=N; i++)
  for (j=1; j<=N; j++)
    A[i,j] = A[i-1,j+1];
E := [N]->{[i,j]->[i,i+j] : 0<=i,j<=N };
codegen (E);
for (int c0 = 1; c0 <= N; c0 += 1)
  for (int c1 = c0 + 1; c1 <= N + c0; c1 += 1)
    (c0, -c0 + c1);
12
Schedules
 Mapping from iteration points to execution order
 different notions exist
 In the next lecture, we will cover how to automatically come up with them
 for now, we will figure them out on our own
13
Schedules
 Not necessarily one iteration at a time, e.g., (i,j->i)
 But codegen can become ambiguous
R := [N] -> {
  S1[i,j] -> [i] : 0<=i,j<=N;
  S2[i,j] -> [i] : 0<=i,j<=N};
codegen R;
for (int c0 = 0; c0 <= N; c0 += 1) {
  for (int c2 = 0; c2 <= N; c2 += 1)
    S2(c0, c2);
  for (int c2 = 0; c2 <= N; c2 += 1)
    S1(c0, c2);
}
14
Space-Time Mapping
 More precise notion
 time: time stamp for iterations
 space: (virtual) processor assigned
 Example: θ(i,j) = i, π(i,j) = j
for i       (θ)
  forall j  (π)
    S
15
Schedules in Loop Context
 Often used interchangeably
 How to transform a loop:
[N] -> { [i,j] -> [i,i+j] };  +  parallel := [true, false]
  ↓
forall i
  for j
    S;
16
Schedules in Loop Context
 You can also use constant dimensions
 to modify loop structure
 to specify statement orders
θS1(i) = [0,i]
θS2(i) = [1,i]
for i
  S1;
for i
  S2;
θS1(i) = [i,1]
θS2(i) = [i,0]
for i
  S2;
  S1;
17
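The constant dimensions above can be checked with a small Python sketch (an illustration, not from the slides): sorting statement instances by their schedule tuples in lexicographic order reproduces the two loop structures.

```python
# Two statements S1, S2 over i = 0..N-1, ordered by schedule tuples.
N = 3
insts = [("S1", i) for i in range(N)] + [("S2", i) for i in range(N)]

# θS1(i) = [0,i], θS2(i) = [1,i]: all of S1 first, then all of S2.
order_a = sorted(insts, key=lambda s: (0, s[1]) if s[0] == "S1" else (1, s[1]))
assert [s[0] for s in order_a] == ["S1"] * N + ["S2"] * N

# θS1(i) = [i,1], θS2(i) = [i,0]: one fused loop, S2 before S1 each iteration.
order_b = sorted(insts, key=lambda s: (s[1], 1) if s[0] == "S1" else (s[1], 0))
assert [s[0] for s in order_b] == ["S2", "S1"] * N
```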
Schedule in ISL
 When a relation is given to codegen, the RHS is taken to be the schedule
 should be in a common space
D := [N] -> { S1[i] -> [0,i] : 0<=i<=N;
              S2[i] -> [1,i] : 0<=i<=N };
codegen D;
{
  for (int c1 = 0; c1 <= N; c1 += 1)
    S1(c1);
  for (int c1 = 0; c1 <= N; c1 += 1)
    S2(c1);
}
18
Schedule in ISL
 Otherwise it is treated as unordered
D := [N] -> { S1[i] -> A[0,i] : 0<=i<=N;
              S2[i] -> B[1,i] : 0<=i<=N };
codegen D;
{
  for (int c1 = 0; c1 <= N; c1 += 1)
    S2(c1);
  for (int c1 = 0; c1 <= N; c1 += 1)
    S1(c1);
}
19
Overloading of Schedules
 A schedule may be
 a timestamp for iterations (θ)
 a space-time mapping (θ+π)
 an abstraction of loop structure (θ with 2d+1)
 a code gen strategy (θ + aux. info)
20
Back to CoB
 What are the properties required for f?
[diagram: as on slide 10, the dependences involving S’ are f-1@I, I@f, and f-1@I@f]
21
Unimodular Framework
 Earlier variant of polyhedral-ish framework
 Model transformations as f = Ax
 where A is unimodular
 What does this restriction bring?
22
Unimodular Framework
 Main Flow
 select a transformation (will skip for today)
 apply the transformation to the loop bounds
 apply the transformation to array accesses
 Limitations
 much coarser grained than polyhedral
 same transformation for the entire loop nest
23
Unimodular Framework
 Given transformation T (new iteration vector y = T.x)
 loop bounds: L.x ≥ m becomes L.T-1.y ≥ m
 array accesses: A.x becomes A.T-1.y
 Recall CoB
 changes to bounds = exactly the same
 let the array access function be g
 g becomes g@f-1, so that (g@f-1)(f(i)) = g(i)
 More or less subsumed by the polyhedral model
24
Key Feature of Unimodular FW
 Composition of transformations
 Applying transformation T1 and then T2
 easy to show that it is equivalent to applying a single transformation T2.T1
 Enables exploration of arbitrary combinations of transformations
 e.g., skew + interchange is just another matrix
25
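A minimal Python sketch (for illustration) of the composition property above: composing a skew and an interchange is a single matrix multiplication, and the result is again unimodular (|det| = 1).

```python
# Composition of unimodular transformations = matrix product T2.T1
def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

skew        = [[1, 0], [1, 1]]   # (i,j) -> (i, i+j)
interchange = [[0, 1], [1, 0]]   # (i,j) -> (j, i)

combined = matmul(interchange, skew)   # apply skew first, then interchange

assert abs(det2(skew)) == 1 and abs(det2(interchange)) == 1
assert abs(det2(combined)) == 1        # still unimodular
assert combined == [[1, 1], [1, 0]]    # (i,j) -> (i+j, i)
```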
What is missing?
 What is the space of transformations that can be expressed in the unimodular FW?
26
Loop Skewing
 “Enabler” transformation
for (i=1; i<=N; i++)
  for (j=1; j<=M; j++)
    A[i,j] = A[i-1,j+1];
 We already saw this one
 distance vector: [1,-1]
 Find f([1,-1]T) s.t.?
28
Loop Fusion
 Can you fuse these loops?
for (i=1; i<=N; i++)
  A[i] = foo();
for (i=1; i<N; i++)
  B[i] = bar(A[i]);
S0: [N]-> { [i] : 0<=i<=N };
S1: [N]-> { [i] : 0<=i<N };
 Why would you want to fuse them?
 What are the schedules?
29
Loop Fusion 2
 Can you fuse them now?
for (i=1; i<=N; i++)
  A[i] = foo();
for (i=1; i<N; i++)
  B[i] = bar(A[i+1]);
S0: [N]-> { [i] : 0<=i<=N };
S1: [N]-> { [i] : 0<=i<N };
30
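The contrast between the two fusion slides can be checked with a Python sketch (hypothetical foo/bar bodies, chosen only so the results are observable): reading A[i] survives fusion, reading A[i+1] does not.

```python
# Fusion legality check by execution: compare unfused vs. naively fused.
N = 6
foo = lambda i: i * i      # hypothetical producer (takes i for observability)
bar = lambda x: x + 1      # hypothetical consumer

def unfused(read_ofs):
    A = [0] * (N + 2)
    B = [0] * (N + 2)
    for i in range(1, N + 1):
        A[i] = foo(i)
    for i in range(1, N):
        B[i] = bar(A[i + read_ofs])
    return B

def fused(read_ofs):
    A = [0] * (N + 2)
    B = [0] * (N + 2)
    for i in range(1, N + 1):
        A[i] = foo(i)
        if i < N:
            B[i] = bar(A[i + read_ofs])
    return B

assert fused(0) == unfused(0)   # reading A[i]: fusion preserves the result
assert fused(1) != unfused(1)   # reading A[i+1]: fused loop reads a stale value
```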
Affine Loop Transformations
 It is some composition of:
 loop skewing (unimodular)
 loop permutation (unimodular)
 loop reversal (unimodular)
 loop shifting
 loop fusion
 loop fission
 loop unrolling
 loop tiling
31
Tiling
 Well-known transformation in HPC [Wolfe 87]
for t=0; t<T; t++
  for i=1; i<N-1; i++
    A[i] = f(B[i], B[i-1], B[i+1])
  //swap A and B
 Improves data locality
 Performance improvement
 my laptop: 5.95s -> 4.76s (20%)
 PLuTo: ~1.8x (with an older processor)
32
So what is Tiling?
 Loop transformation
 Effect: change the order of execution
for t=0; t<T; t++
  for i=1; i<N-1; i++
    foo()
for tt=0; tt<T; tt+=x
  for ti=1; ti<N-1; ti+=y
    for t=tt; t<min(tt+x,T); t++
      for i=ti; i<min(ti+y,N-1); i++
        foo()
33
Visualization of Tiling
 Improve locality by exploiting temporal locality
 Each tile becomes an atomic unit
[figure: iteration space over (t,i) before and after tiling]
34
What is Tiling?
 Loop transformation that doubles the depth
for (t=0; t<N; t++)
  for (i=0; i<M; i++)
    S
Tile Loops (enumerating the Tile Origins):
for (x=0; x<N; x+=3)
  for (y=0; y<M; y+=3)
    S’
35
What is Tiling?
 Loop transformation that doubles the depth
for (t=0; t<N; t++)
  for (i=0; i<M; i++)
    S
Point Loops:
for (t=x; t<x+3; t++)
  for (i=y; i<y+3; i++)
    S’
36
What is Tiling?
 Loop transformation that doubles the depth
for (t=0; t<N; t++)
  for (i=0; i<M; i++)
    S
Point Loops (i<y+3 corrected to handle the boundary at M):
for (t=x; t<x+3; t++)
  for (i=y; i<min(y+3,M); i++)
    S
37
What is Tiling?
 Loop transformation that doubles the depth
for (t=0; t<N; t++)
  for (i=0; i<M; i++)
    S
Tile Loops:
for (x=0; x<N; x+=3)
  for (y=0; y<M; y+=3)
    S’
Point Loops:
for (t=x; t<min(x+3,N); t++)
  for (i=y; i<min(y+3,M); i++)
    S
38
What is Tiling?
 Loop transformation that doubles the depth
for (t=0; t<N; t++)
  for (i=0; i<M; i++)
    S
for (x=0; x<N; x+=3)
  for (y=0; y<M; y+=3)
    for (t=x; t<min(x+3,N); t++)
      for (i=y; i<min(y+3,M); i++)
        S
39
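A quick Python sketch (an illustration, not from the slides) that the tiled nest above visits exactly the same points as the original nest, just in a different order:

```python
# Tiled vs. untiled traversal of the same iteration space.
N, M, s = 7, 5, 3    # s = tile size (3 on the slide)

original = [(t, i) for t in range(N) for i in range(M)]
tiled = [(t, i)
         for x in range(0, N, s)
         for y in range(0, M, s)
         for t in range(x, min(x + s, N))
         for i in range(y, min(y + s, M))]

assert sorted(tiled) == sorted(original)   # same set of points...
assert tiled != original                   # ...in a different order
```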
Legality of Tiling
 Is this tiling legal?
[figure: tiled iteration space with dependences, axes t and i]
40
Legality of Tiling
 Is this tiling legal?
 Fully Permutable ≈ Tilable
41
Variations of Tiling
 Rectangular tiles are not always legal
 tiles cannot be mutually dependent
[figure: rectangular tiles whose dependences make two tiles depend on each other]
42
Oblique Tiling
 Parallelograms
 avoid mutual dependence
43
Oblique Tiling
 Parallelograms
 avoid mutual dependence
 wave-front parallelism
45
Overlapped Tiling
 Pros:
 less frequent communication
46
Tile Sizes and Shapes
 More decisions for the compiler to make
 out of many variations, what to use?
 what should be the size of each tile?
 Tile sizes have a huge impact on performance
 easily 2-5x difference
 much, much more with parallel execution
 topic of HW2
47
Expressing Tiling as Schedules
 What is the schedule for the tile loops?
for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S
[N] -> { [i,j] -> [ti,tj] : 0<=i,j<=N and
         ti=3i and tj=3j and 0<=ti,tj<=N};
?
for (x=0; x<=N; x+=3)
  for (y=0; y<=N; y+=3)
    S’
48
Expressing Tiling as Schedules
 What is the schedule for the tile loops?
for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
    S
[N] -> { [i,j] -> [ti,tj,i',j',x,y] :
         0<=i,j<=N and
         ti=3x and tj=3y and 0<=ti,tj<=N and
         ti<=i'<ti+3 and tj<=j'<tj+3 and
         i=i' and j=j'};
for (ti=0; ti<=N; ti+=3)
  for (tj=0; tj<=N; tj+=3)
    for (t=ti; t<=min(ti+2,N); t++)
      for (i=tj; i<=min(tj+2,N); i++)
        S
49
Alternative Formulation
 Different view of tile origins
for (ti=0; ti<=N; ti+=3)
  for (tj=0; tj<=N; tj+=3)
    for (i=ti; i<=min(ti+2,N); i++)
      for (j=tj; j<=min(tj+2,N); j++)
        S<i,j>
for (ti=0; ti<=N/3; ti++)
  for (tj=0; tj<=N/3; tj++)
    for (i=0; i<3; i++)
      for (j=0; j<3; j++)
        S<3ti+i,3tj+j>
50
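The two formulations above can be compared with a Python sketch (an illustration, not from the slides): sorting iteration points either by tile origins or by tile indices plus local offsets yields the same tiled execution order.

```python
# Two equivalent schedule tuples for tiling, checked by sorting.
N, s = 7, 3

pts = [(i, j) for i in range(N + 1) for j in range(N + 1)]

# tile origins as leading dimensions: (3*floor(i/3), 3*floor(j/3), i, j)
by_origin = sorted(pts, key=lambda p: (s * (p[0] // s), s * (p[1] // s),
                                       p[0], p[1]))
# tile indices + local offsets: (floor(i/3), floor(j/3), i mod 3, j mod 3)
by_index = sorted(pts, key=lambda p: (p[0] // s, p[1] // s,
                                      p[0] % s, p[1] % s))

assert by_origin == by_index
```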
Exercises
 Using ISCC, review
 array dataflow analysis
 transformations
51
ADA with ISCC
 Recall the good old example:
for i = 0 .. N
  for j = 0 .. M
    A[j] = foo(A[j], A[j+1])
 Two problems:
 A[j] <> A[j] pair
 A[j] <> A[j+1] pair
52
ADA with ISCC
 A[j] <> A[j] pair
for i = 0 .. N
  for j = 0 .. M
    A[j] = foo(A[j], A[j+1])
// reader instance as parameters; constraints: memory conflict + lex order
PRef1 := [N,M,i,j] ->
  { [i',j'] : 0<=i,i'<N and 0<=j,j'<M and j=j'
      and i'<i;
    [i',j'] : 0<=i,i'<N and 0<=j,j'<M and j=j'
      and i'=i and j'<j};
lexmax(PRef1);
[N, M, i, j] -> { [-1 + i, j] : i <= -1 + N and
  j >= 0 and j <= -1 + M and i >= 1 }
53
ADA with ISCC
 A[j] <> A[j+1] pair
for i = 0 .. N
  for j = 0 .. M
    A[j] = foo(A[j], A[j+1])
PRef2 := [N,M,i,j] ->
  { [i',j'] : 0<=i,i'<N and 0<=j+1,j'<M and j'=j+1
      and i'<i;
    [i',j'] : 0<=i,i'<N and 0<=j+1,j'<M and j'=j+1
      and i'=i and j'<j};
lexmax(PRef2);
[N, M, i, j] -> { [-1 + i, 1 + j] : i <= -1 + N and
  j >= 0 and j <= -2 + M and i >= 1 }
54
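A brute-force Python sketch (an illustration, not ISCC) of the dataflow analysis above, checking the lexmax answer for the A[j] read: the producer of the value read at (i,j) is the write at (i-1, j) whenever i >= 1.

```python
# Brute-force ADA for the A[j] <> A[j] pair: every iteration (i',j')
# writes A[j']; find the lexicographically last write before the read.
N, M = 4, 5

def producer(i, j):
    cands = [(ip, jp) for ip in range(N) for jp in range(M)
             if jp == j and (ip, jp) < (i, j)]   # tuple < is lex order
    return max(cands) if cands else None

for i in range(N):
    for j in range(M):
        expect = (i - 1, j) if i >= 1 else None  # matches the lexmax result
        assert producer(i, j) == expect
```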
ADA with ISCC 2
 Multi-statement and sloppy specification
for (i = 0 .. M)
  A[i] = 0;
for (i = 0 .. N)
  A[i] = 1;
for (i = 0 .. Z)
  B[i] = A[i];
PRef1 := [N,M,Z,i] -> {
  [0,i'] : i'=i and 0<=i<=Z and 0<=i'<=M;
  [1,i'] : i'=i and 0<=i<=Z and 0<=i'<=N};
PRef1 := [N,M,Z,c2,i] -> {
  [c,i'] : i'=i and 0<=i<=Z and 0<=i'<=M;
  [c,i'] : i'=i and 0<=i<=Z and 0<=i'<=N};
// with (c=0 or c=1) and c2=2, i.e., c < c2
55
Being more Sloppy
 Using the << operator in ISCC
 m3 := m1 << m2
 a map from the domain of m1 to the domain of m2, of those elements whose images live in the same space and such that the images of the elements of m1 are lexicographically strictly smaller than those of m2
 Also use lexmax on relations
 gives output parameterized by the LHS
56
The << Operator Explained
 Given
 two statement domains d1 and d2
 two schedules for the statements, f1 and f2
 a schedule in this context is the loop structure
 Create maps m1 and m2 as f1 * d1 and f2 * d2
 Then use the << operator on the resulting maps
 Result: a map that restricts d1 to be lex. before d2 according to f1 and f2
57
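A Python sketch (an illustration, not ISCC) of what << computes, using schedules of the shape [i,0,j] and [i,1,j] as in the example that follows:

```python
# Recreate m1 << m2 on explicit schedule tuples.
N, P, Q = 2, 2, 2
DS0 = [(i, j) for i in range(N + 1) for j in range(P + 1)]
DS1 = [(i, j) for i in range(N + 1) for j in range(Q + 1)]
FS0 = lambda i, j: (i, 0, j)    # S0 is the first loop in iteration i
FS1 = lambda i, j: (i, 1, j)    # S1 is the second loop in iteration i

# (FS1*DS1) << (FS0*DS0): pairs whose FS1-image is lex. before the FS0-image
before = [(s1, s0) for s1 in DS1 for s0 in DS0 if FS1(*s1) < FS0(*s0)]

# With these schedules, S1[i,j] runs before S0[i',j'] exactly when i < i'.
assert before == [(s1, s0) for s1 in DS1 for s0 in DS0 if s1[0] < s0[0]]
```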
The << Operator Example
 ADA for the following program
 find the producer for S0
for (i=0 .. N) {
  for (j=0 .. P)
S0: A[j] = A[j];
  for (j=0 .. Q)
S1: A[j] = A[j];
}
DS0 := [N,P,Q] -> { S0[i,j] : 0<=i<=N and 0<=j<=P };
DS1 := [N,P,Q] -> { S1[i,j] : 0<=i<=N and 0<=j<=Q };
FS0 := [N,P,Q] -> { S0[i,j] -> [i,0,j] };
FS1 := [N,P,Q] -> { S1[i,j] -> [i,1,j] };
S1beforeS0 := (FS1*DS1) << (FS0*DS0);
S1beforeS0 := (FS0*DS0) >> (FS1*DS1);
S0beforeS0 := (FS0*DS0) >> (FS0*DS0);
conflictS0 := [N,P,Q] -> { S0[i,j] -> S0[i',j'] : j=j'};
conflictS1 := [N,P,Q] -> { S0[i,j] -> S1[i',j'] : j=j'};
58
Some ISCC Operators
 You probably need:
 m1 + m2: union of two maps
 m1 * d1: intersect d1 with the domain of m1
 m1 . m2: join
 domain(m): domain of m
 range(m): range of m
 coalesce x: simplify x
Exercise
 http://perso.ens-lyon.fr/tomofumi.yuki/courses/exercise150930.txt
60
Systolic Arrays
 Pipeline of Processors
 each one is identical to the others
 nearest neighbor communication
 synchronized with a “heart-beat”
 Advantages:
 highly parallel and scalable
 Disadvantages:
 specialized and difficult to design
61
Systolic Matrix Multiplication
 Example of a 2D Systolic Array
[figure: one PE with inputs Ain and Bin, a multiplier and an adder accumulating into a register, and outputs Aout and Bout]
62
Systolic MM in Action
[slides 63-77: animation of a 4x4 systolic matrix multiplication on a 4x4 PE grid; the rows of A stream in from the left (row r delayed by r cycles) and the columns of B stream in from the top (column c delayed by c cycles); each cycle, every PE multiplies the A and B values passing through it, accumulates, and forwards A to the right and B downward]
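The animation can be condensed into a small Python simulation (a sketch, assuming the standard streaming pattern: A rows from the left and B columns from the top, each skewed by one cycle per row/column):

```python
def systolic_matmul(A, B):
    """Simulate the 2D systolic array: each cycle, every PE multiplies its
    current A and B inputs, accumulates into its register, and forwards
    A rightward and B downward."""
    n = len(A)
    a = [[0] * n for _ in range(n)]    # A value held by each PE
    b = [[0] * n for _ in range(n)]    # B value held by each PE
    acc = [[0] * n for _ in range(n)]  # per-PE accumulator (the "reg")
    for t in range(3 * n - 2):         # enough cycles for the last wavefront
        for r in range(n):             # shift A one PE to the right
            for c in range(n - 1, 0, -1):
                a[r][c] = a[r][c - 1]
            a[r][0] = A[r][t - r] if 0 <= t - r < n else 0
        for c in range(n):             # shift B one PE downward
            for r in range(n - 1, 0, -1):
                b[r][c] = b[r - 1][c]
            b[0][c] = B[t - c][c] if 0 <= t - c < n else 0
        for r in range(n):             # every PE multiply-accumulates
            for c in range(n):
                acc[r][c] += a[r][c] * b[r][c]
    return acc

assert systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```

The skewing guarantees that A[r][k] and B[k][c] meet at PE(r,c) at cycle r+k+c, so each accumulator ends up holding the dot product of row r and column c.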
Automatic Synthesis
Slides from http://www.cs.colostate.edu/~cs560/Spring2013/
 Write the computation in Alpha
 Align inputs and outputs
 Serialize reductions
 Uniformize dependences
 Schedule Alpha
 Allocate computation to processors
 Transform Alpha
 Generate HDL
78
Automatic Synthesis
 Write the computation in Alpha
 Align inputs and outputs
 Serialize reductions (part of scheduling)
 Uniformize dependences (part of scheduling)
 Schedule Alpha
 Allocate computation to processors
 Transform Alpha
 Generate HDL
79
FIR Filter
 Align inputs
y_i = Σ_{j=0}^{N} a_j x_{i-j}
y_i = Σ_{j=0}^{N} A[0,j] * X[i-j,0]
A[0,j] = a_j
X[i,0] = x_i
[figure: the 2D (i,j) space with A along the j-axis and X along the i-axis]
80
Serialization
 A reduction can be computed in many different orders
 Serialization = select an order
 Example: reduce(+, [i], {|0<=i<5} : x[i])
[figure: five inputs feeding a single Σ node producing res]
81
Serialization
 A reduction can be computed in many different orders
 Serialization = select an order
 Example: reduce(+, [i], {|0<=i<5} : x[i])
[figure: the same five inputs combined by a chain of + operators producing res]
82
FIR Filter
 Serialize the reduction
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[0,j] * X[i-j,0];
           j > 0 : Y[i,j-1] + A[0,j] * X[i-j,0] }
A[0,j] = a_j
X[i,0] = x_i
83
Uniformization
 Affine dependences can be replaced by chains of uniform dependences
 Also called localization
 localizes communication
 Example: A[i] = foo(B[0], …)
84
Uniformization
 Affine dependences can be replaced by chains of uniform dependences
 Also called localization
 localizes communication
 Example: A[i] = foo(B[i], …) with B[i] = B[i-1];
85
FIR Filter
 Uniformize dependences (starting point)
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[0,j] * X[i-j,0];
           j > 0 : Y[i,j-1] + A[0,j] * X[i-j,0] }
A[0,j] = a_j
X[i,0] = x_i
86
FIR Filter
 Uniformize dependences (A and X pipelined)
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[i,j] * X[i-j,0];
           j > 0 : Y[i,j-1] + A[i,j] * X[i-j,0] }
A[i,j] = { i = 0 : a_j;
           i > 0 : A[i-1,j] }
X[i,j] = { j = 0 : x_i;
           j > 0 : X[i,j-1] }
87
FIR Filter
 Uniformize dependences (X pipelined diagonally)
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[i,j] * X[i,j];
           j > 0 : Y[i,j-1] + A[i,j] * X[i,j] }
A[i,j] = { i = 0 : a_j;
           i > 0 : A[i-1,j] }
X[i,j] = { j = 0 : x_i;
           j > 0 : X[i-1,j-1] }
88
FIR Filter
 Final dependences
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[i,j] * X[i,j];
           j > 0 : Y[i,j-1] + A[i,j] * X[i,j] }
A[i,j] = { i = 0 : a_j;
           i > 0 : A[i-1,j] }
X[i,j] = { j = 0 : x_i;
           j > 0 : X[i-1,j-1] }
89
FIR Filter
 Schedule
[figure: the (i,j) dependence graph to be scheduled]
90
FIR Filter
 Schedule: θ(i,j) = i+j
[figure: wavefronts of constant i+j over the (i,j) space]
91
FIR Filter
 Allocate: π(i,j) = j (with schedule θ(i,j) = i+j)
92
FIR Filter
 Transform: make i=θ, j=π
 θ(i,j) = i+j, π(i,j) = j
93
FIR Filter
 Final equations
 difference in i = delay; difference in j = PE
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[i,j] * X[i,j];
           j > 0 : Y[i-1,j-1] + A[i,j] * X[i,j] }
A[i,j] = { i = 0 : a_j;
           i > 0 : A[i-1,j] }
X[i,j] = { j = 0 : x_i;
           j > 0 : X[i-2,j-1] }
94
FIR Filter
 Final equations (same as the previous slide)
 difference in i = delay; difference in j = PE
 X[i,j] = X[i-2,j-1] needs a register for a delay of 2
95
FIR Filter
 Final equations (same as the previous slide)
 A[i,j] = A[i-1,j]: no communication, the coefficient stays at its PE
96
FIR Filter
 Generate HDL
[figure: one PE with inputs X and a, a multiplier and an adder producing Y, plus delay registers]
y_i = Y[i,N]
Y[i,j] = { j = 0 : A[i,j] * X[i,j];
           j > 0 : Y[i-1,j-1] + A[i,j] * X[i,j] }
A[i,j] = { i = 0 : a_j;
           i > 0 : A[i-1,j] }
X[i,j] = { j = 0 : x_i;
           j > 0 : X[i-2,j-1] }
97
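The serialized FIR recurrence from these slides can be checked with a Python sketch against the direct convolution (assuming x_k = 0 for k < 0, a boundary convention not stated on the slides; the coefficient and sample values are arbitrary examples):

```python
# Serialized FIR: y_i = Y[i,N], Y[i,j] = Y[i,j-1] + a_j * x_{i-j}
N = 3                     # filter order, as on the slides
a = [2, -1, 4, 3]         # a_0 .. a_N (example values)
x = [1, 5, 0, 2, 7, 3]    # input samples x_0 .. (example values)

def x_at(k):
    return x[k] if 0 <= k < len(x) else 0   # boundary assumption: zeros

def y_direct(i):
    # direct convolution: y_i = sum_{j=0..N} a_j * x_{i-j}
    return sum(a[j] * x_at(i - j) for j in range(N + 1))

def y_recurrence(i):
    # Y[i,j] = { j = 0 : a_j * x_{i-j}; j > 0 : Y[i,j-1] + a_j * x_{i-j} }
    Y = a[0] * x_at(i)
    for j in range(1, N + 1):
        Y = Y + a[j] * x_at(i - j)
    return Y              # y_i = Y[i,N]

assert all(y_recurrence(i) == y_direct(i) for i in range(len(x)))
```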