FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
September 27th, 2007
Overview
• Multiprocessor Implementation
– Problems faced
– Solutions
– Results
• FPGA IO
– Work done
– Problems faced
– Possible solutions
Multiprocessor FFT: Problems
• The previous code worked for some inputs but not all
• The program seemed to communicate well but was still error-prone
• Lots of segmentation faults (even after getting the results)
  – A serial debugger does not work on the parallel runs
  – Commercial debuggers are available, but evaluation is restricted to a single IP for 30 days
Suggested solutions (lam-mpi/google groups)
• “Execution environment does not match the compile environment”
• The same code worked with MPICH version 2, GCC
• The complex datatype is NOT supported in the C version (but MPI_2COMPLEX seemed to work for me)
• Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX (this worked; see the sketch below)
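
A minimal sketch of the fix that worked: a std::complex<float> buffer transferred with MPI::COMPLEX through the MPI C++ bindings, compiled with mpic++. The buffer size, ranks, and tag below are illustrative, not taken from our code.

    #include <mpi.h>
    #include <complex>
    #include <vector>

    int main(int argc, char** argv) {
        MPI::Init(argc, argv);
        const int rank = MPI::COMM_WORLD.Get_rank();
        const int N = 1024;                            // illustrative size
        std::vector<std::complex<float> > buf(N);

        if (rank == 0) {
            buf[0] = std::complex<float>(1.0f, -1.0f); // sample data
            MPI::COMM_WORLD.Send(&buf[0], N, MPI::COMPLEX, 1, 0);
        } else if (rank == 1) {
            MPI::COMM_WORLD.Recv(&buf[0], N, MPI::COMPLEX, 0, 0);
        }

        MPI::Finalize();
        return 0;
    }

The C++ bindings sidestep the unsupported C complex datatype entirely: MPI-2 defines MPI::COMPLEX to match std::complex<float>.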
System Info (Identical for all)
• Machine 1: Saveri
• Machine 2: Abhogi
• Machine 3: Sahana
• Machine 4: Jaunpuri
• Sysinfo:
  – Intel Pentium 4, 3.4 GHz
  – Cache size: 2048 KB
  – RAM: 1 GB
  – Operating system: Fedora Core 6
  – Compiler: mpic++
  – Flags: -O3 -march=pentium4
  – FFT: radix-2
Theoretical Execution time
• For p processors, the total execution time is:
  T_N/p + (1 - 1/p)(2N/B + K_N)
• p is a power of 2
• T_N is the time taken to compute the FFT of input size N
• K_N is the time taken to combine two N-point FFTs
• B is the network bandwidth (bytes/sec)
Nature of this function
• Sum of two terms:
  – T_N/p
  – (1 - 1/p)(2N/B + K_N)
• When T_N/p dominates, adding processors reduces the total time
• When (1 - 1/p)(2N/B + K_N) dominates, the communication and combining overhead outweighs the gain from splitting the work (see the sketch below)
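
To see which term wins where, a small sketch that evaluates the model for p = 1, 2, 4, 8. The values of T_N, K_N, and B below are illustrative assumptions, not measurements; N is converted to bytes assuming 8 bytes per complex<float> sample.

    #include <cstdio>

    int main() {
        const double TN = 80.0;              // assumed serial FFT time (s)
        const double KN = 4.0;               // assumed combine time (s)
        const double N  = 33554432 * 8.0;    // input size in bytes (8 B per complex<float>)
        const double B  = 12.5e6;            // assumed bandwidth: 100 Mbit/s in bytes/s

        for (int p = 1; p <= 8; p *= 2) {
            // T_N/p + (1 - 1/p)(2N/B + K_N)
            double t = TN / p + (1.0 - 1.0 / p) * (2.0 * N / B + KN);
            std::printf("p = %d: predicted time = %.2f s\n", p, t);
        }
        return 0;
    }

With a large T_N the curve falls as p grows; with a small T_N (a smaller input) the nearly fixed communication term takes over and the curve rises instead.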
Input: 8388608
[Chart: real time (s) vs. # processors. 1: 6.44 s, 2: 6.73 s, 4: 9.12 s]
Input: 8388608, #processors = 4
[Pie chart: communication 15%, execution time 85%]
Input: 8388608, #processors = 2
[Pie chart: communication 41%, execution time 59%]
Input: 16777216
[Chart: real time (s) vs. # processors. 1: 13.48 s, 2: 13.69 s, 4: 18.05 s]
Input: 16777216, #processors = 4
[Pie chart: communication 16%, execution time 84%]
Input: 16777216, #processors = 2
[Pie chart: communication 31%, execution time 69%]
Input: 33554432
[Chart: real time (s) vs. # processors. 1: 83.17 s, 2: 57.56 s, 4: 51.1 s]
Input: 33554432, #processors = 4
[Pie chart: communication 44%, execution time 56%]
Input: 33554432, #processors = 2
[Pie chart: communication 43%, execution time 57%]
Input: 67108864
[Chart: real time (s) vs. # processors (1, 2, 4); y-axis 0 to 3000 s]
Input: 67108864, #processors = 4
[Pie chart: communication 12%, execution time 88%]
Input: 67108864, #processors = 2
[Pie chart: communication 11%, execution time 89%]
Inference
• An input of 33554432 is roughly the break-even point (thereafter we start getting speedup)
• Below this point:
  – the execution time increases as the # processors increases
  – the percentage communication time decreases as the # processors increases
• Above this point:
  – the execution time decreases as the # processors increases
  – the percentage communication time increases as the # processors increases
Possible errors
• We measure real (wall-clock) time, which is affected by the load on a particular processor (see the timing sketch below)
• Network communication latency affects the time taken to establish a synchronous handshake
• The pipeline is actually not “perfect”
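
For reference, a minimal sketch of how the measured region can be bracketed with MPI::Wtime. Wall-clock time still includes load effects, but the barriers at least make every process time the same region.

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char** argv) {
        MPI::Init(argc, argv);
        MPI::COMM_WORLD.Barrier();           // start all processes together
        double t0 = MPI::Wtime();
        // ... FFT computation and communication go here ...
        MPI::COMM_WORLD.Barrier();           // wait for the slowest process
        double t1 = MPI::Wtime();
        if (MPI::COMM_WORLD.Get_rank() == 0)
            std::cout << "elapsed: " << (t1 - t0) << " s" << std::endl;
        MPI::Finalize();
        return 0;
    }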
4 processor pipelined layout
[Timeline diagram, stages left to right, one row per processor:]
– P1: Send(2), Send(3) | FFT(N/4) | Recv(3), Combine | Recv(1), Combine
– P2: Recv(1), Send(4) | FFT(N/4) | Recv(4), Combine | Send(1)
– P3: Recv(1) | FFT(N/4) | Send(1)
– P4: Recv(2) | FFT(N/4) | Send(2)
– Stage widths: N/2B and N/4B (distribution), T_{N/4} (FFT), N/4B and K_{N/4} (first combine), N/2B and K_{N/2} (final combine)
– Note: the time taken by these stages can surpass the stage boundaries
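
A hedged sketch of this 4-process pattern in MPI (P1..P4 = ranks 0..3). fft() and combine() are empty stubs standing in for our radix-2 routines, and the sizes, tags, and buffer handling (the result should really double after each combine) are simplified assumptions, not our actual code.

    #include <mpi.h>
    #include <algorithm>
    #include <complex>
    #include <vector>

    typedef std::complex<float> cf;

    // Stubs standing in for the project's radix-2 routines (assumptions):
    static void fft(std::vector<cf>&) { /* in-place FFT of one quarter */ }
    static void combine(std::vector<cf>&, const std::vector<cf>&) { /* butterfly merge */ }

    int main(int argc, char** argv) {
        MPI::Init(argc, argv);
        const int r = MPI::COMM_WORLD.Get_rank();   // P1..P4 = ranks 0..3
        const int Q = 1 << 18;                      // N/4 points per process (illustrative)
        std::vector<cf> mine(Q), other(Q);

        // Distribution stage, mirroring the diagram's Send/Recv pairs.
        if (r == 0) {                               // P1: Send(2) a half, Send(3) a quarter
            std::vector<cf> input(4 * Q);           // the full input lives on P1
            MPI::COMM_WORLD.Send(&input[2 * Q], 2 * Q, MPI::COMPLEX, 1, 0);
            MPI::COMM_WORLD.Send(&input[Q], Q, MPI::COMPLEX, 2, 0);
            std::copy(input.begin(), input.begin() + Q, mine.begin());
        } else if (r == 1) {                        // P2: Recv(1) a half, Send(4) a quarter
            std::vector<cf> half(2 * Q);
            MPI::COMM_WORLD.Recv(&half[0], 2 * Q, MPI::COMPLEX, 0, 0);
            MPI::COMM_WORLD.Send(&half[Q], Q, MPI::COMPLEX, 3, 0);
            std::copy(half.begin(), half.begin() + Q, mine.begin());
        } else {                                    // P3: Recv(1); P4: Recv(2)
            MPI::COMM_WORLD.Recv(&mine[0], Q, MPI::COMPLEX, r - 2, 0);
        }

        fft(mine);                                  // FFT(N/4) on every process

        // First combine: P3 -> P1 and P4 -> P2, then a butterfly merge (K_{N/4}).
        if (r >= 2) {
            MPI::COMM_WORLD.Send(&mine[0], Q, MPI::COMPLEX, r - 2, 1);
        } else {
            MPI::COMM_WORLD.Recv(&other[0], Q, MPI::COMPLEX, r + 2, 1);
            combine(mine, other);
        }

        // Final combine: P2 Send(1) its half, P1 merges the result (K_{N/2}).
        if (r == 1) MPI::COMM_WORLD.Send(&mine[0], Q, MPI::COMPLEX, 0, 2);
        if (r == 0) {
            MPI::COMM_WORLD.Recv(&other[0], Q, MPI::COMPLEX, 1, 2);
            combine(mine, other);
        }

        MPI::Finalize();
        return 0;
    }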
Further Work
• Rewrite the code with the new data type in C
• Optimize the code
• Try with more processors?
• Analyze using profilers?
FPGA: PCI IO
• Built and ran the admxrc2 demos
• Studied the wrapper and VHDL codes
• Struct ADMXRC2_SPACE_INFO:
  – The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers (see the sketch below)
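
A hedged sketch of pointer access through VirtualBase. The struct and its member come from the SDK, but ADMXRC2_OpenCard, ADMXRC2_GetSpaceInfo, ADMXRC2_CloseCard and their exact signatures are assumptions from memory of the ADMXRC2 API and should be checked against the ADM-XRC-II user manual.

    // Assumed API shapes; verify against the ADM-XRC-II user manual.
    #include <admxrc2.h>
    #include <cstdio>

    int main() {
        ADMXRC2_HANDLE card;
        ADMXRC2_SPACE_INFO info;

        if (ADMXRC2_OpenCard(0, &card) != ADMXRC2_SUCCESS) {          // assumption
            std::fprintf(stderr, "cannot open card\n");
            return 1;
        }
        if (ADMXRC2_GetSpaceInfo(card, 0, &info) != ADMXRC2_SUCCESS) { // assumption
            std::fprintf(stderr, "cannot query space 0\n");
            return 1;
        }

        // VirtualBase is the region's address in the application's
        // address space, so it can be dereferenced directly.
        volatile unsigned long* regs = (volatile unsigned long*) info.VirtualBase;
        regs[0] = 0x1;                       // write FPGA register 0
        unsigned long v = regs[1];           // read FPGA register 1
        std::printf("reg[1] = 0x%lx\n", v);

        ADMXRC2_CloseCard(card);
        return 0;
    }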
Mapping to logical space
• All the demo VHDL codes have been written using the names of the standard card signals as inputs and outputs
• This approach makes the VHDL code card-dependent
FPGA: Next step
• There exists another approach that uses the ADMXRC2_Read and ADMXRC2_Write API calls
• See which of the two approaches is more useful and work with it
• Study the DMA code of Parikshit Patidar (work on a Hardware Accelerator for Ray Tracing)
References
• ADM-XRC-II user manual
• www.forums.xilinx.com
• www.fpga-faq.org
Thank you