FFT Accelerator Project
Rohit Prakash (2003CS10186)
Anand Silodia (2003CS50210)
September 27th, 2007

Overview
• Multiprocessor implementation
  – Problems faced
  – Solutions
  – Results
• FPGA I/O
  – Work done
  – Problems faced
  – Possible solutions

Multiprocessor FFT: Problems
• The previous code worked for some inputs, but not all
• The processes appeared to communicate correctly, yet the program remained error-prone
• Lots of segmentation faults (even after the results had been produced)
  – A serial debugger does not work on the parallel runs
  – Commercial parallel debuggers are available, but their evaluation licenses are restricted to a single IP and 30 days

Suggested solutions (lam-mpi / Google Groups)
• "The execution environment does not match the compile environment"
• The same code worked with MPICH version 2 and GCC
• The complex datatype is NOT supported in the C bindings (although MPI_2COMPLEX seemed to work for me)
• Finally rewrote the code in C++ using complex<float> and MPI::COMPLEX; this worked (a minimal code sketch follows the timing-model slides below)

System info (identical for all machines)
• Machine 1: Saveri
• Machine 2: Abhogi
• Machine 3: Sahana
• Machine 4: Jaunpuri
• Sysinfo:
  – Intel Pentium 4, 3.4 GHz
  – Cache size: 2048 KB
  – RAM: 1 GB
  – Operating system: Fedora Core 6
  – Compiler: mpic++
  – Flags: -O3 -march=pentium4
  – FFT: radix-2

Theoretical execution time
• For p processors, the total execution time is
  T_N/p + (1 - 1/p)(2N/B + K_N)
  where:
  – p is a power of 2
  – T_N is the time taken to compute the FFT of an input of size N
  – K_N is the time taken to combine two N-point FFTs
  – B is the network bandwidth (bytes/sec)

Nature of this function
• It is the sum of two terms:
  – T_N/p, the computation term, which shrinks as p grows
  – (1 - 1/p)(2N/B + K_N), the communication-and-combine term, which grows with p
• When T_N/p dominates, adding processors gives speedup
• When (1 - 1/p)(2N/B + K_N) dominates, adding processors slows the run down
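Sketch: complex<float> with MPI::COMPLEX
A minimal sketch of the datatype fix from the "Suggested solutions" slide, assuming the MPI-2 C++ bindings that mpic++ exposed at the time. It is an illustrative two-rank exchange, not the project's FFT code; the buffer size and tag are arbitrary. Run with at least two ranks (e.g. mpirun -np 2).

#include <mpi.h>
#include <complex>
#include <vector>

int main(int argc, char** argv) {
    MPI::Init(argc, argv);
    const int rank = MPI::COMM_WORLD.Get_rank();

    const int N = 8;                                // illustrative size
    std::vector< std::complex<float> > buf(N);

    if (rank == 0) {
        for (int i = 0; i < N; ++i)
            buf[i] = std::complex<float>(float(i), -float(i));
        // MPI::COMPLEX is the datatype we paired with std::complex<float>
        MPI::COMM_WORLD.Send(&buf[0], N, MPI::COMPLEX, /*dest=*/1, /*tag=*/0);
    } else if (rank == 1) {
        MPI::COMM_WORLD.Recv(&buf[0], N, MPI::COMPLEX, /*src=*/0, /*tag=*/0);
    }

    MPI::Finalize();
    return 0;
}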
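Worked example: the timing model
To make the trade-off between the two terms concrete, the snippet below evaluates T_N/p + (1 - 1/p)(2N/B + K_N) for a few processor counts. Every parameter value here is hypothetical, chosen only for illustration; these are not measurements from our runs.

#include <cstdio>

// Predicted time from the slide's model: T_N/p + (1 - 1/p)(2N/B + K_N)
double predicted(double p, double T_N, double N, double B, double K_N) {
    return T_N / p + (1.0 - 1.0 / p) * (2.0 * N / B + K_N);
}

int main() {
    const double T_N = 8.0;        // hypothetical 1-processor FFT time (s)
    const double N   = 33554432;   // input size (points)
    const double B   = 1.0e7;      // hypothetical bandwidth (bytes/s)
    const double K_N = 1.0;        // hypothetical combine time (s)
    for (int p = 1; p <= 8; p *= 2)
        std::printf("p = %d: %.2f s\n", p, predicted(p, T_N, N, B, K_N));
    return 0;
}

With these made-up numbers the communication term (2N/B + K_N, about 7.7 s) dominates, so adding processors barely helps: the "below breakeven" regime described on the inference slide.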
Input: 8388608 (real time vs. #processors)
• [Chart] real time: 9.12 s on 1 processor, 6.73 s on 2, 6.44 s on 4

Input: 8388608, #processors = 4
• [Pie chart] communication 15%, execution 85%

Input: 8388608, #processors = 2
• [Pie chart] communication 41%, execution 59%

Input: 16777216 (real time vs. #processors)
• [Chart] real time: 18.05 s on 1 processor, 13.69 s on 2, 13.48 s on 4

Input: 16777216, #processors = 4
• [Pie chart] communication 16%, execution 84%

Input: 16777216, #processors = 2
• [Pie chart] communication 31%, execution 69%

Input: 33554432 (real time vs. #processors)
• [Chart] real time: 83.17 s on 1 processor, 57.56 s on 2, 51.1 s on 4

Input: 33554432, #processors = 4
• [Pie chart] communication 44%, execution 56%

Input: 33554432, #processors = 2
• [Pie chart] communication 43%, execution 57%

Input: 67108864 (real time vs. #processors)
• [Chart] real time (s, 0–3000 scale) vs. #processors for p = 1, 2, 4

Input: 67108864, #processors = 4
• [Pie chart] communication 12%, execution 88%

Input: 67108864, #processors = 2
• [Pie chart] communication 11%, execution 89%

Inference
• An input size of 33554432 is roughly the breakeven point (beyond it we start getting speedup)
• Below this point:
  – the execution time increases as the number of processors increases
  – the percentage of time spent in communication decreases as the number of processors increases
• Above this point:
  – the execution time decreases as the number of processors increases
  – the percentage of time spent in communication increases as the number of processors decreases

Possible sources of error
• We measure real (wall-clock) time, which is affected by the load on each processor
• Network communication latency affects the time taken to establish a synchronous handshake
• The pipeline is actually not "perfect"

4-processor pipelined layout
• [Diagram: per-processor timelines for P1–P4 showing the Send/Recv, FFT(N/4), and pairwise Combine stages, with the cost of each stage annotated (N/2B, N/4B, T_N/4, K_N/4, N/2B, K_N/2)]
• The time taken by these stages can overrun the stage boundaries

Further work
• Rewrite the code with the new datatype in C
• Optimize the code
• Try with more processors?
• Analyze using profilers?

FPGA: PCI I/O
• Built and ran the admxrc2 demos
• Studied the wrapper and VHDL code
• Struct ADMXRC2_SPACE_INFO:
  – The VirtualBase member is the address, in the application's address space, by which the region may be accessed using pointers

Mapping to logical space
• All the demo VHDL code is written using the names of the standard card signals as inputs and outputs
• This approach makes the VHDL code card-dependent

FPGA: Next step
• There is another approach that uses the ADMXRC2_Read and ADMXRC2_Write API calls (hedged sketches of both approaches follow below)
• Decide which of the two approaches is more useful and work with it
• Look at the DMA code of Parikshit Patidar (from his work on a hardware accelerator for ray tracing)
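Sketch: register access via VirtualBase
A hedged sketch of the pointer-mapped approach from the "FPGA: PCI I/O" slide. ADMXRC2_SPACE_INFO and its VirtualBase member are as documented; the exact signatures of the open/query calls, the card and space indices, and the register offsets below are assumptions to be checked against the ADM-XRC-II user manual and admxrc2.h.

#include <stdint.h>
#include "admxrc2.h"    // Alpha Data SDK header (path may differ)

int main() {
    ADMXRC2_HANDLE card;
    ADMXRC2_SPACE_INFO info;

    // Open card 0 and query space 0; both index choices are placeholders.
    if (ADMXRC2_OpenCard(0, &card) != ADMXRC2_SUCCESS) return 1;
    if (ADMXRC2_GetSpaceInfo(card, 0, &info) != ADMXRC2_SUCCESS) return 1;

    // VirtualBase is the region's address in the application's address
    // space, so FPGA registers can be accessed through ordinary pointers.
    volatile uint32_t* regs = (volatile uint32_t*) info.VirtualBase;
    uint32_t value = regs[0];    // read the register at byte offset 0x0
    regs[1] = value;             // write the register at byte offset 0x4

    ADMXRC2_CloseCard(card);
    return 0;
}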
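Sketch: ADMXRC2_Read / ADMXRC2_Write
And the alternative mentioned under "FPGA: Next step". Only the function names ADMXRC2_Read and ADMXRC2_Write come from our notes; the parameter shape used here (card, space, flags, byte offset, buffer, length) is a guess and must be verified against admxrc2.h before use. toggle_bit0 is a hypothetical helper, reusing a card handle opened as in the previous sketch.

#include <stdint.h>
#include "admxrc2.h"

// Flip bit 0 of the register at byte offset 0x0 in space 0.
// Parameter order is assumed; check the SDK header.
bool toggle_bit0(ADMXRC2_HANDLE card) {
    uint32_t value = 0;
    if (ADMXRC2_Read(card, 0, 0, 0x0, &value, sizeof(value)) != ADMXRC2_SUCCESS)
        return false;
    value ^= 0x1;    // flip bit 0 and write it back
    return ADMXRC2_Write(card, 0, 0, 0x0, &value, sizeof(value)) == ADMXRC2_SUCCESS;
}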
References
• ADM-XRC-II user manual
• www.forums.xilinx.com
• www.fpga-faq.org

Thank you