An Adaptive FPGA Implementation of a Multi-Core K-Nearest Neighbour (KNN) Classifier Using Dynamic Partial Reconfiguration

Paper authors:
Hanaa M. Hussain, Khaled Benkrid
School of Engineering, Edinburgh University, Edinburgh, Scotland, U.K.
{h.hussain, k.benkrid}@ed.ac.uk

Huseyin Seker
Bio-Health Informatics Research Group, De Montfort University, Leicester, England, U.K.
[email protected]

Presented by: Dunia Jamma, PhD Student
School of Engineering, University of Guelph
Course Instructor: Prof. Shawki Ariebi
Outline
• Introduction
• Background of KNN
• KNN and FPGA
• The proposed architectures
• Dynamic Partial Reconfiguration (DPR) part
• The achievements
• Advantages and Disadvantages
• Conclusion
Introduction
• K-nearest neighbour (KNN) is a supervised classification technique.
• Applications of KNN include data mining and image processing (e.g., of satellite and medical images).
• KNN is known to be robust and simple to implement when dealing with small data sets.
• KNN performs slowly when the data are large and high-dimensional.
• The KNN classifier is sensitive to the parameter K, the number of nearest neighbours.
• The label of a new query is selected by a vote among those K nearest points.
[Figure: classification of the same query using 1-Nearest Neighbour vs. 3-Nearest Neighbour.]
KNN Distance Methods
• To calculate the distance between a new query and the stored training points, the Manhattan distance is used:

  $d(X, Y) = \sum_{i=1}^{K} |X_i - Y_i|$

  where $X_i$ is the i-th element of the new query's vector, $Y_i$ is the i-th element of the trained sample's vector, and $K$ is the number of samples.
• The Manhattan distance is chosen in this work for its simplicity and lower hardware cost compared to the Euclidean distance.
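As a quick worked example with made-up numbers (illustrative only, not taken from the slides), take $K = 3$:

\[
X = (3, 7, 2),\quad Y = (5, 4, 2) \;\Rightarrow\; d(X, Y) = |3-5| + |7-4| + |2-2| = 2 + 3 + 0 = 5
\]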
KNN and FPGA
• KNN classifiers can benefit from the parallelism offered by FPGAs.
• Distance computation is the time-consuming part, so it is the part that is parallelized.
• The authors propose two adaptive FPGA architectures (A1 and A2) of the KNN classifier, and compare each of them against an equivalent implementation running on a general purpose processor (GPP).
• They also propose a novel dynamic partial reconfiguration (DPR) architecture of the KNN classifier for the parameter K.
Used tools
• Hardware implementation:
  • Targeted the ML403 platform board, which carries a Xilinx XC4VFX12 FPGA chip.
  • JTAG cable for configuration.
  • Xilinx PlanAhead 12.2 tool along with the Xilinx partial reconfiguration (DPR) flow.
  • Verilog used as the hardware description language.
• Software implementation:
  • Matlab (R2009b) Bioinformatics Toolbox.
  • Workstation with an Intel Pentium Dual-Core E5300 running at 2.60 GHz and 3 GB of RAM.
The used data
• The trained data Y is a matrix of N training vectors, each consisting of M samples, with a class label L attached to each vector; the new query X is a single vector of M samples.

Factors:
  M: number of training samples (per vector)
  N: number of training vectors
  L: class label
  Y: trained data
  X: new query
The proposed architectures
• The KNN classifier has been divided into three modular blocks (distance computation, KNN finder, and query label finder) plus FIFO memory.
• A1 architecture: M Dist PEs and K KNN PEs, giving a total PE count of M + K + 1 (including the label PE).
• A2 architecture: N Dist PEs and N KNN PEs, giving a total PE count of 2N + 1.
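As a hypothetical sizing example (the numbers are assumed for illustration, not taken from the paper), for M = 9 samples per vector, N = 30 training vectors, and K = 3:

\[
\text{A1: } M + K + 1 = 9 + 3 + 1 = 13\ \text{PEs}, \qquad \text{A2: } 2N + 1 = 2(30) + 1 = 61\ \text{PEs}
\]

This illustrates why A1 suits problems where N >> M, while A2's PE count grows with N.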
The functionality of PEs
[Figure: PE functionality. The Dist PE adds |Xi - Yi| to the previous accumulative distance; the KNN PE compares two distance/label pairs (Dist1, L1) and (Dist2, L2) and separates the Min and the Max.]
Distance computation
• The distance computations are made in parallel, every clock cycle.
• The latency of a Dist PE is M cycles.
• A1: the throughput is one distance result every clock cycle.
• A2: the throughput is one distance result every M clock cycles.
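Putting the two A1 numbers together (my arithmetic, following directly from the bullets above): after the M-cycle pipeline fill, one distance emerges per cycle, so all N distances are available after roughly

\[
T_{A1} \approx M + (N - 1)\ \text{cycles,}
\]

which is consistent with the KNN finder slide below stating that its work completes after M + N clock cycles.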
Dist PE inner architecture
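The slide's inner-architecture figure did not survive extraction, but a minimal Verilog sketch of what such a Dist PE could look like follows (module, port, and parameter names are my assumptions, not the paper's). It adds one |Xi - Yi| term per clock cycle onto the previous accumulative distance, implementing the Manhattan distance above:

```verilog
// Minimal sketch of a Dist PE (illustrative only; names and widths
// are assumptions, not taken from the paper).
module dist_pe #(
    parameter B  = 8,    // sample wordlength
    parameter DW = 16    // accumulated-distance wordlength
) (
    input               clk,
    input               clr,       // clear the accumulator for a new query
    input      [B-1:0]  x_i,       // query element Xi (held in a register)
    input      [B-1:0]  y_i,       // training element Yi (streamed from the FIFO)
    output reg [DW-1:0] dist_acc   // previous accumulative distance + |Xi - Yi|
);
    // absolute difference without signed arithmetic
    wire [B-1:0] abs_diff = (x_i > y_i) ? (x_i - y_i) : (y_i - x_i);

    always @(posedge clk) begin
        if (clr)
            dist_acc <= {DW{1'b0}};
        else
            dist_acc <= dist_acc + abs_diff;
    end
endmodule
```

After M cycles the accumulator holds the full distance for one training vector, matching the M-cycle Dist PE latency quoted on the previous slide.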
K-Nearest Neighbour Finder
• This block becomes active after M clock cycles.
• Its function is completed after M + N clock cycles.
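A chain of K compare-exchange stages can retain the K smallest distances seen so far: each stage holds the Min and forwards the Max, as in the Min/Max PE figure earlier. A minimal Verilog sketch of one such stage (port names, widths, and the valid handshake are my assumptions, not the paper's):

```verilog
// Illustrative sketch of one KNN-finder PE (compare-exchange stage).
module knn_pe #(
    parameter DW = 16,   // distance wordlength
    parameter LW = 4     // label wordlength
) (
    input               clk,
    input               rst,        // clear the stored distance to "infinity"
    input               valid_in,   // a new distance/label pair arrives
    input      [DW-1:0] dist_in,
    input      [LW-1:0] label_in,
    output reg          valid_out,  // forwarded (larger) pair to the next PE
    output reg [DW-1:0] dist_out,
    output reg [LW-1:0] label_out
);
    reg [DW-1:0] dist_q;   // currently held distance (one of the K nearest)
    reg [LW-1:0] label_q;

    always @(posedge clk) begin
        if (rst) begin
            // all-ones acts as +infinity (assumes real distances are smaller)
            dist_q    <= {DW{1'b1}};
            valid_out <= 1'b0;
        end else begin
            valid_out <= valid_in;
            if (valid_in) begin
                if (dist_in < dist_q) begin
                    // keep the new (smaller) pair, evict the stored one
                    dist_out  <= dist_q;
                    label_out <= label_q;
                    dist_q    <= dist_in;
                    label_q   <= label_in;
                end else begin
                    // pass the new (larger) pair downstream unchanged
                    dist_out  <= dist_in;
                    label_out <= label_in;
                end
            end
        end
    end
endmodule
```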
Dynamic Partial Reconfiguration part (DPR)
• The value of the parameter K is dynamically reconfigured, while N, M, B, and C remain fixed for a given classification problem.
• Two cores (A1):
  • Distance computation core - static.
  • KNN core (KNN PE, label PE) - dynamic.
• The size of the reconfigurable partition (RP) is made large enough to accommodate the logic resources required by the largest K.
• Advantages: savings in reconfiguration time and power.
• Difficulties: resource limitations, cost, and verifying the interfaces between the static region and the RP for all reconfigurable modules (RMs).
The achievement
• This DPR implementation offers a 5x speed-up in the reconfiguration time of a KNN classifier on FPGA.
Advantages
• Flexibility, which allows the user to select the most appropriate architecture for the targeted application (available resources, performance, cost).
• Enhancement in performance:
  • Parallelism - speed-up.
  • DPR - reconfiguration time.
• Efficiency in terms of KNN performance - the DPR for K.
• Use of the Manhattan distance (simplicity and lower cost).
Disadvantages
• The amount of resources used.
• The modest speed-up achieved by the DPR part (5x) compared to the amount of resources and effort it requires.
• Area constraints in the A2 architecture and the DPR.
• The latency due to the pipelined manner of producing the results.
Conclusion
• An efficient design for different KNN classifier applications.
• Two architectures, A1 and A2, from which the user can choose.
• A1 can be used to target applications where N >> M, whereas A2 targets applications where N << M.
• The DPR part (could be improved by using ICAP).
• Achievements compared to a GPP:
  • Speed-up of 76x for A1 and 68x for A2.
  • Speed-up of 5x in reconfiguration time with DPR.
Any questions?
Extra Slides
Memory
• Each FIFO is associated with one distance PE.
• The query vector's elements are streamed to the PEs and stored in registers, since they are required every clock cycle.
• Parameters: B is the sample wordlength, M is the number of samples, and N is the number of training vectors.
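The slide defines B, M, and N but the sizing formula itself did not survive extraction. A plausible reconstruction (my assumption, not recovered from the paper) is that the FIFOs together store the full training set:

\[
\text{Memory} = M \times N \times B\ \text{bits}
\]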
Class Label Finder
• The block consists mainly of C counters, each associated with one of the class labels.
• The hardware resources depend on the user-defined parameters K and C.
• The architecture of this block is identical in both A1 and A2.
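A minimal Verilog sketch of such a label finder (majority vote over the K neighbours' labels; module, port, and parameter names are my assumptions, not the paper's, and labels are assumed to arrive already restricted to 0..C-1):

```verilog
// Illustrative sketch of the class-label finder: C counters tally the
// labels of the K nearest neighbours; the label with the highest count
// becomes the query's class.
module label_finder #(
    parameter C  = 4,               // number of classes
    parameter K  = 8,               // number of neighbours voting
    parameter LW = 2                // label wordlength (>= log2(C))
) (
    input              clk,
    input              rst,          // clear the counters for a new query
    input              label_valid,  // one of the K labels arrives
    input  [LW-1:0]    label_in,
    output reg [LW-1:0] class_out    // current majority label
);
    integer i;
    reg [$clog2(K+1)-1:0] count [0:C-1];
    reg [$clog2(K+1)-1:0] best;

    // sequential part: one counter increment per incoming label
    always @(posedge clk) begin
        if (rst) begin
            for (i = 0; i < C; i = i + 1) count[i] <= 0;
        end else if (label_valid) begin
            count[label_in] <= count[label_in] + 1;
        end
    end

    // combinational scan for the highest counter (cheap because C is small,
    // mirroring the slide's point that resources depend on K and C)
    always @* begin
        best      = 0;
        class_out = 0;
        for (i = 0; i < C; i = i + 1)
            if (count[i] > best) begin
                best      = count[i];
                class_out = i[LW-1:0];
            end
    end
endmodule
```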
A2 Architecture
• N FIFOs are used to store the training set, each of them having a depth of M.
• The class labels get streamed in and stored in registers within the distance PEs.
• A2 requires more CLB slices than A1 when N, M, and K are the same.
• The first distance result becomes ready only after all samples are processed, i.e., after M clock cycles.
DPR for K
• Maximum bandwidth of JTAG: 66 Mbps.
• Maximum bandwidth of ICAP: 3.2 Gbps.
• ICAP is roughly 48x faster than JTAG.
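The ~48x figure follows directly from the two bandwidths:

\[
\frac{3.2\ \text{Gbps}}{66\ \text{Mbps}} = \frac{3200\ \text{Mbps}}{66\ \text{Mbps}} \approx 48.5
\]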
Dynamic Partial Reconfiguration part (DPR)
• The JTAG interface was used (BW = 66 Mbps).
• Using ICAP instead would decrease the reconfiguration time (BW = 3.2 Gbps).