K-means for parallel architectures using
all-prefix-sum sorting and updating steps
SUPPLEMENTAL MATERIAL
Helper kernels
Data reduction kernel
The data reduction kernel takes an array of integer or floating-point numbers in shared memory and returns the total
sum over its values at array position 0. Reduction of these values is done collaboratively by several parallel threads.
We implemented a variant of the reduction algorithm described in [11]. The reduction algorithm in that paper
divides a data segment of size E into subsegments of size 2 * tpb. Each subsegment is then reduced collaboratively
by a block of threads. Starting with tpb active threads at the first iteration, each thread sums up two numbers,
effectively halving the number of elements in the array. The number of active threads is halved as well and the
process repeated until the last active thread forms the final sum. Results per block are stored in an array that is then
further reduced in the same fashion by further kernel calls, as required. Due to the increase in idle threads per iteration, the average number of active threads per block for this technique is below tpb / 3, and log2(tpb) + 1 iterations are required to reduce a data segment of size 2 * tpb.
In contrast, our variation to this algorithm does an initial reduction by moving over the E elements in the array
with a step size of tpb. Each thread reads one of the tpb elements and adds it to a temporary storage array of size tpb
in shared memory. After E / tpb iterations, we have thus reduced the E elements to tpb partial sums. Only these tpb partial sums are then reduced according to the algorithm described in the previous paragraph, using log2(tpb) iterations.
Since we use all threads in a block during the initial reduction, this gives us a close to 100% thread utilization as
long as E is sufficiently large.
Since the final step of our reduction always operates on a fixed number of tpb data elements, we implemented a template
__device__ kernel reduceOne for use by calls from within other kernels. The code (cf. [11]) is reproduced in Listing
SM1. Code listings for both pseudo code and actual CUDA source code are provided as supplemental material. For
efficiency, the loop in this reduction kernel was completely unrolled.
We also implemented an expansion of this kernel, called reduceTwo, which performs the same function, but on
two shared-memory arrays.
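For illustration, the initial strided pass described above can be expressed as a small __device__ helper that accumulates into a shared array of tpb partial sums and then hands these to reduceOne. The sketch below is hypothetical in everything except the reduceOne template from Listing SM1; the function and array names are placeholders.

template<unsigned int blockSize>
__device__ static float blockSum(int tid, const float *input, int E, float *s_partial)
{
    // each of the blockSize (= tpb) threads strides over the E input elements
    // with step blockSize, accumulating one partial sum per thread
    s_partial[tid] = 0.0f;
    for (int i = tid; i < E; i += blockSize)
        s_partial[tid] += input[i];
    __syncthreads();
    // collaboratively reduce the blockSize partial sums (Listing SM1)
    reduceOne<blockSize>(tid, s_partial);
    __syncthreads();
    return s_partial[0];   // total sum now resides at array position 0
}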
All-prefix-sum kernel
As a key element of the sorting and updating stages we implemented parallel prefix-sum introduced in [12] and
presented as a CUDA implementation in [13]. The latter was modified and simplified to increase efficiency for our
purposes. This kernel type takes a shared memory array of integer or floating point values. The first part of the
algorithm is identical to the data reduction kernel, but the partial sums that result in all locations of the output array
are used for computing the final running sum.
To understand the kernel, we will employ the familiar tree concept. The algorithm behind the data reduction
kernel can be thought of as a balanced binary tree with tpb leaves. After the reduction, each parent node contains the
sum of its two child nodes, and the value in each node is equal to the sum of the values of all leaf nodes beneath it.
As we ascend from the leaves, the number of nodes at each level is half that of the previous level. To save space, we
do the summation in-place, and therefore overwrite half of the locations in our array with new partial sums at each
step. This is to say, we preserve all values for the right children, but overwrite those for the left children (note that
given the properties described above, it is always possible to reconstruct the value of the left children by subtracting
the right child from its parent node). When the reduction part has completed, we make a copy of the final sum, then
reuse the right children to construct the running sum.
The final array does not just contain the final sum, but also the sums for all right subtrees in the tree. This
includes the original values for all leaves that are the right children of their parent nodes. Consider now that we carry
out a depth-first search through the tree and form a sequence of all leaves in the order in which they are visited. To
obtain the running sum, we want each leaf to contain the sum of all leaves that follow it in this sequence. This is
equivalent to saying that each leaf should contain the sum over all right subtrees that are encountered on the path
from root to leaf.
Since we have all the sums of the right subtrees readily available, this only requires one more pass through the
tree, this time descending from root to leaf. The root starts by copying the sum for its right subtree to its left child
and then resets the right child to 0 (since there are no right subtrees on the path from root to its right child). While
descending further, each of the other parent nodes now propagates its own value (i.e. the sum over all previously
encountered right subtrees) to its children and adds the sum for its own right subtree to its left child. To avoid having
to treat the root as a special case, it is sufficient to initially set its value to 0 (equivalent to the situation that
previously encountered right subtrees sum to 0).
We propagate downwards according to the following rules:
1) start at the 0-initialized root
2) store the value of the parent
3) add the value of the right child to the left child
4) assign the stored value of the parent to the right child
5) continue from 2) for each child node that is not a leaf
Together with the previously saved total sum, this gives us the running sum over all elements, starting at 0.
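To make the two phases concrete, the following sequential reference (a hypothetical host-side check, not one of our kernels) applies the same reduction and propagation rules to an ordinary array whose length is a power of two. It returns the total sum and leaves the exclusive running sums in the array, in the same permuted order as the parallel kernel.

static int prefixSumReference(int *data, int n)   // n must be a power of two
{
    // reduction: stride halves, each parent overwrites its left child
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 0; i < stride; i++)
            data[i] += data[i + stride];

    int sum = data[0];   // the total sum sits at the root
    data[0] = 0;         // clear the root before descending

    // propagation: stride doubles, right-subtree sums are pushed down
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i < stride; i++) {
            int tmp = data[i];
            data[i] += data[i + stride];
            data[i + stride] = tmp;
        }
    return sum;
}

For the four-value array of Fig. SM1 this turns [v1, v2, v3, v4] into [v2+v3+v4, v4, v2+v4, 0] and returns v1+v2+v3+v4, matching the figure.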
The source code for this second part closely resembles the data reduction kernel, but in reversed order. Starting
with just one thread, the number of active threads is doubled during each iteration. The complete kernel is shown in
Listing SM2 and an example is given in Fig SM1. To maintain efficient shared-memory accesses, we make sure that
threads always read and write from adjacent memory locations.
The strategy of always accessing adjacent memory locations by the threads in a warp guarantees bank conflict-free shared memory accesses for all 32-bit data types, such as floats and ints, on CUDA. For 64-bit data types, such as doubles, 2-way bank conflicts occur.
[Fig. SM1 graphic: a four-element array shown at six time points, with the two-step reduction phase on the left (ascending) and the propagation phase on the right (descending); see caption below.]
Fig. SM1: The example shows a four-value numeric array at six different time points of all-prefix-sum, progressing clockwise beginning at the lower left. Starting with the initial array (bottom left), the array first undergoes a two-step reduction (left side, ascending). This is schematically indicated by a binary tree structure mapped on top of the array, where two child nodes are connected to a parent node (note that parent node and left child share an array location, with the parent node overwriting the child node after the child's value has been read). Array location 1 is overwritten once and location 0 twice. After clearing location 0 (top right), the sums over the right subtrees are then propagated downwards (note that the right child of the root is stored at location 1 and has the value v2 + v4 before we start descending; note further that parent node and left child again share an array location, with the left child now overwriting the parent node once the parent's value has been read). The final array contains the running sum (bottom right). Each thread is responsible for the two children of a node. The lines in the lower level of the tree overlap. This is because, as a design principle, we require threads to access adjacent array locations to avoid bank conflicts. For example, two threads first concurrently read v1 and v2 from the initial array, and then v3 and v4, before writing the partial sums to array locations 0 and 1.
We note that while we obtain all elements of the running sum, they are out of order, or unsorted. This is not a
problem, because, as will be explained, we use the running sum only to compute consecutive indices for global
memory arrays where the order does not matter as long as the accesses are coalesced.
Data compaction kernel
The data compaction kernel serves to pick out a number of scattered elements from an input array in global memory
and write them to consecutive output locations in a separate array, also in global device memory. Our parallel data
compaction uses all-prefix-sum to determine T consecutive storage positions for a subset of T out of tpb threads. An
example of this is presented in Fig. SM2. The algorithm works as follows: Each thread t that requires a storage
position sets a specific location in a 0-initialized shared memory array to 1. After running the all-prefix-sum kernel
over the array, all these locations that had a 1 now each have a different integer value between 0 and T-1.
The last element of the running sum is T. This can be used as an offset, so when we compute the next set of
indices, it is possible to append to the storage adjacent to the last written value. An offset counter can keep track of
the sum of all T’s from previous iterations, so that the next empty position is always known. This way, it is possible
to iterate over a large array, collect selected values from windows of size tpb, store them in consecutive positions of
a second array, and then move the offset forward. Rather than reading and writing input at each iteration (which
would lead to idle threads), we use this method to buffer at least tpb array indices in shared memory and then carry
out global memory accesses for a large number of data points at once.
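For illustration, one such window can be sketched as a __device__ helper built on the parallelPrefixSum function from Listing SM2. Apart from that function, all names below are hypothetical; the sketch assumes a shared index array FLAGS of 2 * blockSize elements (the size parallelPrefixSum operates on) and a buffer OUTIDX that collects the gathered indices.

template<unsigned int blockSize>
__device__ static int compactWindow(int tid, int k, int pos, int N,
                                    const int *ASSIGN, int *FLAGS,
                                    int *OUTIDX, int offset)
{
    // clear both halves of the scan array, then flag this thread's request
    FLAGS[tid] = 0;
    FLAGS[tid + blockSize] = 0;
    int want = (pos < N && ASSIGN[pos] == k);
    if (want) FLAGS[tid] = 1;
    __syncthreads();

    // the scan returns the number of requests T and leaves a distinct
    // value in [0, T-1] at every position that held a 1
    int T = parallelPrefixSum<blockSize>(tid, FLAGS);

    // each requesting thread buffers the index of its data point in its slot
    if (want) OUTIDX[offset + FLAGS[tid]] = pos;
    __syncthreads();
    return T;
}

A caller iterating over the assignment array would pass pos = window * blockDim.x + tid and advance its offset counter by the returned T before processing the next window.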
[Fig. SM2 graphic: an assignment array read by eight threads, the index array before and after the all-prefix-sum kernel, the input array d0 to d7, the compacted output d1, d2, d4, d6, and the total T = 4; see caption below.]
Fig. SM2: A simple example with eight threads demonstrating the principle behind the data compaction kernel. The schema progresses from top to bottom as follows: A number of threads (here 8) read assignments of data points to clusters from an array. The task is to gather all those data points that are assigned to cluster 4. Each thread that encounters a corresponding assignment (a 4) writes a '1' to an index array. Next, the index array is passed through the all-prefix-sum kernel, which computes the running sum. Note that the sum is incremented by one immediately after it encounters an array position that is set to 1. The kernel also outputs the final sum T, which equals the number of threads that have requested an index - a nontrivial task in a parallel environment. Only the requesting threads then read a number from the index array and use that to copy data points di from the input array to the output array, where i is the thread number; 1, 2, 4, or 6 in this case. The value of T tells us that the next empty position in the output array is 4.
Distance metric kernel
To provide flexibility and improve the usefulness of the method, we have made the K-means implementation
compatible with a variety of distance metrics.
As a key element of our design strategy for efficient processing of data vectors that have too many dimensions to
fit into shared memory, distance metrics were separated into two kernels: First, a kernel distanceComponent that
computes the contribution to the distance along a single dimension, and second, a kernel distanceFinalize that
completes the metric across dimensions. For example, for Euclidean distance, distanceComponent computes the
squared distance between the coordinates of two vectors in a given dimension. Looping over all dimensions, results
can be stored in a vector of length D and processed by distanceFinalize to sum up the components before taking the
square root.
Alternatively, the contributions from distanceComponent can already be accumulated in a single shared memory
location while looping over all dimensions before calling distanceFinalize to take the square root of this preprocessed ‘vector’ of length 1. This makes it possible to have tpb threads computing distances for tpb data points
using a fixed number of tpb memory locations.
A similar strategy is used in [4], but unlike in our approach, data is assigned to thread blocks in chunks and all centroids are reloaded for each chunk. We only load each centroid once per block of threads.
This procedure works for a wide range of distance metrics, including Euclidean, Manhattan, and Chebyshev. For
our performance measurements, we used a squared Euclidean distance for which distanceFinalize does not take the
square root.
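As an illustration of this interface, the squared Euclidean variant used for our measurements might look as follows. This is a minimal sketch matching the call sites in Listings SM3 and SM4, not necessarily the exact implementation.

__device__ static float distanceComponent(const float *a, const float *b)
{
    float diff = *a - *b;      // contribution of a single dimension
    return diff * diff;
}

__device__ static float distanceFinalize(int D, const float *components)
{
    float sum = 0.0f;
    for (int d = 0; d < D; d++) sum += components[d];   // combine per-dimension terms
    return sum;                // squared Euclidean: the square root is omitted
}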
APPENDIX A: FURTHER IMPROVEMENTS TO PERFORMANCE AND MEMORY USAGE USING TRIANGLE
INEQUALITY AND PAGING.
The performance-limiting stage of Kps-means, assigning data to clusters, can be further accelerated for both CPU and GPU code by use of the triangle inequality. For any data point at distance d from its current centroid C, only centroids closer to C than 2d need to be checked, since any centroid at distance 2d or more from C cannot be closer to the point than C. Using it requires precomputing a distance matrix between centroids, which can be included relatively easily by appropriate modification of the kernels.
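A possible form of this pruning test is sketched below. It is not part of the current kernels, and the CTRDIST matrix (precomputed inter-centroid distances in row-major layout) is an assumption.

// true if centroid j cannot be closer to the data point than its current
// centroid c, which lies at distance bestDist from the point
__device__ static bool canSkipCentroid(const float *CTRDIST, int K,
                                       int c, int j, float bestDist)
{
    // triangle inequality: dist(point, j) >= dist(c, j) - dist(point, c)
    return CTRDIST[c * K + j] >= 2.0f * bestDist;
}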
It is also possible to extend the method to data sets of size beyond the physical memory limit. The data set size is
currently limited by the available global device memory, e.g. 4 GiB for a Tesla T10. We found that moving one GiB
of data from host to device for this architecture requires ca. 770 ms. For paging, one needs to take these costs into
account.
Sorting and updating can be carried out on individual pages, which then also benefit from the improved time
complexity of O(D*N). We would expect that no data transfer between pages would be required if the initial
distribution of data points is sufficiently random. Sorting within pages then leads to large enough segments to
provide extensive load for all blocks of threads.
All four stages, computation of centroids, assigning data points to centroids, sorting/updating, and computing the
score, have to access each data point and thus have to visit each page. Leaving the last page in global memory
between stages would require a constant amount of time per iteration tp for exchanging pages that can be estimated
by
tp = [(c1 + c2) * (p - 1) + c2] * xp * 770 ms
where c1 is the number of active stages 1, 2, and 4 (i.e., 3 or 2 when including or excluding score computation,
respectively), c2 is 1 if sorting is active (which requires transferring modified pages from device back to host) and 0
otherwise, p the number of pages occupied by the clustering data, and xp the size of each page in GiB.
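For illustration, with score computation and sorting both active (c1 = 3, c2 = 1) and p = 4 pages of xp = 1 GiB each, this estimate gives tp = [(3 + 1) * 3 + 1] * 770 ms, i.e. roughly 10 s of transfer time per iteration.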
[Fig. SM3 graphic: the data array (dimensions d = 0 to D-1, columns 0 to N-1), the assignment array, a shared-memory buffer of gathered indices handled by threads t0 to t3, and the sorted buffer array with per-cluster segment offsets (segment sizes 3, 6, 4, 3 for clusters k = 0 to 3); see caption below.]
Fig. SM3: The simplified schematic shows how data is copied from the unsorted data array to the sorted buffer array, both in global memory. The example is for the case N=16, K=4, an arbitrary D, and four threads per block (tpb). Sorting is demonstrated for the block with k=1; three more blocks gather data for the other segments in parallel. The number of assignments for each cluster has already been counted, i.e. the segment size is known, and has been used to determine the segment offsets. Furthermore, the assignment array has been traversed using data compaction and the first tpb indices for data assigned to cluster k=1 have been written to a shared memory array. Each thread in the block reads one of the scattered elements and writes it to an adjacent location in the buffer array that has been determined using the all-prefix-sum kernel. The block of threads then shifts along the data array by N locations to copy the next component of the data points to the sorted buffer using the known indices. This is repeated for each dimension (allowing for arbitrarily large D). Once complete, the block returns to the assignment array to continue looking for the next set of up to tpb indices for remaining data points (allowing for arbitrarily large N). While copying, the threads do not store any intermediate data; the copy operation happens directly from data array to buffer array, and writing is coalesced. While the access pattern is deterministic, it is not necessarily in the order shown in the schematic. Allowing for out-of-order accesses by the threads in a block and no fixed within-segment order of data points improves the running time of the algorithm. Data and buffer array are of size O(D*N), the assignment array of size O(N).
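The per-dimension copy step described above might be sketched as follows; the function and variable names are hypothetical, with s_idx holding the up to tpb gathered indices and writePos the destination of this thread's data point inside its segment of the buffer array (segment offset plus the position obtained from the all-prefix-sum).

__device__ static void copyPoints(int tid, int nIdx, int D, int N,
                                  const int *s_idx, const float *DATA,
                                  float *BUFFER, int writePos)
{
    if (tid < nIdx) {
        int src = s_idx[tid];                // scattered source column in the data array
        for (int d = 0; d < D; d++)          // shift by N locations per dimension
            BUFFER[d * N + writePos] = DATA[d * N + src];   // adjacent, coalesced writes
    }
}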
Table SM1: Comparison of different implementations of K-means. Parameter combinations for which the GPUs start outperforming MatLab (light orange), MatLab and R (light gray), and all of MatLab, R, and the CPU (yellow) are highlighted. Each table cell gives the execution time in seconds followed, in parentheses, by the speedup relative to MatLab.
D     N       MatLab          R               CPU             GPU (Kps-means)    GPU (Kps-means)
                                                              Nvidia GT 9600M    Nvidia Tesla T10
1     500     0.117 (1x)      0.005 (23x)     0.001 (117x)    0.007 (17x)        0.003 (39x)
1     1000    0.212 (1x)      0.004 (53x)     0.001 (212x)    0.01 (21x)         0.003 (71x)
1     2000    0.192 (1x)      0.026 (7x)      0.003 (64x)     0.026 (7x)         0.007 (27x)
1     4000    0.14 (1x)       0.088 (2x)      0.02 (7x)       0.037 (4x)         0.008 (17x)
1     8000    0.511 (1x)      0.204 (3x)      0.07 (7x)       0.089 (6x)         0.013 (39x)
1     12000   1.153 (1x)      0.31 (4x)       0.122 (9x)      0.128 (9x)         0.017 (68x)
1     16000   2.636 (1x)      0.406 (6x)      0.244 (11x)     0.236 (11x)        0.027 (98x)
50    500     0.527 (1x)      0.031 (17x)     0.012 (44x)     0.052 (10x)        0.019 (28x)
50    1000    1.847 (1x)      0.078 (24x)     0.023 (80x)     0.102 (18x)        0.029 (64x)
50    2000    7.584 (1x)      0.199 (38x)     0.048 (158x)    0.132 (57x)        0.042 (181x)
50    4000    22.707 (1x)     0.926 (25x)     0.864 (26x)     0.431 (53x)        0.095 (239x)
50    8000    86.336 (1x)     1.612 (54x)     1.421 (61x)     0.573 (151x)       0.088 (981x)
50    12000   15.545 (1x)     4.573 (3x)      4.073 (4x)      1.576 (10x)        0.209 (74x)
50    16000   32.387 (1x)     7.082 (5x)      13.832 (2x)     2.372 (14x)        0.30 (107x)
100   500     0.896 (1x)      0.05 (18x)      0.024 (37x)     0.087 (10x)        0.03 (30x)
100   1000    3.862 (1x)      0.123 (31x)     0.048 (80x)     0.152 (25x)        0.042 (92x)
100   2000    19.093 (1x)     0.458 (42x)     0.093 (205x)    0.263 (73x)        0.084 (227x)
100   4000    51.416 (1x)     1.456 (35x)     1.182 (43x)     0.52 (99x)         0.116 (443x)
100   8000    857.14 (1x)     3.457 (248x)    10.7 (80x)      2.502 (343x)       0.433 (1980x)
100   12000   765.259 (1x)    5.227 (146x)    6.534 (117x)    2.011 (381x)       0.27 (2834x)
100   16000   33.886 (1x)     15.854 (2x)     14.089 (2x)     2.767 (12x)        0.369 (92x)
Listing SM1: Collaborative multi-thread reduction algorithm.
do in parallel
    nThreads <- 2 * threadsPerBlock
    nIterations <- log2(nThreads) + 1
    repeat
        nThreads <- nThreads / 2
        DATA[t] <- DATA[t] + DATA[t + nThreads],  t = 0, …, nThreads - 1
        synchronize threads
    until nThreads = 1
    broadcast sum from DATA[0]
end do
return sum
template<unsigned int blockSize, class T>
__device__ static void reduceOne(int tid, T *s_A)
{ // loop fully unrolled
    if (blockSize >= 1024) { if (tid < 512) { s_A[tid] += s_A[tid + 512]; } __syncthreads(); }
    if (blockSize >=  512) { if (tid < 256) { s_A[tid] += s_A[tid + 256]; } __syncthreads(); }
    if (blockSize >=  256) { if (tid < 128) { s_A[tid] += s_A[tid + 128]; } __syncthreads(); }
    if (blockSize >=  128) { if (tid <  64) { s_A[tid] += s_A[tid +  64]; } __syncthreads(); }
    if (tid < 32){ // no need to synchronize for last 32 elements
        if (blockSize >= 64) { s_A[tid] += s_A[tid + 32]; }
        if (blockSize >= 32) { s_A[tid] += s_A[tid + 16]; }
        if (blockSize >= 16) { s_A[tid] += s_A[tid +  8]; }
        if (blockSize >=  8) { s_A[tid] += s_A[tid +  4]; }
        if (blockSize >=  4) { s_A[tid] += s_A[tid +  2]; }
        if (blockSize >=  2) { s_A[tid] += s_A[tid +  1]; }
    }
}
Listing SM2: Parallel all-prefix-sum scan.
do in parallel
    nThreads <- 2 * threadsPerBlock
    nIterations <- log2(nThreads) + 1
    tid <- thread ID
    repeat
        nThreads <- nThreads / 2
        DATA[tid] <- DATA[tid] + DATA[tid + nThreads],  tid = 0, …, nThreads - 1
        synchronize threads
    until nThreads = 1
    broadcast sum from DATA[0]
    if first thread
    then DATA[0] <- 0
    end if
    repeat
        temp[tid] <- DATA[tid],  tid = 0, …, nThreads - 1
        DATA[tid] <- DATA[tid] + DATA[tid + nThreads]
        DATA[tid + nThreads] <- temp[tid]
        synchronize threads
        nThreads <- nThreads * 2
    until nThreads = 2 * threadsPerBlock
end do
return sum
template<unsigned int blockSize>
__device__ static int parallelPrefixSum(int tid, int *DATA){
    unsigned int temp = 0;
    unsigned int sum  = 0;
    unsigned int n    = 2 * blockSize;
    // reduction (up-sweep)
    if (n >= 1024) { if (tid < 512) { DATA[tid] += DATA[tid + 512]; } __syncthreads(); }
    if (n >=  512) { if (tid < 256) { DATA[tid] += DATA[tid + 256]; } __syncthreads(); }
    if (n >=  256) { if (tid < 128) { DATA[tid] += DATA[tid + 128]; } __syncthreads(); }
    if (n >=  128) { if (tid <  64) { DATA[tid] += DATA[tid +  64]; } __syncthreads(); }
    if (tid < 32) DATA[tid] += DATA[tid + 32];
    if (tid < 16) DATA[tid] += DATA[tid + 16];
    if (tid <  8) DATA[tid] += DATA[tid +  8];
    if (tid <  4) DATA[tid] += DATA[tid +  4];
    if (tid <  2) DATA[tid] += DATA[tid +  2];
    if (tid <  1) DATA[tid] += DATA[tid +  1];
    __syncthreads();
    sum = DATA[0];
    __syncthreads();
    if (tid == 0) DATA[0] = 0;
    // propagation (down-sweep)
    if (tid <  1) { temp = DATA[tid]; DATA[tid] += DATA[tid +  1]; DATA[tid +  1] = temp; }
    if (tid <  2) { temp = DATA[tid]; DATA[tid] += DATA[tid +  2]; DATA[tid +  2] = temp; }
    if (tid <  4) { temp = DATA[tid]; DATA[tid] += DATA[tid +  4]; DATA[tid +  4] = temp; }
    if (tid <  8) { temp = DATA[tid]; DATA[tid] += DATA[tid +  8]; DATA[tid +  8] = temp; }
    if (tid < 16) { temp = DATA[tid]; DATA[tid] += DATA[tid + 16]; DATA[tid + 16] = temp; }
    if (tid < 32) { temp = DATA[tid]; DATA[tid] += DATA[tid + 32]; DATA[tid + 32] = temp; }
    __syncthreads();
    if (n >=  128) { if (tid <  64) { temp = DATA[tid]; DATA[tid] += DATA[tid +  64]; DATA[tid +  64] = temp; } __syncthreads(); }
    if (n >=  256) { if (tid < 128) { temp = DATA[tid]; DATA[tid] += DATA[tid + 128]; DATA[tid + 128] = temp; } __syncthreads(); }
    if (n >=  512) { if (tid < 256) { temp = DATA[tid]; DATA[tid] += DATA[tid + 256]; DATA[tid + 256] = temp; } __syncthreads(); }
    if (n >= 1024) { if (tid < 512) { temp = DATA[tid]; DATA[tid] += DATA[tid + 512]; DATA[tid + 512] = temp; } __syncthreads(); }
    return sum;
}
Listing SM3: Kernel for computation of new centroid positions.
do in parallel
    nThreads <- threadsPerBlock
    k <- block ID
    tid <- thread ID
    NUMPOINTSARRAY[tid] <- 0
    for each dimension d in [0…(D-1)]
        CENTERPARTS[tid] <- 0
        for each SEGMENT of size nThreads in DATAd
            if SEGMENT[tid] assigned to k
            then CENTERPARTS[tid] <- CENTERPARTS[tid] + SEGMENT[tid]
                if d = 0
                then NUMPOINTSARRAY[tid] <- NUMPOINTSARRAY[tid] + 1
                end if
            end if
        end for
        synchronize threads
        if first dimension
        then CENTERSUM, POINTSUM <- reduceTwo(CENTERPARTS, NUMPOINTSARRAY)
        else CENTERSUM <- reduceOne(CENTERPARTS)
        end if
        if first thread and POINTSUM > 0
        then NEWCENTER[k,d] <- CENTERSUM / POINTSUM
        end if
    end for
end do
__global__ static void calcCenters(int N, int K, int D, float *XPOS, float *CTRPOS, int *ASSIGN)
{
    extern __shared__ float array[];
    int   *s_numElements = (int*)   array;
    float *s_centerParts = (float*) &s_numElements[blockDim.x];
    int k   = blockIdx.x;
    int tid = threadIdx.x;
    float clusterSize = 0.0;
    s_numElements[tid] = 0;
    for (unsigned int d = 0; d < D; d++){
        s_centerParts[tid] = 0.0;
        unsigned int offset = tid;
        while (offset < N){
            if (ASSIGN[offset] == k){
                s_centerParts[tid] += XPOS[d * N + offset];
                if (d == 0) s_numElements[tid]++;
            }
            offset += blockDim.x;
        }
        __syncthreads();
        if (d == 0){
            reduceTwo<threadsPerBlock>(tid, s_centerParts, s_numElements);
            if (tid == 0) clusterSize = (float) s_numElements[tid];
        }
        else{
            reduceOne<threadsPerBlock>(tid, s_centerParts);
        }
        if (tid == 0) if (clusterSize > 0) CTRPOS[k * D + d] = s_centerParts[tid] / clusterSize;
    }
}
Listing SM4: Kernel calcScore for computing the within-cluster squared sum as convergence criterion.
do in parallel
    nThreads <- threadsPerBlock
    k <- block ID
    tid <- thread ID
    SCORES[tid] <- 0
    for each DATASEGMENT of size nThreads in DATA
        distance <- 0
        for each CENTROIDSEGMENT of size nThreads in CENTERS
            if DATASEGMENT[tid] assigned to k
            then for each c in CENTROIDSEGMENT
                distance <- distance + distanceComponent(c, DATASEGMENT[tid])
            end for
            end if
        end for
        SCORES[tid] <- SCORES[tid] + distanceFinalize(1, distance)
    end for
    synchronize threads
    score <- reduceOne(SCORES)
    if first thread
    then GLOBALSCORE[k] <- score
    end if
end do
__global__ static void calcScore(int N, int K, int D, float *XPOS, float *CTRPOS, int *ASSIGN, float *SCORE)
{
    extern __shared__ float array[];
    float *s_scores = (float*) array;
    float *s_center = (float*) &s_scores[blockDim.x];
    int k   = blockIdx.x;
    int tid = threadIdx.x;
    s_scores[tid] = 0.0;
    unsigned int offsetN = tid;
    while (offsetN < N){
        float dist = 0.0;
        unsigned int offsetD = 0;
        while (offsetD < D){
            if (offsetD + tid < D) s_center[tid] = CTRPOS[k * D + offsetD + tid];
            __syncthreads();
            if (ASSIGN[offsetN] == k){
                for (unsigned int d = offsetD; d < min(offsetD + blockDim.x, D); d++){
                    dist += distanceComponent(s_center + (d - offsetD), XPOS + (d * N + offsetN));
                }
            }
            offsetD += blockDim.x;
            __syncthreads();
        }
        s_scores[tid] += distanceFinalize(1, &dist);
        offsetN += blockDim.x;
    }
    __syncthreads();
    reduceOne<threadsPerBlock>(tid, s_scores);
    if (tid == 0) SCORE[k] = s_scores[tid];
}