
A Concurrent Matrix
Transpose Algorithm
Pourya Jafari
Application
Frequently Used Linear Algebra Operation



Scientific Applications
FFT
Matrix Multiplication
Transpose Matrix
B[i][j]: the item/cell at row i and column j of matrix B.

For all i, j the transpose B^T satisfies B^T[i][j] = B[j][i].
Simply exchange rows and columns
For simplicity we only consider square matrices

N rows and N columns, labeled 0 to N-1
An Example
Each cell is filled with row|column number
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
6 swaps, (4*4 – 4)/2 = 6

In general, for a size-N square matrix we have (N² - N)/2 swaps.
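As an illustration, a minimal sequential sketch of the swap-based transpose (Python, list-of-lists; the function name is just illustrative):

def transpose_in_place(B):
    # Swap B[i][j] with B[j][i] over the upper triangle: (N*N - N)/2 swaps in total.
    N = len(B)
    for i in range(N):
        for j in range(i + 1, N):
            B[i][j], B[j][i] = B[j][i], B[i][j]
    return B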
Parallelizing
Naïve algorithm

A thread for each swap
Quadratic number of threads
Quadratic number of communication links

→ impractical
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
Parallelizing - 2
A more efficient way

Assign a column to each thread
00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
(each column is handled by one thread)
O(N) threads
Communication links?

Depends on the approach
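A minimal shared-memory sketch of the column-per-thread assignment (Python threads, purely illustrative; in a distributed setting the interesting part is the communication pattern discussed next):

from concurrent.futures import ThreadPoolExecutor

def transpose_column_per_thread(A):
    # One worker per column: column j of A is written into row j of the result.
    N = len(A)
    T = [[None] * N for _ in range(N)]

    def worker(j):
        for i in range(N):
            T[j][i] = A[i][j]

    with ThreadPoolExecutor(max_workers=N) as pool:
        list(pool.map(worker, range(N)))
    return T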
Measure dislocation
A single swap operation expressed as row and column shifts
00 01 02   03
10 11 12   13
20 21 22   23
30 31 32   33
(column 3 drawn separately to illustrate a column shift of length A)
j = i - K → K = i - j
Shift length is i - j (taken mod N); its value ranges from 0 to N-1
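A small sketch of this bookkeeping (up-shift convention as above; helper names are illustrative):

def up_shift_length(i, j, N):
    # Cyclic up-shift length that moves an element from row i to row j: (i - j) mod N.
    return (i - j) % N

def shift_column_up(B, col, k):
    # Apply a cyclic up-shift of k cells to one column of square matrix B.
    N = len(B)
    old = [B[r][col] for r in range(N)]
    for r in range(N):
        B[r][col] = old[(r + k) % N]   # the element k rows below wraps around to row r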
Concurrency Scheme
Minimize communication

Pre-process inside thread
Shift each row
Intra-process/thread communication
Shift each column
Post-process inside thread
Shift each row again

00 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
Concurrency Scheme - 2
We have the row shifts fixed based on the row index

They have range 0 to N-1, consistent with our initial finding
Now arrange the rows (with a preliminary column shift of length L) so that the row shifts get each element to column i

The row shift of row i' moves column j to j + i', so we need j + i' = i, i.e. i' = i - j
A downward column shift of L gives i' = i + L
→ L = -j
So we shift each column j cells up
Steps so far
(1)            (2-a)          (2-b)
00 01 02 03    00 11 22 33    00 11 22 33
10 11 12 13    10 21 32 03    03 10 21 32
20 21 22 23    20 31 02 13    02 13 20 31
30 31 32 33    30 01 12 23    01 12 23 30
1 → 2: Column shift j up
2 → 3: Row shift based on row indices
3 → 4: ?

Change of indices so far
(i, j) → (i - j, j) → (i - j, i - j + j) = (i - j, i) = (m, n)

One operation to change the row index to j:
n - m = (i - (i - j)) = j

(3)
00 11 22 33
03 10 21 32
02 13 20 31
01 12 23 30

(4)
00 10 20 30
01 11 21 31
02 12 22 32
03 13 23 33
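Putting the passes together, a sequential sketch of steps (1) → (4) with the index bookkeeping tracked above (a parallel version would distribute columns or rows across threads):

def shear_style_transpose(A):
    N = len(A)
    # (1) -> (2): shift column j up by j cells: (i, j) -> (i - j, j)
    B = [[A[(r + j) % N][j] for j in range(N)] for r in range(N)]
    # (2) -> (3): shift row m right by m cells: (i - j, j) -> (i - j, i)
    C = [[B[m][(c - m) % N] for c in range(N)] for m in range(N)]
    # (3) -> (4): within column n, row m moves to row n - m: (i - j, i) -> (j, i)
    D = [[C[(n - r) % N][n] for n in range(N)] for r in range(N)]
    return D

On the 4x4 example this reproduces matrices (2-a), (3) and (4) above.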
Efficiency of algorithm so far
O(N) row and column operations


O(N²) overall, considering both rows and columns
O(N) communication links
Communication is a major bottleneck

Group row shifts
Reduce communication and overall complexity
Radix Representation
Radix r


Base r numbers
For each digit place k (starting from the least significant)
For each digit value l from 0 to r-1


Group all row shifts for the current step
Radix 3
Possible digits: 0, 1 and 2

Second loop { For l = 0 to 2 }
Shift all rows whose shift amount has l in its kth digit place by l*r^k to the right
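A sketch of the grouped row shifts for a general radix r, assuming (as in step 2 → 3 above) that row i must be cyclically shifted right by i cells:

def grouped_row_shifts(A, r):
    N = len(A)
    digits = 0
    while r ** digits < N:                       # digit places needed for shifts 0..N-1
        digits += 1
    for k in range(digits):                      # each digit place, least significant first
        for l in range(1, r):                    # digit value 0 needs no shift
            step = (l * r ** k) % N
            for i in range(N):
                if (i // r ** k) % r == l:       # shift amount i has digit l at place k
                    A[i] = A[i][-step:] + A[i][:-step]   # cyclic right shift by step
    return A

All rows sharing digit l at place k move by the same amount, so each (k, l) pair can be batched into one group of messages.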
Special Case: Radix-2
The inner loop has only two steps: digits 0 and 1

We only shift for 1
Digits are the bits of the binary representation

Shift all rows whose index has the kth bit set
Shift for each row = (k=0 contribution) + (k=1 contribution)
row 0:  0 = 0 + 0
row 1:  1 = 1 + 0
row 2:  2 = 0 + 2
row 3:  3 = 1 + 2
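In the radix-2 case this collapses to one bit test per pass (same style as the sketch above):

def grouped_row_shifts_radix2(A):
    N = len(A)
    k = 0
    while (1 << k) < N:                          # one pass per bit of the row index
        step = 1 << k
        for i in range(N):
            if i & (1 << k):                     # kth bit set: this row shifts in this pass
                A[i] = A[i][-step:] + A[i][:-step]   # cyclic right shift by 2^k
        k += 1
    return A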
Algorithm complexity
Depends on r (radix)



C1 = (r-1)⌈log_r N⌉
C2 = b(r-1)⌈N/r⌉⌈log_r N⌉
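A small helper evaluating these counts (assuming the brackets denote ceilings, that C1 counts communication steps, and that C2 counts transferred cells with b the per-row message size; these interpretations are assumptions, not taken from the slides):

def radix_costs(N, r, b):
    passes = 0
    while r ** passes < N:                    # integer ceil(log_r N)
        passes += 1
    c1 = (r - 1) * passes                     # C1 = (r-1) * ceil(log_r N)
    c2 = b * (r - 1) * (-(-N // r)) * passes  # C2 = b(r-1) * ceil(N/r) * ceil(log_r N)
    return c1, c2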
Special cases
r=2

Important when communication cost is high
Good when the message size is small
r=N

Good when message size is large
The best value depends on communication costs, message size, communication link performance, number of ports, etc.
Radix vs. message size vs. time for 64 processors (plot)