SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures

IBM Research – Tokyo
SIMD- and Cache-Friendly Algorithm
for Sorting an Array of Structures
Hiroshi Inoue†‡ and Kenjiro Taura‡
† IBM Research – Tokyo
‡ University of Tokyo
September 3, 2015
© 2015 IBM Corporation
IBM Research – Tokyo
Our Goal
 Develop a fast algorithm for sorting an array of structures
struct Record {
int32_t key;
int32_t dataX;
int32_t dataY; payload
...
};
struct Record array[N];

key payload key payload
record i

record i +1
Our approach:
 To use SIMD-based multiway mergesort as the basis
 To exploit SIMD instructions to accelerate mergesort kernel
 To avoid random memory accesses that cause excessive
cache misses
2
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Outline
Goal
 Background: SIMD multiway mergesort for integers
 Performance problems in existing approaches
 Our approach
 Performance results
3
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Mergesort with SIMD for sorting integers
[Inoue et al. PACT 2007, Chhugani et al. PVLDB 2008 etc.]
 Merge operation for 32-bit or 64-bit integers can be
efficiently implemented with SIMD by:
– merging multiple values in vector registers using SIMD min
and max instructions (i.e. without conditional branches)
– integrating the in-register vector merge into usual
comparison-based merge operation
 SIMD min and max instructions can accelerate sorting by
– parallelizing comparisons
– avoiding unpredictable conditional branches
4
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
SIMD-based merge for values in two vector registers
sorted
1
4
sorted
7
<
8
2
<
3
<
2
5
4
sorted
6
<
<
3
5
<
<
1
Input
two vector registers contain
four presorted values in each
6
input
<
stage 2
<
stage 3
7
SIMD merging
one SIMD comparison and
“shuffle” operations for
each stage without
conditional branch
stage 1
8
output
Output
eight values in two vector
registered are now sorted
(example of odd-even merge)
5
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted
sorted array 1
1
4
7
9
10
11
12
14
21
23
sorted array 2
2
3
5
6
8
13
15
16
17
18
sorted
6
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted array 1
1
4
7
9
10
11
12
14
21
23
sorted array 2
2
3
5
6
8
13
15
16
17
18
sorted
sorted
vector registers
1
4
7
9
2
3
5
6
register-level merge
7
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted array 1
1
4
7
9
10
11
12
14
21
23
···
sorted array 2
2
3
5
6
8
13
15
16
17
18
···
sorted
vector registers
8
1
2
3
4
5
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
6
7
9
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted array 1
1
4
7
9
10
11
12
14
21
23
sorted array 2
2
3
5
6
8
13
15
16
17
18
use a scalar comparison
to select array to load from
vector registers
merged array
5
1
2
3
4
6
7
9
output smaller four values
as merged array
sorted
9
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted array 1
1
4
7
9
10
11
12
14
21
23
sorted array 2
2
3
5
6
8
13
15
16
17
18
vector registers
8
13
15
16
5
6
7
9
register-level merge
merged array
1
2
3
4
sorted
10
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background:
Integrating Odd-even merge into usual merge
sorted array 1
1
4
7
9
10
11
12
14
21
23
sorted array 2
2
3
5
6
8
13
15
16
17
18
vector registers
merged array
9
1
2
3
4
5
6
13
7
15
16
8
sorted
11
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Background: Multiway mergesort
Standard (2-way) mergesort
unsorted input
2way
2way
2way
2way
sorted
repeat
2-way
merge
for
log2(N)
stages
sorted output
Multiway (k-way) mergesort, here k = 4
unsorted input
4way
4way
sorted
12
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
k-way
merge
logk(N)
stages
sorted output
© 2015 IBM Corporation
IBM Research – Tokyo
Background: Multiway merge with SIMD
system memory
input streams
stream stream stream stream
0
3
1
2


2-way
merge


stream stream stream stream
4
7
5
6
2-way
merge

2-way
merge




2-way
merge

2-way
merge


2-way
merge

stage 1
stage 2

intermediate
buffer
2-way
merge
processor’s cache memory
stage 3
system memory
output stream
13

SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
(8-way merge as an example.
we used 32-way merge
in actual implementation)
© 2015 IBM Corporation
IBM Research – Tokyo
Outline
Goal
Background: SIMD multiway mergesort for integers
 Performance problems in existing approaches
 Our approach
 Performance results
14
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Existing approach for sorting structures with SIMD
 For sorting structures with SIMD mergesort, a frequently
used approach (key-index approach) is
1. pack key and index for each record into an integer.
2. sort the key-index pairs with SIMD, and then
3. rearrange the records based on the sorted keyindex pairs
 Costly due to random
accesses for main memory
15
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Performance problems in existing approaches
 Existing approaches for sorting structures with SIMD
instructions have performance problems
– Key-index approach: SIMD friendly but not cache
friendly due to random memory accesses
– Direct approach: cache friendly but not SIMD friendly
because the keys are not
 key payload key payload 
stored contiguously in
memory
record i
record i +1
 Our approach takes the benefits of both approaches
16
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Our approach:
Rearranging at each multiway merge step
Key-index approach
unsorted input
encode
Our approach
Direct approach
unsorted input
unsorted input
encode
merge
integers
rearrange
a multiway
merge
stage
(e.g. 8-way
merge)
merge
structures
SIMD
unfriendly
encode
cache
unfriendly
rearrange
sorted output
17
rearrange
sorted output
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
sorted output
© 2015 IBM Corporation
IBM Research – Tokyo
Our approach:
Rearranging at each multiway merge step
Key-index approach
unsorted input
encode
Our approach
Direct approach
unsorted input
unsorted input
encode
merge
integers
Number of
memory copy
per record
rearrange
sorted output
18
merge
structures
at every
2-way
merge
step:
encode
only once
(random
only once
access)
rearrange
a multiway
merge
stage
(e.g. 8-way
merge)
multiple
times
times:
log2(N)
log
logkk(N)
(N)
rearrange
sorted output
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
sorted output
© 2015 IBM Corporation
IBM Research – Tokyo
Our approach: Multiway Merge with SIMD
system memory
stream stream stream stream
0
3
1
2




stream stream stream stream
4
7
5
6




input streams
extract key and encode {key, streamID} pair into an integer value
2-way
merge
2-way
merge
2-way
merge


2-way
merge

2-way
merge

2-way
merge

•rearrange records
based on merged
integer values
processor’s cache memory
stage 1
stage 2

stage 3
•move only integers
during merging
2-way
merge
decode streamID and copy records
system memory

19
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
output stream
© 2015 IBM Corporation
IBM Research – Tokyo
Our approach: Overall sorting scheme
Step 1: divide into
small blocks that can
fir in (L2) cache
unsorted input
Step 2: sort each
block by vectorized
combsort
k-way merge
k-way merge
k-way merge
k-way merge
Step 3: repeatedly
execute multiway
merge to merge all
blocks
k-way merge
sorted output
sorted
(read the paper on
vectorized combsort)
20
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Performance Evaluations
 System
– 2.9-GHz Xeon E5-2690 (SandyBridge-EP) processors
• 2 sockets x 8 cores = 16 cores
• using SSE instructions (128-bit SIMD)
– 96 GB of system memory
– Redhat Enterprise Linux 6.4, gcc-4.8.2
 We compared the performance of following algorithms
– Multiway mergesort (k = 32 way)
• Our approach, key-index approach, direct approach
– Radix sort, STL std::stable_sort, STL (unstable) std::sort
21
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
22
110
100
90
80
70
60
50
40
30
20
10
0
with SIMD
without SIMD
faster
•Our approach
gained largest
speed up from
SIMD
among three
approaches
•It achieved
comparable
performance to
radix sort
execution time (sec)
Performance for Sorting 512M 16-byte records
with and without SIMD instructions
Our
Key-index Direct Radix sort
STL
STL sort
approach approach approach
stable_sort (unstable)
multiway mergesort
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Performance Scalability with multiple cores
45
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_sort
STL sort (unstable)
35
30
25
20
faster
relative perf ormane over STL's
std::stable_sort on 1 core
40
15
10
5
• Our approach
achieved the
best
performance
with multiple
cores up to 16
cores
• It showed
better scalability
than radix sort
0
0
4
8
number of cores
12
16
using 512M 16-byte records
23
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Performance with various sizes of a record
cache line size
7
executin time (sec)
6
5
4
faster
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_sort
STL sort (unstable)
• Key-index
approach
performed well
for very large
record sizes
due to smaller
number of
data copy
3
2
1
0
8
24
16
24
32
48
64
96
record size (byte)
128
192
256
using 16 million records on 1 core
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Summary
 We developed a new algorithm for sorting an array of
structures, that enables
– efficient use of SIMD instructions
– efficient use of cache memory (no random accesses)
 Our new algorithm outperformed the key-index
approaches in multiway mergesort especially when
each record is smaller than a cache line
 It outperformed the radix sort for records larger than 16
byte and showed better scalability with multiple cores
Read the paper for more detail:
efficient 4-wide SIMD exploitation, results with 10-byte
ASCII key, sorting for variable-length strings
25
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
backup
26
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Optimization techniques: partial key comparison
 To use 4-wide SIMD (32 bit x 4) instead of 2-wide SIMD
(64 bit x 2)
– Increasing data parallelism and giving the higher
performance
– Sorting based on a partial key
• We confirm that the sorted order is correct using the entire
key when rearranging records using scalar comparison
27
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Sorting for variable-length strings
256M 12-byte to 20-byte records
(16 bytes on average)
64M 12-byte to 84-byte records
(48 bytes on average)
100
30
faster
80
execution time (sec)
execution time (sec)
90
70
60
50
40
30
20
25
20
15
10
5
10
0
0
Our
Key-index Direct Radix sort
approach approach approach
multiway mergesort
Our
Key-index Direct Radix sort
approach approach approach
multiway mergesort
28
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
Performance with various numbers of records
lower is faster
execution tim e (sec)
100
10
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_ sort
STL sort (unstable)
1
0. 1
4M
8M
16M
32M
64M
# records
128M
256M
512M
using 16-byte records on 1 core
29
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation
IBM Research – Tokyo
80
70
60
50
40
30
20
10
0
30
128
256
512
1k
2k
128
256
512
1k
2k
4
3
2
1
64
32
16
8
4
0
2
execution time (sec)
5
• Larger ways
reduced path
length but
increased
cache misses
faster
using 512M
16-byte records
64
number of ways
6
on 16 cores
32
16
8
4
2
faster
on 1 core
execution time (sec)
Effect of number of ways in multi-way merge
number of ways
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
© 2015 IBM Corporation