Efficient Skycube Computation on the
GPU
Kenneth Sejdenfaden Bøgh, 20071354
Master’s Thesis, Computer Science
June 2013
Advisor: Ira Assent
Aarhus University
Department of Computer Science
Abstract
The skyline query is used for multi-criteria searching, and returns the most
interesting points of a potentially very large data set with respect to any monotone preference function. The skycube consists of the skylines for all possible
non-empty subsets of dimensions. Existing work has almost exclusively focused
on efficiently computing skylines and skycubes on one or more CPUs, ignoring
the high parallelism possible on Graphics Processing Units. Furthermore, current algorithms scale badly for large, high-dimensional data since the number
of computations needed grows exponentially. In this thesis we investigate the
challenges of efficient skyline and skycube computations that exploit the computational power of the GPU. We also study the limitations and strengths of both
the CPU and the GPU to determine how to fully utilize both. We present a sorting-based, data-parallel skyline algorithm which runs efficiently on the GPU,
and discuss its properties. We then use this algorithm as a basis for developing
several efficient GPU-based skycube algorithms, each of which shows the impact
of different optimization opportunities in the domain of data-parallel skycube
computations. In a thorough experimental evaluation, we demonstrate that our
algorithms are substantially faster and scale much better than the state-of-the-art algorithms on large, high-dimensional datasets.
Resumé
The skyline query is used for multi-criteria searching and computes which points
in a potentially very large data set are interesting with respect to any preference
function. A skycube is the set consisting of the results of all possible skyline
queries on a data set, over every non-empty subset of dimensions. Existing
research has almost exclusively focused on developing skyline and skycube
methods that run efficiently on one or more CPUs. The large potential for
parallel computation on the graphics card has thus been largely ignored, despite
the fact that current algorithms scale badly for large data sets with many
dimensions, due to an exponential growth in the number of computations that
must be performed. This thesis investigates the challenges of using the graphics
card for efficient skyline and skycube computations. We explore the strengths
and weaknesses of the CPU as well as the graphics card in order to uncover
how both can be fully utilized. We develop an efficient, sorting-based, parallel
skyline algorithm that fully exploits the graphics card. We study the properties
of this algorithm and use it as a basis for developing a series of parallel skycube
algorithms, all of which utilize the graphics card and each of which highlights
the strengths and weaknesses of different optimizations. Finally, by means of a
thorough experimental evaluation, we show that our algorithms are substantially
faster and scale much better than the fastest CPU-based algorithms.
Acknowledgements
I would like to thank my supervisor Professor Ira Assent for her invaluable
support in getting this thesis done, and for being very constructive when things
got tough.
I would also like to thank Matteo Magnani for doing a fantastic job as a stand-in
supervisor while Ira was on maternity leave.
Furthermore I would like to thank Jongwuk Lee, the author of the QSkycube
algorithm, for providing his source code and datasets, so that we could make
the fairest possible comparisons to the state-of-the-art algorithms. Finally I would
like to thank my family and friends for not abandoning me despite my, at times,
obsessive focus on the thesis.
Kenneth Sejdenfaden Bøgh,
Aarhus, June 21, 2013.
Contents

Abstract
Resumé
Acknowledgments
1 Introduction
  1.1 Thesis statement
  1.2 Structure of the thesis/contributions
2 Background and preliminaries
  2.1 Parallel and GPU Computing
    2.1.1 Parallel Computing
    2.1.2 Flynn's taxonomy
    2.1.3 GPU computing
  2.2 Skyline and sub-skyline computation
    2.2.1 Definition of Skyline and sub-skyline
  2.3 Skycube
    2.3.1 Definition of the skycube
    2.3.2 Related skycube work
    2.3.3 Limitations of single skyline algorithms
3 Parallelization of the skycube calculation
  3.1 Skyline Computation on the GPU
    3.1.1 Sharing work between CPU and GPU
    3.1.2 Dominance test
    3.1.3 The GGS algorithm
    3.1.4 Choosing sorting function
    3.1.5 Choosing alpha
    3.1.6 Sorting data in parallel
    3.1.7 Data transfer on the pci-e slot
    3.1.8 Coalesced reading
    3.1.9 Utilizing the shared memory
    3.1.10 Thread Organization
  3.2 Parallel skycube computation
    3.2.1 The lattice as a datastructure
    3.2.2 The naive approach
    3.2.3 Utilizing lattice parent
    3.2.4 Minimizing sorting
    3.2.5 Parallelization of the CPU part
    3.2.6 Optimizing for incomparable points
    3.2.7 Utilizing cuda streams
    3.2.8 Utilizing the constant memory
  3.3 Summary of parallel skyline and skycube computation
4 Experiments
  4.1 Synthetic data
  4.2 Varying α
  4.3 Varying β
  4.4 Comparing algorithms on synthetic data
  4.5 Comparing algorithms on real data
5 Conclusion and future work
  5.1 Future work
Primary Bibliography
A GPU glossary
Chapter 1
Introduction
The skyline query is useful in multi-criteria decision making, data-mining and
searching. The skyline can be considered the set of all the best trade-offs in the
dataset. To show how the skyline works, we consider the example of buying a
used car, using the toy dataset in Table 1.1.
          Vol(cc)  Horsepower  Age  Kilometers  Price  KPL   Doors
    C1    2319     126         17   471000      8000   11.2  4
    C2    1330     75          17   213000      6900   14.9  3
    C3    1249     60          5    31000       49000  17.9  5
    C4    1587     109         13   189000      19900  14.5  5
    C5    1461     82          8    270000      28900  23.8  4
    C6    1149     60          13   240000      12999  16.7  3
Table 1.1: Example of used car dataset
We consider the case of a young student buying his first car, mostly for small
trips in the city. Students typically do not have a lot of money, and insurance
is usually more expensive for more powerful cars, especially for young people.
As such, the student may be interested in cheap cars with low engine volume
and little horsepower. To help the student make a decision among the cars in
Table 1.1, one could do a skyline query, with preference for minimizing price,
volume and horsepower. A skyline query on a d-dimensional dataset returns
the interesting points by only returning those points which are not dominated
by other points. A point p is not dominated if no other point q in the dataset
is as good or better than p on all dimensions while also being better on at least
one dimension. The query would therefore return C2 and C6, since these do
not dominate each other, while at least one of them dominates each of the other
cars. For example, C6 dominates C3 since it is better on volume and price, and
they are equal on horsepower. We return both C2 and C6, since C2 is better
than C6 on price, while C6 is better than C2 on horsepower and volume, which
means that neither dominates the other. Thus the skyline can be seen as a filter
that removes from the data those points that are not interesting w.r.t. the preference
function. Another example could be a family that wants to use the car to go
on holidays and take the kids to school every day. For such a scenario the
preference might be high KPL (kilometers per liter) and many doors, which
would return C3 and C5, since the rest either have fewer doors or lower KPL
while not being better than the skyline points on the other dimension.
Indeed many different scenarios for buying a car may be thought of, each producing a skyline query on a different subset of dimensions. Furthermore a user
may want to include or exclude search criteria as they see the results of their
first queries, which would also lead to further skyline queries. Since the skyline
computation can be very time-consuming, many queries can be a problem. This
is where the skycube comes in. The idea is to precompute the skyline of all
possible subsets of dimensions, and then simply do a lookup when a skyline of
a specific subset of dimensions is requested.
As can be seen from this introduction, the skyline query is very useful in many
decision making systems, such as choosing cars, hotels, the next computer or
even what house to buy; basically, any situation where one faces a trade-off
and the number of options is too large to get a proper overview. The skycube
enables the efficient application of the skyline query in real world scenarios,
without the need for time-consuming computations for each query. However,
the skycube itself is also very time-consuming to compute, especially on large,
high-dimensional data. Therefore the skycube is only useful in a real world
scenario, if we can compute it within a reasonable time-frame.
1.1 Thesis statement
Both the skyline and the skycube are very useful constructions for supporting
data analysis and decision making, by filtering data that is not interesting in
a given use case. However, both can be computationally demanding operations, especially for large, high-dimensional data. In this thesis we investigate
if it is possible to speed up the skycube computation by utilizing the Graphics
Processing Unit (GPU) as a co-processor for the computationally demanding
operations, while still maintaining a structure for sharing results on the CPU
to minimize the number of computations.
1.2 Structure of the thesis/contributions
In Chapter 2 we give a detailed introduction to the GPU, formally define the
skyline and skycube and present related work. Chapter 3 presents our parallel
skyline and skycube algorithms, including several different approaches for the
skycube computation. In Chapter 4 we verify the efficiency of our algorithms,
while Chapter 5 discusses future work and concludes this thesis.
Chapter 2
Background and preliminaries
2.1 Parallel and GPU Computing
This section introduces parallel computing and the taxonomy traditionally
used to differentiate parallel computing systems, and gives a detailed description of parallel computing on both CPUs and GPUs.
2.1.1 Parallel Computing
The basic idea of parallel computing is to execute several computational tasks
in parallel. This is done mainly for three reasons. The first reason is simply
to speed up the total computation time. This is important in several different
scenarios. First of all, parallel computations can improve
response times in interactive systems, where a user expects results within reasonable time limits, in order to issue more commands. Secondly, speeding up
analysis of large data sets might make the analysis more valuable, or make an
analysis feasible that was not feasible in the past. That is, if the
data analysed is outdated by the time the results are ready, the analysis might
not be worth the effort or, in the worst case, be useless. The second major reason
to do parallel computing is the ability to improve on results, while keeping the
execution time constant. A classic example of this is the movie series called
Toy Story, where the first movie was the first fully computer-generated movie.
It was rendered in 1995 using a farm of 100 dual-processor machines. As a
comparison, Toy Story 2 was rendered using a 1400-processor system, yet the
time spent rendering each frame has stayed more or less constant. Instead
the extra computing power has been utilized to create better textures, more
realistic clothes and atmosphere as well as more details.
The above examples come from business contexts. This is no coincidence,
since parallel computing was for a long time mainly done by very big companies
or in research contexts. However, with the introduction of dual-core and quad-core CPUs for personal computers, this has changed during the past decade.
Therefore the third and most recent reason for doing parallel computing is to
fully utilize the hardware in modern PCs.
2.1.2 Flynn's taxonomy
In order to differentiate parallel systems the literature traditionally uses what
is known as Flynn’s taxonomy [20], which is based on two measures, namely
the number of data streams and the number of instruction streams. These are
either multiple or single, thus creating four categories of systems. Below is a
short outline of each of these, and what the terms cover.
SISD
SISD, or Single Instruction, Single Data, covers the classical computer as originally described by von Neumann [20]; all single-core processors fall into
this category. In this system a single stream of instructions processes a single
stream of data, as the name suggests.
SIMD
SIMD is an abbreviation for Single Instruction, Multiple Data. In this kind
of system multiple processors receive the same instructions to be performed
on different data, in a broadcast fashion. As such all processors in the system
receive and execute the same instructions, but on different streams of data. The
Graphics Processing Unit can be thought of as a kind of SIMD system. This
will be discussed in more detail at the end of this section.
MISD
Multiple Instruction, Single Data is the idea of having several processors executing different instructions, but on the same data. This kind of system,
although possible to construct, is not currently used in real world scenarios. It
is mentioned here, and in the literature in general, mostly for completeness.
MIMD
The last of the four categories is Multiple Instructions, Multiple Data. As
the name suggests, this kind of system has both an instruction stream and a
data stream for each processor. This also makes it the most general of the
four categories, since each of the other three can be modelled using the MIMD
structure. Since MIMD is the most general of the four, it would make sense to
model a parallel computer from it, and this is indeed the case with all modern
parallel computers that are based on CPUs.
2.1.3 GPU computing
As the name suggests, the Graphics Processing Unit (GPU) processes graphics.
It is the part of the computer hardware that displays the images seen on the
screen of a computer, by processing the signals received from the Central Processing Unit (CPU) which controls the entire computer.
Researchers have been utilizing the GPU to do parallel computing for many
years. Early on, it was done by using the programmable parts of the GPU to
create parallel results as side-effects of the graphics processing, thus "misusing"
the APIs to do parallel computations. This is known as General Purpose programming on the Graphics Processing Unit (GPGPU). In recent years hardware
manufacturers have realized the potential in this and have started developing
both specific hardware for GPGPU and specialized APIs, libraries and language
extensions to make programming GPGPU applications easier. This has made
utilizing the great potential computing power of the GPU more accessible to
programmers, and it has led to the application of GPUs in many different
areas.
The programming platform
Before we introduce the GPU in detail, a discussion regarding choice of platform
is needed. First, GPUs of different vendors have different hardware layouts. As
such it is not possible to introduce the GPU hardware model in general, like
it is with CPUs. Secondly, the choice of hardware platform also affects what
tools are available. For this thesis we have chosen the Compute Unified Device
Architecture (CUDA) [14] from nvidia. Since CUDA is designed specifically
for nvidia's own hardware, this also means that we have chosen to use nvidia
hardware. The alternative to CUDA is the open standard called openCL [15].
The reason we have chosen CUDA is that, while it is a vendor-specific and
proprietary standard, it provides much better tool support, with an IDE customized for the purpose and debugging of running parallel code, which means
that finding errors becomes much easier. CUDA is also more mature, and is
developing faster than openCL. The newest edition of CUDA was released in
October 2012, while openCL 1.2 was released in November of 2011. Whether
the slower development of the standard is due to the many companies that
develop openCL (Apple, nvidia, AMD, Intel, IBM and many more) or that all
companies first need to catch up with version 1.2, we do not know. The
more mature CUDA also has libraries available for free which are more complete than the openCL alternatives. The libraries provide features that make
the implementation work for this thesis easier and faster, and as such allow for
a more complete implementation. However, we do not use any features that
are specific to nvidia hardware or software, so our algorithms will be portable
to other platforms. The sections below will outline the structure of GPUs, the
memory model and the programming model, with emphasis on weaknesses and
strengths as compared to traditional parallel programming. The architecture
used as an example is that of the nvidia GPU.
Structure of GPUs
In order to understand the limitations and strengths of the GPU, it is important
first to understand how the components of the hardware are put together as
compared to the CPU. A traditional CPU is built to execute sequential programs very fast. In order to do so, it has a lot of control logic implemented to
make sure that the next command of a given process is ready for the processors
and it typically has a lot of cache so that data can be fetched and kept close
to the processing unit when it is needed. A GPU, on the other hand, is built to
render pixels for graphics, which can be done in a highly parallel manner. As
such the GPU is built with many parallel cores, while sacrificing some of the
control and cache logic. Figure 2.1 gives an overview of these differences. The
impact of the larger cache and control in the CPU is that it is able to apply
clever strategies to predict what data will be fetched from memory and attempt
to load it before it is needed. It also makes fetching the next instructions for
each core very efficient, so that the CPU can swap between processes quickly.
However, the drawback is that the computing throughput is hurt since there
are fewer resources allocated to do the actual raw computing.
The architecture of the GPU on the other hand, allows a high level of parallel
throughput by having many cores for computing. However, this increase in potential throughput comes at the expense of less cache and less space for control.
This results in a relatively high latency when the GPU reads data from its
memory, even though the speed and bandwidth of the memory on a GPU typically are much higher than those of CPU memory. In order to hide this latency,
GPUs have a special threading model which allows many thousands of lightweight threads to be active at once, on fewer cores. As such, while some threads
are waiting for data to load, others can utilize the available cores to execute
computations, thus "hiding" the memory latency. This model is described in
more detail in Section 2.1.3.
The last thing we want to mention about GPUs is how these many cores are
organized. Although Figure 2.1 presents the general design concept of GPUs it
is a simplification.
The cores in a GPU are called Streaming Processors (SP). The SPs are grouped
together on a chip and form what is called a Streaming Multiprocessor (SM).
Several SMs, alongside the memory, then constitute the GPU. How many SPs
there are per SM is dependent on the hardware. For example, the current
generation nvidia GPUs have 192 SPs per SM, whereas the last generation had
32 SPs per SM. Typical current generation GPUs have between 2 and 16 SMs.
Each SM has a number of resources that are shared among all the SPs, such
as the cache and the scheduler; these will be explained in later sections.
Figure 2.1: The basic hardware layout of a CPU and a GPU from [14]
The Threading model
As mentioned above the GPU trades off cache and control logic in the hardware
design in order to provide more raw computing power. This decision has two
major impacts. The first is a high memory latency since the cache is smaller and
less intelligent than that of the CPU. The second is that each GPU core cannot
work independently of the others. Instead the same instruction is sent to many
cores at once, which then execute that instruction, but on different data. This
is done by dividing the threads into groups of 32 called warps. The GPU then
schedules warps that have completed their last instruction for execution. As
such each warp executes in lock-step, instruction by instruction, but different
warps execute independently unless the program explicitly synchronizes them.
Since the memory latency can effectively slow down the whole execution, one
must try to hide this latency. This is done by ensuring that there are always
more warps waiting for execution than can be executed at once. This allows
for some warps to execute computations while others are waiting for I/O operations from RAM. If enough warps are present, and they are not too heavily
memory dependent, then there is always a warp ready to execute, thus minimizing the idle time of the SPs. It should be mentioned that the lock-step
behaviour of warps can be broken. This is called branch divergence and the
result is that the behaviour of a warp becomes partially sequential. To clarify
this, imagine the code of an executing warp reaching an if-then-else statement.
If half of the threads evaluate the if statement to true and the rest to false, then
the next instruction will not be the same for the entire warp. As a result half
of the warp will execute first, and then the other half will be executed, until
the next instruction after the branching. This is also the case if just one thread
evaluates differently, thus resulting in just one of the 32 threads
executing in the branching part of the code.
While the threads are executed as warps, they are logically organized in what
is called blocks. A block is simply a 1D, 2D or 3D "matrix" of threads. These
blocks are then organized in a 1D, 2D or 3D matrix called a grid. Figure 2.2
shows an example of a 2D grid of 2D blocks. We have this three-layer structure
because each block is expected to be fully contained in an SM. The reason for
this is that the parallel program has access to some of the shared resources on
the SM, and as such the SM defines an upper bound on the number of threads in
a block. This will be discussed further in Section 2.1.3. As a last note it should
be mentioned that threads, blocks and grids are software constructs. What this
means is that, while the hardware limits the number of threads per block, it
does not limit the total number of threads or blocks. In fact the literature
advises that one launches many more threads than can be active at one time
in order to utilize the hardware fully [14].
Figure 2.2: The thread hierarchy of a GPU [14]
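To make the thread hierarchy concrete, a minimal CUDA sketch could look as follows. The kernel, its name, the scaling operation and the chosen block size are purely illustrative assumptions and not part of our implementation.

// Minimal CUDA sketch of the thread hierarchy: each thread derives a global
// index from its block and thread coordinates and processes one element.
__global__ void scaleKernel(float* data, int n, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard threads past the end
        data[idx] *= factor;
}

void scaleOnGpu(float* d_data, int n)
{
    // Choose a block size and derive the grid size so that at least n threads start.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, 2.0f);
}

Launching many more blocks than there are SMs is intentional: as noted above, oversubscribing the hardware with warps is what hides the memory latency.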
The Memory Model
Although the GPU trades off cache and control logic for more raw power, it still
has two layers of small cache, and a memory model consisting of 5 different types
of memory. Figure 2.3 provides an overview. We will introduce each of the three
main kinds of memory, starting from the bottom. The most abundant type of
memory is the global memory. It is located off-chip and can be considered
equivalent to the main memory of the CPU. It typically has a size ranging from
512MB to 6GB. The gtx670 card we use in this thesis has 4GB of global memory.
Since the global memory is off-chip, and the cache on the GPU is small and less
intelligent, it also takes a long time to access, typically 200-400 clock cycles
for the newest generation of hardware, which is what we use. This weakness of
the GPU has improved over time due to higher clock speed of the memory and
added caching, but it is still a real bottleneck in many applications. As can be
seen from Figure 2.3, all threads can read and write to global memory. Besides
this the global memory is also where data is transferred to and from, by the
CPU.
The next level of memory is called shared memory. As the name indicates it is
shared by a block of threads, and can thus to a certain degree be used to share
results between threads in a block. This memory is located on the same chip as
the SM, which means that it is very fast, and has a very low latency compared
to global memory. It is typically used to load data from global memory, if the
threads access the same data more than once. However, the shared memory is
a very limited resource, with a size of only 48KB per SM in current generation
hardware, which limits its usability in some cases.
The last kind of memory is local to each thread, and is called registers. The
registers are the fastest kind of memory on the GPU, and are also located on
the SM chip. However, the registers are a very scarce resource, so they must
be used with care. If the threads use more registers than are available,
this is called register spilling. What happens is that the extra needed registers
are allocated on the global memory. This of course reduces performance since
it means that some of the variables used in each thread must be read from
the high latency global memory. This can be avoided by forcing the GPU to
only allow the usage of the registers that are actually available. However, the
consequence would then be that fewer blocks run on the SMs at any given time,
thus rendering SPs idle.
More generally speaking it is typically either the shared memory or the registers
that limit the number of threads that run on each SM. Whether it is better to
allow more reads from global memory to maximize the amount of active threads,
or have fewer threads with more registers and shared memory available depends
on the application.
Besides the three main types of memory mentioned, there are two special types
of memory which both reside inside the global memory. The first is called
constant memory which is a cached, read-only memory. That is, it can be
written from the CPU, but only read from the GPU. As with shared memory
this is a limited resource of only 64KB, but if accessed appropriately it can
be very fast. The main idea behind this is broadcasting. That is, if many
threads read the same address in the constant memory, it can be broadcasted
to those threads, thus making a more efficient read than directly from global
memory. The other special kind of memory is the texture memory. As the name
indicates this is normally used for storing textures for graphics processing. As
with constant memory this is a cached memory, but the cache is optimized for
2D and 3D arrays with locality. This caching strategy can be visualized as a
square (2D) or a box (3D) with the loaded data in the center. Again there is a
limit on how much we can allocate, namely 65536 elements on each dimension
for 1D and 2D and 4096 for each dimension in 3D.
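The broadcast behaviour of constant memory can be sketched as follows; the symbol name, the idea of storing a small set of dimension indices there and the helper function are all illustrative assumptions.

// Sketch: a small, read-only parameter block is placed in constant memory.
// All threads read the same addresses, so the values can be broadcast.
#include <cuda_runtime.h>

__constant__ int c_dims[32];                        // assumed subset of dimension indices

__global__ void subsetSumKernel(const float* data, int n, int d,
                                int numDims, float* out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float sum = 0.0f;
    for (int i = 0; i < numDims; ++i)               // identical reads across the warp
        sum += data[idx * d + c_dims[i]];
    out[idx] = sum;
}

void setSubset(const int* hostDims, int numDims)
{
    // Written once from the host before the kernel launch; read-only on the GPU.
    cudaMemcpyToSymbol(c_dims, hostDims, numDims * sizeof(int));
}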
The last thing we want to mention regarding memory is the access patterns.
When reading from global memory, the GPU always reads aligned 128-byte
blocks of data. What this means is that if two threads in the same warp each
read a 4 byte float, but from memory addresses that are more than 128 bytes
apart, then two times 128 bytes will be fetched from global memory, while
only 8 bytes (the two floats) will actually be used. As such reading data in a
coalesced way from global memory is very important to keep the throughput
high. This issue is especially evident on older generation GPUs. The global
memory is cached on the new generations of GPUs, but the effect is still very
visible if there are global memory accesses far apart in terms of addresses.
Figure 2.3: The memory hierarchy of a GPU [14]
GPUs as seen in Flynn's taxonomy
As can be seen from the introduction above, GPU programming is somewhat
similar to the SIMD category of Flynn's taxonomy; however, there is an important difference between the GPU model and SIMD. In SIMD all processors
execute the same instruction at the same time. That is, all processors step
through the instructions together, as they are broadcast. The GPU on the
other hand works with lightweight software threads, and allows many more
concurrent threads than there are cores available to process the instructions.
The literature calls this Single Instruction Multiple Threads (SIMT) [14], and
it allows more parallel processing at once, as well as more flexibility, but it
also requires a more detailed understanding of the underlying hardware to fully
utilize.
Summary of parallel computing
We end this section with a summary of the most important differences between
parallel computing on the CPU and GPU.
Branch divergence Since the warps in the GPU essentially work as small
SIMD structures, branch divergence in the code will hurt performance.
Although some branching can be avoided by rewriting the algorithm or
the implementation, this is still an issue on the GPU. On the CPU, however,
each core has its own control and instruction logic which allows for many
different paths through the code in parallel with very little, if any, loss in
performance.
MIMD vs SIMT Since each CPU core has its own control logic, different
tasks can be performed on different cores at any given time. As such the
CPU has the ability of task parallelism, which makes algorithms utilizing
constructions such as parallel divide and conquer algorithms or parallel
recursion possible. The GPU, on the other hand, is bound to running
the same code on all cores, which makes it inefficient to run traditional
divide-and-conquer algorithms. Neither is it feasible to do recursion on
the GPU, since the stack for each thread is small, and GPU threads
do not have the ability to spawn new threads.
Level of parallelism A modern GPU can have more than 3000 cores, whereas
a typical modern CPU has between 4 and 16, although some specialized
CPUs have more. As such to utilize the full power of a GPU the algorithm
needs a very high level of parallelization, compared to a CPU. Whether
this is a problem or a gain is of course application specific since the extra
level of parallelism also results in more computations at once.
Memory latency For problems that are heavily dependent on memory, the
transfer speed between the CPU and the GPU can be a bottleneck for
the GPU. Moreover since the data is then either transferred to the shared
memory on the GPU, or accessed directly from the global memory, it
might not be possible to fully utilize the GPU for memory heavy problems,
thus leaving the SPs idle.
Memory size The memory of the GPU is typically faster than that of the
CPU, with the newest generations having a speed of 6 GHz, whereas state-of-the-art memory for the CPU only has a speed of 2.6 GHz. However, modern
computers can accommodate more than 100 GB of RAM, whereas GPUs
currently have a limit of 6GB. If the problem size is more than 6GB, this
means that data must be swapped to and from the CPU memory which
can hurt performance and throughput.
Throughput Although the GPU is limited to run the same program on all
cores, it does have a lot more cores, and these cores each run at 900-1600 MHz depending on the specific device. As such the throughput of the
GPU can be much higher than that of a typical 16-core CPU running at around
3 GHz, if the drawbacks are dealt with.
2.2 Skyline and sub-skyline computation
This section formally defines skylines, subspace skylines and the terminology
of skyline computation, as well as introducing previous work on skylines and
subspace skylines.
2.2.1 Definition of Skyline and sub-skyline
The skyline consists of all points for which no better point exists in a given
dataset, with respect to some monotone preference function. This relation is
known as domination [3] and is formally defined as follows.

Definition 1 Dominance
Given a dataset D, a set of dimensions d and two points p, q ∈ D, then p
dominates q if and only if ∀i ∈ d : p_i ≤ q_i ∧ ∃j ∈ d such that p_j < q_j. We write
this as p ≺_d q.

Given the definition of dominance we also define incomparability between
two points.

Definition 2 Incomparability
Given a dataset D, a set of dimensions d and two points p, q ∈ D, then p and
q are incomparable if and only if p ⊀_d q ∧ q ⊀_d p. We write this as p ∼_d q.

Next we define the skyline.

Definition 3 Skyline
Given a dataset D and a set of dimensions d, the skyline S of D is the subset
of D consisting of the points that are not dominated on d:
S = {p ∈ D | ∄q ∈ D such that q ≺_d p}

Finally we define the subspace skyline.

Definition 4 Subspace skyline
Given a dataset D, a set of dimensions d and a subspace U ⊆ d, the subspace
skyline S_U of D is the subset of D consisting of the points that are not dominated on U:
S_U = {p ∈ D | ∄q ∈ D such that q ≺_U p}

As such, one can think of the skyline as the special case of the subspace
skyline where U = d.
Related skyline work
Since the skyline operator was first introduced to the database community in
2001 [3] it has received a lot of research attention, and many different approaches
to efficiently computing the skyline have been proposed. These can be roughly
divided into three categories, which we outline here. The first category is based
on the divide-and-conquer approach. These algorithms recursively divide the
data by some heuristic until the partitions have reached a certain size. The skyline of each partition is then computed, and the resulting skylines are merged
to produce the final skyline [3]. Depending on the heuristic used to partition
the data, it is possible to avoid some comparisons of data. This is achieved
by partitioning the data so that it is known that all records in certain pairs
of partitions are incomparable. As such some of the dominance tests can be
bypassed by carefully choosing the partition strategy. Furthermore, since the
local skylines can be computed in parallel, investigations on both distributed
systems [22, 9, 23] and multicore CPUs have been performed [17].
The second category of skyline algorithms makes use of indexing structures.
These algorithms typically utilize hierarchical indices such as a B-tree [13] or
R-tree [10], by precomputing an index on the data, and then utilizing the index
to avoid accessing the entire dataset, as it was also suggested in [3]. The idea of
these approaches is to organize the data by either a reduction to a single dimension (B-tree) or by indexing the data directly as spatial data points (R-tree).
The tree traversal is then able to eliminate parts of the tree, by comparing
the skyline points already found to the values in each node for the B-tree, and
the minimum bounding rectangle for the R-tree. In fact a branch-and-bound
algorithm called BBS [16] was proven to be I/O optimal.
The third category of skyline algorithms is based on nested loops. Again this
approach was first described in [3] in an algorithm called Block Nested Loop
(BNL). The idea is to iterate over the dataset without any preprocessing, while
keeping a window of skyline points in main memory. Each of the data points
in the dataset is then compared to each of the current skyline candidates in the
window. If any point in the window is dominated, then it is removed and if the
new point is not dominated by any point in the window, it is added to it. If
the window overflows then the new candidates are written to a temporary file,
which then serves as an input to the next iteration of the algorithm [3]. This
approach was later refined by adding a presorting step in an algorithm called Sort-Filter-Skyline (SFS) [5]. The idea of SFS is that if data is sorted according to
a monotone scoring function, then many dominance tests can be avoided. This
is due to the fact that, in the sorted order, a data point can only be pruned by
its predecessor, not its successor, which was first shown in [5]. This approach
was later refined by modifying the sorting function to do early pruning while
sorting [7] and by choosing the monotone scoring function so that it became
possible to stop the scan before the end of the dataset [2] in most cases.
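To make the SFS idea concrete, the following host-side sketch (our own illustration, not the original implementation from [5]) sorts the data by the sum of coordinates and then scans it once; because of the presorting, any point that survives the comparisons against the window is a skyline point and can be appended immediately.

// Sketch of the SFS idea: sort by a monotone score, then a single pass where
// each point is only compared against the current window of skyline points.
// Point, dominates() and the container choices are illustrative assumptions.
#include <algorithm>
#include <vector>

using Point = std::vector<float>;

static bool dominates(const Point& p, const Point& q)
{
    bool pBetter = false;
    for (size_t i = 0; i < p.size(); ++i) {
        if (p[i] > q[i]) return false;           // q is strictly better on dimension i
        if (p[i] < q[i]) pBetter = true;         // p is strictly better on dimension i
    }
    return pBetter;
}

std::vector<Point> sfsSkyline(std::vector<Point> data)
{
    auto score = [](const Point& p) {            // monotone scoring function
        float s = 0.0f;
        for (float v : p) s += v;
        return s;
    };
    std::sort(data.begin(), data.end(),
              [&](const Point& a, const Point& b) { return score(a) < score(b); });

    std::vector<Point> window;                   // confirmed skyline points
    for (const Point& p : data) {
        bool dominated = false;
        for (const Point& s : window)
            if (dominates(s, p)) { dominated = true; break; }
        // In the sorted order no later point can dominate p, so the window
        // never has to be cleaned up.
        if (!dominated) window.push_back(p);
    }
    return window;
}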
All the algorithms above can be used as subspace algorithms by simply ignoring dimensions that should not be taken into account. However, except for BNL,
they all preprocess the data, and this preprocessing either becomes suboptimal
when processing subsets of dimensions or needs to be done for every new subset
of dimensions. BNL can be applied directly as a subspace algorithm, but it
is inefficient for large datasets since it does a lot of dominance tests. Due to
this fact several algorithms have been developed with the purpose of supporting sub-skylines. The ideas are based on the same base categories as the ones
mentioned above. The first is an index-based algorithm, which uses a heuristic to
reduce the dimensions to a single value that is then indexed in a B-tree [19],
with the heuristic and the tree traversal algorithms optimized towards subspace
searching.
The second category is the distributed subspace skyline algorithms. These algorithms concentrate on distributing data among several nodes in a network
either with [6] or without a central unit [21] (peer-to-peer), and computing the
skyline by computing local skylines for the data located on each node and then
merging this data, to produce the final result.
To the best of our knowledge, only one work on GPU skyline algorithms is accessible. This approach translates the BNL [3] approach to the GPU. Basically,
the GPU-parallel skyline algorithm (GNL) runs nested parallel loops, which
compare all data points to all other data points [4]. Thus it is a simple parallel
implementation of the brute force algorithm.
2.3 Skycube
All algorithms presented in Section 2.2.1 focus on computing either the full
skyline or ad-hoc subspace skylines. While these algorithms are fairly efficient,
the response times are still relatively high since all of the algorithms still need
to do dominance tests for each query in order to produce results. This delay
is affected by the dataset size, the dimensionality of the data, the chosen
subset and the number of concurrent queries. To address this issue, the authors
of [24] proposed the skycube, where all the 2^d − 1 non-empty sub-skylines are
precomputed, so that a query can be answered by a simple lookup, without any
further computations whatsoever. This allows the skyline to be applied in
real-world applications such as the used car example shown in Chapter 1, where
many users may issue many queries at once. The skycube is also very suitable
for data analysis, where users looking for interesting patterns in large
datasets might want to add or remove dimensions from the query many
times. In such cases we can compute the skycube once for the user and then
simply reuse the result as dimensions are added and removed. While the idea
of the skycube indeed makes the skyline much more useful in interactive multi-user systems, it is very expensive to compute. In fact it has been proven that
this is an NP-hard problem [18].
2.3.1 Definition of the skycube
In the skycube literature [24, 12] the subspace skylines are called cuboids, and
a cuboid of a set of d-dimensional datapoints D on dimensions V ⊆ d is written
as SKY_V(D). As such the skycube consists of the cuboids on the 2^d − 1 non-empty subsets of d.
The original skycube paper [24], as well as later work, presents its algorithms
under what is known as the distinct value condition, which we will now repeat
before we present the related work.
Definition 5 Distinct value condition
Given a set of d-dimensional points D, the distinct value condition states
that for all p, q ∈ D with p ≠ q, it holds that ∀i ∈ d : p_i ≠ q_i.
Extensions that do not rely on this condition are presented for all the algorithms, but the distinct value condition does provide a nice property of the
cuboids, called skyline containment, which we present in Theorem 1.
Theorem 1 Skyline containment
Given a set of d-dimensional points D under the distinct value condition, and
two subsets U, V ⊆ d so that U ⊂ V, then ∀p ∈ D : p ∈ SKY_U(D) ⇒ p ∈ SKY_V(D)
holds.
We will use this property when presenting the related skycube work next.
Figure 2.4: A skycube lattice over the dimensions A, B, C and D, with level 4 (ABCD) at the top, level 3 (ABC, ABD, ACD, BCD) and level 2 (AB, AC, AD, BC, BD, CD) in the middle, and level 1 (A, B, C, D) at the bottom.
2.3.2 Related skycube work
The idea of the skycube was first introduced in [24]. The authors presented
the idea of using a lattice like the one in Figure 2.4 to represent the skycube.
The idea of the lattice is that each cuboid has child cuboids consisting of
its subspaces and parent cuboids consisting of its superspaces. The authors
presented two algorithms to compute the skycube, called Bottom Up Skycube
(BUS) and Top Down Skycube (TDS), which traverse the lattice as their names
indicate. We will now present both.
The idea of BUS is to compute the skycube by traversing the lattice in a level-wise, bottom-up manner, starting with the single dimensions and ending with
the fullspace skyline. This enables the utilization of Theorem 1, to avoid some
dominance checks for all levels except the first (computing cuboids on single
dimensions). This is possible since we know by Theorem 1 that skyline points
of a subset U are also skyline points of the superset V when U ⊂ V . As such
the union of the skylines of all subsets of V can be directly added to the skyline of V.
Furthermore BUS utilizes sharing of sorting, by presorting the data according
to each of the single dimensions before the skycube computation. This sorting
is then used as the monotone sorting function for SFS, so that SFS can be used
to compute each of the cuboids. While the sorting is not the most efficient for
SFS pruning, it does ensure correct results for each of the cuboid calculations
as also shown in [2]. Furthermore, it reduces the number of sorts needed from
2^d − 1 to just d, as compared to naively computing each cuboid in isolation with
SFS.
The TDS algorithm, which was also presented in [24], is based on the divide-and-conquer approach. This algorithm computes the skycube using a top-down
approach, which utilizes two concepts to minimize the computation cost. First
TDS is able to compute several cuboids at once by extending the divide-and-conquer approach. This is done by choosing a path from the top to the bottom
of the lattice, such as ABCD, ABC, AB, A, which we will use to demonstrate
the approach. The data is first recursively divided by the median of the dimension that is common to all the cuboids (in this case A). Once the data
have been split on A, and we start to merge the partitioned space, we perform
four merges in parallel, one for each of the cuboids. This way we compute all
cuboids along the path, while only having divided the data once. This saves
time since we need to divide the data fewer times.
Secondly, the authors utilize Theorem 1 by using the skyline of a superset
as input for the computations, rather than the entire dataset, whenever possible. That is, once ABCD, ABC, AB, A have been computed, the next path, e.g.
BCD, BC, B, can be computed on the basis of the skyline of ABCD rather than
the whole dataset. This can be done since Theorem 1 states that the skyline of
a cuboid is completely contained in the skyline of its parent.
While both of the algorithms mentioned above share results and information to
reduce the cost of the skycube computation, the authors of the state-of-the-art
skycube algorithm QSkycube [12] found even more opportunities to share information when computing the skycube. This was done by analysing the faster
TDS, and realising two opportunities to further share information. First, TDS
only splits the data on one dimension at a time. Recent skyline research [25, 11]
has shown that splitting the data into 2^d disjoint partitions around a pivot point
p_v provides a very efficient way of determining incomparability and dominance
between datapoints. This is done by assigning a d-dimensional binary vector
to each datapoint p, so that the i-th value in the vector is 0 if p_i < (p_v)_i and 1
otherwise. Once these binary vectors have been created, all datapoints assigned
binary vectors B with ∀i ∈ d : B_i = 1 can be pruned as non-skyline points, since
p_v dominates them on all dimensions. Moreover all pairs of points assigned
binary vectors B, B′ with ∃i ∈ d : B_i < B′_i ∧ ∃j ∈ d : B′_j < B_j are known to be
incomparable. It is clear to see that this has the potential to eliminate a lot of
dominance tests. Furthermore, since these binary vectors contain dominance
information of the individual dimensions, they can be reused to determine partial dominance and incomparability of subsets of dimensions.
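As an illustration of how such binary vectors can be represented and used, consider the following sketch; the bitmask encoding, the function names and the assumption of at most 32 dimensions are our own illustrative choices, not code from [12].

// Sketch of pivot-based partitioning: bit i of the mask is 1 iff the point is
// not better than the pivot on dimension i (assumes d <= 32).
#include <cstdint>

uint32_t pivotMask(const float* p, const float* pivot, int d)
{
    uint32_t mask = 0;
    for (int i = 0; i < d; ++i)
        if (!(p[i] < pivot[i]))            // 0 only if p is better than the pivot on i
            mask |= (1u << i);
    return mask;
}

// A point whose mask has all d bits set is dominated by the pivot and can be
// pruned (under the distinct value condition).
bool prunedByPivot(uint32_t mask, int d)
{
    return mask == ((1u << d) - 1u);
}

// Two points are guaranteed incomparable if each is better than the pivot on
// some dimension where the other is not (B has a 0 where B' has a 1, and vice versa).
bool guaranteedIncomparable(uint32_t b, uint32_t bPrime)
{
    return (b & ~bPrime) != 0 && (bPrime & ~b) != 0;
}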
The second observation of the authors of [12] is that TDS only uses information
from one parent, since the lattice is iterated in a depth-first manner. They instead suggest doing the computation in a level-wise, top-down manner. That is,
they compute all cuboids in each level before moving on to the next, from level 4 to
1 in Figure 2.4. This enables the utilization of several parents to reduce the input
data size. The observation is that for a point to be a skyline point in a dimension
subset U, it must also be a skyline point in all supersets V with U ⊂ V,
by Theorem 1. As such, the skyline of a subset is contained in the intersection
of the skylines of all its supersets.
All of the algorithms mentioned in this section rely heavily on the distinct value
condition. However, the papers that present them also consider extensions to
support the general case where duplicate values are allowed. For BUS, the problem is that the algorithm might include false positives when doing the union
of the skylines of child cuboids. To avoid this, dominance tests are performed
with respect to the superset of dimensions, for data with equal values on some
dimensions when doing the union. This way, any points not in the skyline are
eliminated in the union step.
For TDS and QSkycube, some skyline points might be missing for lower levels
in the lattice, since they were eliminated in the skyline computation of previous
levels. To ensure that these points are included, both algorithms maintain a
second data structure that contains information on which points have equal
values on which dimensions. This data structure is then used to ensure that all
points with values equal to the skyline points of a given cuboid are also considered.
2.3.3 Limitations of single skyline algorithms
As the last part of this section we consider the limitations of single skyline
algorithms.
All the algorithms in Section 2.2.1 could potentially be used to compute the
skycube in a naive fashion, by simply computing the skyline for each subset
one by one. However, this would incur a very large overhead, since dominance
tests would be performed on the same dimensions of the same datapoints many
times. Moreover, the indices would either become suboptimal (R-tree) or need
to be recomputed (B-tree) for the subsets of dimensions. For the sort-based
solutions, the data would need to be sorted for each cuboid computation, again
leading to an overhead.
Chapter 3
Parallelization of the skycube calculation
In this chapter we develop our GPU-parallel skycube algorithms. We do this in
two steps. First we investigate the less complex case of developing an efficient
GPU-parallel skyline algorithm. We then use the results of this work as
a basis for the development of several efficient GPU-based skycube algorithms.
3.1 Skyline Computation on the GPU
In this section we will outline the main challenges of doing efficient skyline
computations on the GPU. While these apply to both skyline and skycube
computations we will start by addressing the skyline case, and then extend our
solution to also cover the skycubes described in Section 3.2. As we develop
an efficient GPU skyline algorithm, we investigate major issues in doing SIMD
skyline computations. The work presented in this section has been converted to
a paper during the writing of this thesis, and published in the DaMoN workshop
at the SIGMOD conference with Ira Assent and Matteo Magnani as co-authors
[8]. As such this section is heavily based on the paper, although the writing
presented here has been done entirely by the author of this thesis. It should also
be noted that the ideas and concepts presented are original work by this author
although they have been refined under supervision by both Ira Assent and
Matteo Magnani.
3.1.1 Sharing work between CPU and GPU
For any algorithm utilizing the GPU, both the strengths and weaknesses of the
platform must be considered. That is, individual components of the algorithm
must be assigned to the hardware that best supports them. For the domain
of the skyline computation, this means that the computationally demanding
dominance checks should be assigned to the GPU, while the more lightweight bookkeeping of which points have been dominated and which have been confirmed as
skyline points should be done by the CPU, since this process is not easily parallelizable. As such this is the structure we strive for in our algorithm presented
in Section 3.1.3. However, before we present the algorithm, we first investigate
the all-important dominance check, and how we can efficiently implement it on
the GPU.
3.1.2 Dominance test
Determining dominance relations is the operation that is executed the most
in any skyline algorithm. As such it is very important that we can do this
efficiently on the GPU. Since the GPU does not handle branch divergence very
well, we must ensure that the instruction streams of the individual threads are
kept the same whenever possible. To this end we have developed an optimized
and completely branch-free algorithm for comparing data points. The pseudo
code can be found in Algorithm 1. We avoid branching by maintaining two
boolean values in lines 5-6, for the two points respectively. If p ≺_d q then p
must be better on at least one dimension and q can never be better on any
dimension, which is exactly what we capture with the two booleans. This is
unlike a CPU-optimized dominance check algorithm, which would introduce
several if-statements to stop comparing data as soon as possible, since the
optimization goal of the sequential execution is to minimize data comparisons.
We sacrifice minimizing the number of value-to-value comparisons in order to
gain computational throughput. Next we present our GPU-parallel algorithm,
which utilizes this branch-free dominance check.
Algorithm 1: BranchFreeDominance
Input : Two d-dimensional data points p and q
Output: Whether p dominates q
1 begin
2     bool pbetter ← false;
3     bool qbetter ← false;
4     for i ∈ d do
5         pbetter ← p_i < q_i ∨ pbetter;
6         qbetter ← q_i < p_i ∨ qbetter;
7     return ¬qbetter ∧ pbetter;
8 end
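As a concrete counterpart to Algorithm 1, a CUDA device function could be sketched as follows; the row-major layout of the points and the function name are assumptions made for the example, not necessarily our exact implementation.

// Branch-free dominance test as a CUDA device function (cf. Algorithm 1).
// p and q each point to d consecutive floats; the loop body contains no
// branches, so all threads of a warp stay on the same instruction stream.
__device__ bool branchFreeDominance(const float* p, const float* q, int d)
{
    bool pBetter = false;
    bool qBetter = false;
    for (int i = 0; i < d; ++i) {
        pBetter = (p[i] < q[i]) | pBetter;   // p is strictly better on some dimension
        qBetter = (q[i] < p[i]) | qBetter;   // q is strictly better on some dimension
    }
    return pBetter & !qBetter;               // p dominates q
}

The bitwise operators are used instead of the short-circuiting logical operators so that no hidden branches are introduced.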
3.1.3 The GGS algorithm
The most prominent problem of adapting current skyline algorithms to the
GPU is the fact that they are inherently data-dependent. That is, they are
build around the fact that we have some knowledge of whether other data
points have been pruned or not. This is mostly expressed by either the maintenance of a current set of skyline points or by the merging of skylines of smaller
datasets. In both cases the next computation step of the algorithms depends
on the result of the previous ones. For efficient computation using the GPU,
such assumptions are not possible, since they make massively parallel computations almost impossible. This being said, some of the features of the sort-based
Algorithm 2: GGS
Input : D: a dataset
        α: the number of points to compare against in each iteration
Output: The skyline of the dataset
 1 begin
 2     Sky ← ∅;
 3     Dh ← D;
 4     transfer D to device memory;
 5     sort D in parallel;
 6     while D ≠ ∅ do
 7         D′ ← {p_i ∈ D | i ∈ 0...α};
 8         Ddom ← ∅;
 9         for ∀p ∈ D do in parallel
10             for q_i ∈ D′ do
11                 if BranchFreeDominance(q_i, p) then
12                     Ddom ← Ddom ∪ {p};
13                     return;
14         transfer Ddom to the host;
15         Dh ← Dh \ Ddom;
16         Sky ← Sky ∪ (Dh ∩ D′);
17         Dh ← Dh \ D′;
18         transfer Dh to device;
19         de-fragment data on device w.r.t. Dh;
20     return Sky
21 end
approaches [5, 2] may be reused in a SIMD-parallel skyline algorithm. Specifically the fact that for any monotone scoring function, a data-point cannot be
dominated by a successor in the sorted order. This feature was first proved in
[5] and makes its possible to mark any data point as a skyline point, if it has
been compared to all skyline points before it, in the sorted order. This is also
utilized in our GPGPU skyline (GGS) algorithm which we will now present.
Algorithm 2 contains the pseudo code for GGS. In lines 2-3 the skyline set is initialized and the indices of the data are recorded in Dh. In lines 4-5 the dataset
is transferred to the device and sorted. Next, in lines 6-13 we compare each
point in D to the first α points of D. Since D is sorted according to a monotone
scoring function, and we start a thread per point in the dataset, we only need
to do dominance tests one way. That is, if any of the first α points are
dominated, then it will be recorded by the threads responsible for those points.
If a data point is dominated, then the thread of that data point adds the index of the point to the set of dominated points. Once the parallel execution is
done, we transfer the information on the dominated points back to the host and
remove them from Dh in lines 14-15. In lines 16-17 we save those of the first
α points that were not dominated as skyline points, and remove them from D.
This is possible since these points have been compared to all points that could
dominate them. As the last step we transfer information on what points remain
back to the device, and execute a de-fragmentation of the data array on the
device. We iterate over the continuously smaller dataset, until it is empty. It
is guaranteed that the algorithm terminates since we remove at least α points
from D in each iteration. It should also be clear that the data dependency issues have been eliminated, since the dominance tests are performed on
a per-point basis and without any dependencies on the results for other points.
Furthermore, this algorithm minimizes branch divergence, by only branching
if a GPU thread terminates. We include this if statement, since a warp may
finish quickly if all the points assigned to that warp are dominated. In such
cases the resources occupied by the warp are freed earlier thus allowing other
warps to execute earlier.
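The data-parallel part of GGS (lines 9-13 of Algorithm 2) could be sketched as the following kernel, reusing the branchFreeDominance device function sketched in Section 3.1.2; the flag array, the flat row-major layout and the kernel name are illustrative assumptions.

// Sketch of the parallel phase of GGS: one thread per point compares its point
// against the first alpha points of the sorted data and records whether it is
// dominated.
__global__ void ggsDominanceKernel(const float* data, int n, int d,
                                   int alpha, int* dominatedFlags)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    const float* p = data + idx * d;            // the point owned by this thread
    int limit = alpha < n ? alpha : n;
    for (int i = 0; i < limit; ++i) {
        const float* q = data + i * d;          // one of the first alpha points
        if (branchFreeDominance(q, p, d)) {     // q dominates p
            dominatedFlags[idx] = 1;            // record p as dominated
            return;                             // this thread terminates early
        }
    }
}

On the host, the flag array is then copied back and used to update Dh and Sky as in lines 14-17 of Algorithm 2.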
3.1.4 Choosing sorting function
Clearly our algorithm is dependent on the sorting function. Specifically, we
depend on how well the first points in the sorted order prune non-skyline
points, since early pruning leads to fewer iterations, and allows each parallel run
to work on a smaller dataset. The authors of [2] conducted a series of experiments, to evaluate several different scoring functions and their pruning power.
They concluded that the sum of coordinates, also known as the Manhattan
norm, has the best pruning power. This will be the sorting we use throughout
this thesis, unless otherwise specified.
3.1.5 Choosing alpha
The parameter α specifies how many data points each point is compared to on the device in each iteration, before the host takes over and filters the dominated points from the dataset. This affects the performance of the algorithm in two main respects. First, each iteration shrinks the dataset, which in turn accelerates the parallel performance since fewer threads need to be started. However, there is also an overhead in terms of data transfer to/from the device and the initialization of threads associated with each iteration. Secondly, the performance on different data distributions is affected. For correlated data, where the skyline is small, a small α can prune much of the data early. However, for anti-correlated data, which produces a large skyline, a large α is preferred so that the overhead of each iteration stays in a good ratio to the amount of pruned points. Preliminary experiments have shown that the largest effect of a small α on correlated data is observed in the first iteration, and that a small α in the first iteration on anti-correlated data does not yield any noticeable difference in execution time if the remaining iterations are executed with a higher α. We therefore tailor the algorithm to both types of distributions by using ¼α in the first iteration, and α in all remaining iterations. This allows a quick first iteration with a lot of pruning on correlated data, while the relatively large α still provides a good ratio on subsequent runs for anti-correlated data. We investigate the effect of α in more detail in Chapter 4.
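As a small illustration of the schedule just described, a host-side helper (our own naming, not part of the thesis code) could select the batch size per iteration as follows:

/* Hedged sketch: quarter-sized batch in the first iteration, full alpha afterwards. */
static int alpha_for_iteration(int iteration, int alpha)
{
    return (iteration == 0) ? alpha / 4 : alpha;
}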
3.1.6
Sorting data in parallel
In order to fully utilize the GPU, we sort the data in parallel. This is done by first launching a thread for each point. These threads record their index into one array, compute the Manhattan norm of their point, and record it into another array. The array containing the indices is then sorted in parallel with respect to the Manhattan norm array, using the sorting function from the Thrust library [1]. The library uses a highly optimized parallel version of radix sort, which is able to sort large arrays very efficiently. Once the sorting completes, we use the array of indices to reorganize the data array on the device, so that the data appears in sorted order. This approach allows us to use the fast sorting on the GPU, and it saves data transfers to and from the device since all the operations are performed on the device.
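The key-generation and sort step can be sketched with Thrust as follows. This is an assumed illustration rather than the thesis code: the row-major device array, the functor and the function names are ours; Thrust's sort_by_key dispatches an optimized parallel sort on the device.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Functor computing the Manhattan norm (sum of coordinates) of point i.
struct manhattan_norm
{
    const float *data;   // row-major n x d array on the device
    int d;
    manhattan_norm(const float *data_, int d_) : data(data_), d(d_) {}
    __host__ __device__ float operator()(int i) const
    {
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += data[i * d + j];
        return s;
    }
};

void sort_indices_by_manhattan(const thrust::device_vector<float> &d_data,
                               int n, int d, thrust::device_vector<int> &d_idx)
{
    thrust::device_vector<float> d_keys(n);
    d_idx.resize(n);
    thrust::sequence(d_idx.begin(), d_idx.end());                // 0, 1, ..., n-1
    thrust::transform(d_idx.begin(), d_idx.end(), d_keys.begin(),
                      manhattan_norm(thrust::raw_pointer_cast(d_data.data()), d));
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_idx.begin());
}

The sorted index array can afterwards be used to reorganize the data array on the device (for example with thrust::gather), so the points appear in Manhattan-norm order.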
3.1.7
Data transfer on the pci-e slot
Since the pci-e slot is a potential bottleneck, we aim to minimize the amount of data that is transferred between the host and the device. We do this by only transferring index information between the device and the host once the full dataset has been transferred to the device. That is, the sets Ddom and Dh in Algorithm 2 only contain IDs, and not the actual data. The skyline is then built on the host, by utilizing the fact that a copy of the data remains in the host memory.
3.1.8
Coalesced reading
In every iteration of Algorithm 2 the data becomes fragmented, since some data points are pruned. To ensure that the data can be read in a coalesced manner, we perform a parallel, order-preserving de-fragmentation of the data after every iteration. The de-fragmentation is done in two steps. First, while the CPU is filtering the pruned points from the dataset, it records where the non-pruned data points are located in the data array on the device. This information is then transferred to the device, and the data is moved from its original positions to the beginning of the array in batches of α. This is possible since, after each iteration, the first α points have either been pruned or recorded as skyline points. Again, it is worth noting that only the array with data indices is transferred to and from the device, and not the data itself, preventing a potential bottleneck. The de-fragmented array allows the data to be read in a coalesced manner, and thus a full utilization of the memory bandwidth.
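A simplified sketch of the re-compaction is shown below. It differs from the in-place, batched move described above in that it writes to a separate output array; src_pos holds, for each surviving point, its old position as recorded by the host, and all names are illustrative rather than the thesis code.

__global__ void defragment(const float *data_in, float *data_out,
                           const int *src_pos, int remaining, int d)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= remaining * d) return;
    int point = t / d;          // index in the compacted order
    int dim   = t % d;
    data_out[t] = data_in[src_pos[point] * d + dim];
}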
3.1.9
Utilizing the shared memory
As mentioned, threads are divided into blocks that can access a very fast shared memory. Since each data point is compared to α other points in each iteration, it is repeatedly accessed. To minimize fetches from global memory, the threads of a block load the points they are responsible for into shared memory at the beginning of each iteration. To further reduce fetches from global memory, each of the α data points is also loaded into shared memory before the dominance tests are performed. This allows the threads to do dominance tests on data that is entirely contained in the low-latency shared memory, rather than having to load the data from the high-latency global memory. Furthermore, since all threads in a block have access to the same shared memory, each of the α data points only needs to be fetched from global memory once per block.
3.1.10
Thread Organization
As we discussed in Section 2.1.3, the limiting factor for the number of active threads on the GPU is, for most applications, either the shared memory or the registers. Our application is no different, since we rely heavily on the shared memory (SM) to speed up the computation. To reduce the impact of this limitation we have chosen to run our algorithm with a fairly large thread block size of 512. This means that more threads share the same SM, and thus the space allocated for the α points is used more efficiently. However, when the data dimensionality starts to grow, so does the amount of SM needed to store the points that should be compared. To avoid having to use a smaller block size, we allow the space allocated for the α points to be smaller than α. When this is the case we simply divide α into chunks that fit in the SM, and iteratively load these chunks until all α points have passed through the SM and been compared for dominance.
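The chunked use of shared memory described in the last two subsections can be sketched as follows. This is an assumed simplification, not the thesis kernel: each thread keeps its own point in registers rather than shared memory, CHUNK and MAX_D are illustrative bounds (d ≤ MAX_D is assumed), and threads that are finished keep participating in the barriers instead of returning early.

#define MAX_D 16   // assumed upper bound on the dimensionality
#define CHUNK 64   // comparison points staged in shared memory at a time

__global__ void ggs_shared_chunks(const float *data, int n, int d, int alpha,
                                  int *dominated, int *num_dominated)
{
    __shared__ float tile[CHUNK * MAX_D];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float p[MAX_D];                                 // this thread's own point
    bool active = (tid < n);
    if (active)
        for (int i = 0; i < d; ++i) p[i] = data[tid * d + i];

    int limit = alpha < n ? alpha : n;
    for (int base = 0; base < limit; base += CHUNK) {
        int len = min(CHUNK, limit - base);

        // cooperative, coalesced load of the next chunk of comparison points
        for (int t = threadIdx.x; t < len * d; t += blockDim.x)
            tile[t] = data[base * d + t];
        __syncthreads();

        if (active) {
            for (int q = 0; q < len; ++q) {
                if (base + q == tid) continue;      // do not compare a point to itself
                bool pbetter = false, qbetter = false;
                for (int i = 0; i < d; ++i) {
                    pbetter |= (p[i] < tile[q * d + i]);
                    qbetter |= (tile[q * d + i] < p[i]);
                }
                if (qbetter && !pbetter) {          // p is dominated
                    dominated[atomicAdd(num_dominated, 1)] = tid;
                    active = false;                 // stop testing, keep hitting the barriers
                    break;
                }
            }
        }
        __syncthreads();                            // before the next chunk is loaded
    }
}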
3.2
Parallel skycube computation
Now that we have a GPU-parallel skyline algorithm, we focus our attention on computing the skycube in parallel. As mentioned in Section 2.3 this is an NP-hard problem. However, as described in Section 2.3.2, some techniques can be applied to reduce the amount of redundant computation, which we will also explore here.
We develop the parallel skycube algorithm by first presenting a naive computation using GGS, and then iteratively improving the solution. To structure the cuboids we apply the same lattice structure that previous work has used to represent the skycube [24, 12]. In general this lattice has been iterated either bottom up or top down, each approach yielding different advantages. The BUS algorithm [24] iterates the lattice bottom up, sharing results by not performing dominance checks for points that are known to be in the skyline. The set of data points to skip dominance checks for is found by taking the union of the skylines of the child cuboids. Such an approach is not easily adaptable to the GPU, since bypassing dominance tests will typically lead to branching or an idle thread. This is the case because we still need these points to prune other points in our GGS algorithm, so we cannot simply leave them out of the computation.
The QSkycube [12] and TDS [24] algorithms both iterate the lattice top down, the former in a breadth-first manner and the latter in a depth-first manner. They both share results by using the skyline of the parent cuboid as the input dataset for the current cuboid computation. This is an approach that can be adapted to a GPU setting, since the sharing of results is done in between the cuboid calculations, which means that we do not need to micro-manage the dominance tests as in the BUS approach. As such we will be doing our skycube computations in a top-down manner.
3.2.1
The lattice as a datastructure
Figure 3.1 shows the lattice that the original skycube paper [24] used to visualize the skycube. It can also be found in Section 2.3.1, but we repeat it here for convenience. While the original paper [24] only used the lattice for visualizing the skycube, it was used as an actual data structure in the QSkycube algorithm [12]. This makes good sense, since it is natural to iterate it the same way one would iterate a tree. We also intend to use the lattice as a data structure, and so we introduce the terminology we will be using. As can be seen in Figure 3.1 there are links between subspaces and superspaces. We say that a cuboid V is a parent of another cuboid U if there is a link between the two and U ⊂ V. Similarly, we say that U is a child of V. For example, ABD is a parent of AD, which in turn is a parent of A. In the lattice data structure each node has links to both children and parents, so that we can easily access both when needed.
[Figure 3.1: A skycube lattice. Level 1: ABCD; level 2: ABC, ABD, ACD, BCD; level 3: AB, AC, AD, BC, BD, CD; level 4: A, B, C, D.]
3.2.2
The naive approach
The most naive way of computing the skycube is to simply iterate the lattice and compute each of the 2^d − 1 cuboids separately. In the case of GGS, this also includes transferring data to the device 2^d − 1 times, since the algorithm "destroys" the original dataset during the de-fragmentation step. The impact of this can of course be reduced by only transferring the dimensions needed for each cuboid, but the data transfer overhead will still be large.
The iteration order of the lattice has no effect in this case, since each computation is done in isolation from the others. It should be clear that this solution will always work, although it might be very slow due to the data transfer overhead. A better solution is to only transfer the data once. This is our first improvement. We obtain it by skipping the de-fragmentation step in the original algorithm. Instead, we just feed the set of active dimensions to each thread, and read the indices of non-pruned data directly from the Dh array. This approach trades coalesced reading against data transfer across the pci-e slot, since the data read by each thread will no longer be coalesced. However, we can reduce the impact of this trade-off by still guaranteeing coalesced reading of the data that is read the most, namely the first α data points. While each active data point is read exactly once in each iteration, the α points are read once for each thread block. We therefore introduce a step before the parallel run in each iteration which, in parallel and on the device, transfers the α data points to a separate utility array which is then read in a coalesced manner by each block of threads. We present this approach in Algorithms 3 and 4. First, in lines 2-4 of Algorithm 3 we initialize the lattice and transfer the dataset to the device. Next, in lines 5-8 we calculate the skyline of all non-empty dimension subsets one by one, and store them in the lattice. The actual calculation takes place in Algorithm 4, which is a modified version of GGS as described above. Specifically, it now takes a set of indices as input rather than the dataset itself. Furthermore, we have added the active set of dimensions to the input parameters. In the algorithm we have added the utility array in line 9, and we transfer the first α points to it in lines 10-11. This naive skycube algorithm will be the basis for our optimizations in this section.
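As a sketch of the utility-array step (lines 10-11 of Algorithm 4), a kernel of the following shape could gather the first α active points into a contiguous buffer A; the kernel and parameter names are ours, not the thesis code.

__global__ void gather_alpha(const float *data, const int *Dh_indices,
                             int alpha, int d, float *A)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= alpha * d) return;
    int point = t / d;                  // position among the first alpha active points
    int dim   = t % d;
    A[t] = data[Dh_indices[point] * d + dim];
}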
Algorithm 3: NaiveSkycube
Input : D: A dataset
        V: The full dimensionality of D
        α: Amount of points to compare against in each iteration
Output: The skycube of the dataset
 1 begin
 2   Latsize ← 2^|V| − 1;
 3   Lat ← Latsize;
 4   transfer D to device;
 5   for ∀U ⊆ V so that U ≠ ∅, in a level-wise and top-down manner w.r.t. Lat do
 6     Dh ← D;
 7     Sky ← SubsetGGS(Dh, U, α);
 8     Lat ← (U, Sky)
 9 end
3.2.3
Utilizing lattice parent
Now that we have a naive solution with reduced data transfer (i.e., the dataset is only transferred to the device once), we consider how to share results in order to reduce the amount of computation needed. Previous work [24, 12] has restricted the algorithms to the special case of the distinct value condition, and then added a secondary structure in order to ensure correctness for the general case. We take a different approach and aim to integrate support for
Algorithm 4: SubsetGGS
Input : D: A set of indices indicating active points
        U: the subset of dimensions for which to calculate the skyline
        α: Amount of points to compare against in each iteration
Output: The skyline of the dataset indicated by D on the dimension subset U
 1 begin
 2   Sky ← ∅;
 3   Dh ← D;
 4   transfer Dh to device;
 5   sort Dh w.r.t. U;
 6   while Dh ≠ ∅ do
 7     D0 ← {pi ∈ Dh | i ∈ 0...α};
 8     Ddom ← ∅;
 9     A ← ∅;
10     for qi ∈ {D | i ∈ 0...α} do in parallel
11       A ← qi;
12     for ∀p ∈ Dh do in parallel
13       for qi ∈ A do
14         if BranchFreeDominance(qi, p, U) then
15           Ddom ← p;
16           return;
17     transfer Ddom to the host;
18     Dh ← Dh \ Ddom;
19     Sky ← Sky ∪ (Dh ∩ D0);
20     Dh ← Dh \ D0;
21     transfer Dh to device;
22     D ← Sky
23 end
the general case directly in our algorithms.
The traversal order of the lattice is important because it allows reuse of previous results to speed up the remaining calculations. Specifically, for a given subset of dimensions U ⊂ V we can use the result of the parent V as input for the computation of U. This has been shown not to produce false negatives under the distinct value condition in [24]. However, since we want to be able to compute the skycube in the general case, we need to ensure that we include all possible candidates in the result retrieved from the parent. To do this we now introduce the strict dominance relation (first presented in [21]).
Definition 6 Strict Dominance
Given a set of d-dimensional points D, and two points p, q ∈ D, p strictly dominates q if and only if ∀i ∈ d : pi < qi. We write this as p ≺ q.
If we substitute this dominance relation for the one we have used so far, we obtain what is known as the extended skyline [21]. This is a superset of the actual skyline, and it contains the skyline points of all possible subsets of the data dimensions (also proven in [21]). That is, for every subset U ⊂ V, the skyline of U is contained in the extended skyline of V. In fact, if the distinct value condition happens to be fulfilled, the extended skyline and the skyline are the same. We suggest computing both the extended and the actual skyline as we iterate the lattice. The extended skyline can then be retrieved from parents in the lattice and used as the input dataset in the children, without any false negatives and without any assumptions on the data. To minimize storage requirements, we store the extended skyline as the skyline plus the difference between the extended skyline and the skyline. This way the two sets can be combined to form the extended skyline when needed, without storing the information on any skyline point more than once.
In order to utilize the extended skyline, we use the dominance test in Algorithm 5 instead of the one we have used so far. We have made two changes to integrate the extended skyline. First, we have added an extra input parameter which is updated in line 9 each time the dominance test completes; it indicates whether a given point belongs to the normal skyline. Second, we have added an equality check in the for loop, and changed the return value to represent whether p strictly dominates q. We therefore have the ability to compute both dominance and strict dominance in one run. This dominance check is then used in Algorithm 8, which is a modified version of Algorithm 4. We have added a parameter so that we can also return the extended skyline; however, the major difference is that we now compute the extended skyline rather than the normal skyline. As a consequence, points are only eliminated in lines 21-23 if they are dominated w.r.t. the strict dominance relation. We compute the actual skyline as a side effect by introducing a boolean variable in line 14. This variable is updated by the dominance check, and reflects whether the point p is dominated w.r.t. the normal dominance relation. If the point p is not strictly dominated, we then check whether it would have been dominated in lines 18-19, and record it on the host in line 24. Next, when the while loop is done, we obtain the actual skyline by taking the set difference between the extended skyline and the recorded points in line 26. That is, we remove the
points from the extended skyline that would have been dominated in a normal skyline computation. Once we have this set, we can execute the same computation to get those points that belong to the extended skyline but not to the skyline, in line 27. This way we reduce the space occupied, while still retaining all the needed information. Algorithm 8 is used by Algorithm 6 to do the skycube calculation. Just as Algorithm 8 is a modification of Algorithm 4, Algorithm 6 is a modification of Algorithm 3. The main addition is the if-else statement in lines 6-10. These lines check whether we are computing the full skyline or a subspace skyline. In the first case we work on the entire dataset, while in the latter case we use the extended skyline of a parent as input for the computation. We also initialize and save the extended skyline in lines 11 and 13 respectively. This is an adaptation of an idea that was first used in the TDS algorithm [24], with the alteration that TDS computes the skyline while we compute and utilize the extended skyline. However, the authors of [12] observed that there was room for more sharing of information. They observed that since we can use any superspace skyline for the computation of a subspace skyline, the subspace skyline must be contained in all of the superspace skylines (again, under the distinct value condition). As such, they chose to take the intersection of all superspace skylines as input to the computation of the subspace skyline. We now observe that the same rule applies to the extended skyline, and we can thus modify Algorithm 6 and obtain Algorithm 7. The difference is that we now use the intersection of the extended skylines of all the parents in the lattice as input to the skyline computation. This intersection happens in lines 8-9.
Algorithm 5: StrictBranchFreeDominance
Input : Two d-dimensional data points p and q
        a boolean dominated
        a subset of dimensions u ⊆ d
Output: Whether p strictly dominates q on u
 1 begin
 2   bool pbetter ← false;
 3   bool qbetter ← false;
 4   bool eq ← false;
 5   for i ∈ u do
 6     pbetter ← pi < qi ∨ pbetter;
 7     qbetter ← qi < pi ∨ qbetter;
 8     eq ← pi == qi ∨ eq;
 9   dominated ← (¬qbetter ∧ pbetter) ∨ dominated;
10   return ¬qbetter ∧ ¬eq ∧ pbetter
11 end
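For reference, Algorithm 5 maps directly to a small CUDA device function. The following is our sketch of it, not the thesis code: dims holds the indices of the active dimensions in u, and lower values are assumed to be better.

__device__ bool strict_branch_free_dominance(const float *p, const float *q,
                                             const int *dims, int u_size,
                                             bool *dominated)
{
    bool pbetter = false, qbetter = false, eq = false;
    for (int k = 0; k < u_size; ++k) {
        int i = dims[k];
        pbetter |= (p[i] < q[i]);
        qbetter |= (q[i] < p[i]);
        eq      |= (p[i] == q[i]);
    }
    *dominated = (!qbetter && pbetter) || *dominated;   // normal dominance, as a side effect
    return !qbetter && !eq && pbetter;                  // strict dominance of q by p
}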
Algorithm 6: SkycubeSingleParent
Input : D: A dataset
        V: The full dimensionality of D
        α: Amount of points to compare against in each iteration
Output: The skycube of the dataset
 1 begin
 2   Latsize ← 2^|V| − 1;
 3   Lat ← Latsize;
 4   transfer D to device;
 5   for ∀U ⊆ V so that U ≠ ∅, in a level-wise and top-down manner w.r.t. Lat do
 6     if U = V then
 7       Dh ← D;
 8     else
 9       select a random superspace Y of U so that |Y| − |U| = 1;
10       Dh ← Lat(Y).Skyext ∪ Lat(Y).Sky;
11     Skyext ← ∅;
12     Sky ← SubsetGGSExtended(Dh, U, α, Skyext);
13     Lat ← (U, Sky, Skyext)
14 end
Algorithm 7: SkycubeMultiParent
Input : D: A dataset
        V: The full dimensionality of D
        α: Amount of points to compare against in each iteration
Output: The skycube of the dataset
 1 begin
 2   Latsize ← 2^|V| − 1;
 3   Lat ← Latsize;
 4   transfer D to device;
 5   for ∀U ⊆ V so that U ≠ ∅, in a level-wise and top-down manner w.r.t. Lat do
 6     Dh ← D;
 7     if U ≠ V then
 8       for ∀Y so that U ⊂ Y ∧ |Y| − |U| = 1 do
 9         Dh ← Dh ∩ (Lat(Y).Skyext ∪ Lat(Y).Sky);
10     Skyext ← ∅;
11     Sky ← SubsetGGSExtended(Dh, U, α, Skyext);
12     Lat ← (U, Sky, Skyext)
13 end
Algorithm 8: SubsetGGSExtended
Input : D: A set of indices indicating active points
        U: the subset of dimensions for which to calculate the skyline
        α: Amount of points to compare against in each iteration
        Skyext: A set used to store the extended skyline
Output: The skyline of the dataset indicated by D on the dimension subset U
 1 begin
 2   Sky ← ∅;
 3   Dh ← D;
 4   transfer Dh to device;
 5   sort Dh w.r.t. U;
 6   while Dh ≠ ∅ do
 7     D0 ← {pi ∈ Dh | i ∈ 0...α};
 8     Ddomext ← ∅;
 9     Ddom ← ∅;
10     A ← ∅;
11     for qi ∈ {D | i ∈ 0...α} do in parallel
12       A ← qi;
13     for ∀p ∈ Dh do in parallel
14       dominated ← false;
15       for qi ∈ A do
16         if StrictBranchFreeDominance(qi, p, dominated, U) then
17           Ddomext ← p; return;
18       if dominated then
19         Ddom ← p
20     transfer Ddom and Ddomext to the host;
21     Dh ← Dh \ Ddomext;
22     Skyext ← Skyext ∪ (Dh ∩ D0);
23     Dh ← Dh \ D0;
24     Sky ← Sky ∪ Ddom;
25     transfer Dh to device;
26   Sky ← Skyext \ Sky;
27   Skyext ← Skyext \ Sky;
28   return Sky;
29 end
3.2.4
Minimizing sorting
So far we have been sorting the input set of indices 2^d − 1 times, once for each cuboid. While this sorting is done very efficiently on the GPU, the overhead still remains in each cuboid computation. In the BUS algorithm [24] it is suggested that the amount of sorting needed can be reduced from 2^d − 1 to d sorts. This is done by first sorting the dataset on each of the individual dimensions, and then using one of these sorted orders to organize the input of the computation for each cuboid. While such a sort reduces the pruning efficiency, it also minimizes the computational overhead of the sorting. We therefore suggest this approach as an alternative to sorting before each cuboid computation. However, we do this with a couple of modifications. First, as we have observed previously, the extended skyline of the full dataset contains all skyline points of all the cuboids. We therefore suggest to first compute the extended skyline of the full dimensionality as before, and then sort the resulting dataset on each dimension. This leads to d + 1 sorts, but the last d sorts are performed on a reduced dataset (i.e., the extended skyline), and the resulting lists of indices, which will be used for all subsequent computations, will be smaller (except for the special case where the skyline consists of the entire dataset).
The suggested sorting works when we assume the distinct value condition. In the general case, we need to ensure strict monotonicity on dimensions with duplicate values. To do this, we extend the sorting by also recording which dimensions contain distinct values. This information is then used when we choose a list to use as the basis for the computation of a cuboid. Here we have two cases. In the simple case, one of the dimensions in the cuboid corresponds to a sorted list with distinct values. In this case we choose this list as our sorted order, and do the skyline computation for the cuboid. In the other case, we choose a random dimension of the cuboid and iterate the corresponding sorted list. During the iteration, we look for values on the chosen dimension that are equal. When we find equal values, we sort each such group by the Manhattan norm on the dimensions of the cuboid. This is necessary since we need a strict order for our sort-based approach to work: when the data in the sorted list is equal on the sorting dimension, we do not know the order of data with the same values on that dimension. By sorting these local sets by the Manhattan norm we establish the monotonicity, and thus produce an ordering which we can use for the cuboid computation.
We implement this by two modifications to Algorithm 7, which are presented in Algorithm 9. First, in lines 9-13 we choose the sorted list to use as discussed above. The second modification is added in lines 19-21, where we do the actual sort on each of the dimensions. Notice that we use the extended skyline as input rather than the full dataset, as explained earlier. The last modification needed is simply to ignore line 5 (sorting) in Algorithm 8 for all but the first computation.
Algorithm 9: SkycubePresort
Input : D: A dataset
        V: The full dimensionality of D
        α: Amount of points to compare against in each iteration
Output: The skycube of the dataset
 1 begin
 2   Latsize ← 2^|V| − 1;
 3   Lat ← Latsize;
 4   SortedLists ← |V|;
 5   transfer D to device;
 6   for ∀U ⊆ V so that U ≠ ∅, in a level-wise and top-down manner w.r.t. Lat do
 7     Dh ← D;
 8     if U ≠ V then
 9       if ∃d ∈ U so that SortedLists[d].distinct = true then
10         Dh ← SortedLists[d];
11       else
12         choose random d ∈ U;
13         Dh ← iterate SortedLists[d] and sort equal values on U;
14       for ∀Y so that U ⊂ Y ∧ |Y| − |U| = 1 do
15         Dh ← Dh ∩ (Lat(Y).Skyext ∪ Lat(Y).Sky);
16     Skyext ← ∅;
17     Sky ← SubsetGGSExtended(Dh, U, α, Skyext);
18     Lat ← (U, Sky, Skyext);
19     if U = V then
20       for d ∈ V do
21         SortedLists ← sort Skyext w.r.t. d;
22 end
3.2.5
Parallelization of the CPU part
Next we turn our attention to the CPU part of the algorithm. Specifically, we observe that the number of superspace extended skylines that need to be intersected in Algorithm 7 can become large as the dimensionality of the dataset increases. Furthermore, almost all modern computers have multi-core CPUs, which allow efficient task-parallel computations. Thus, we suggest computing the intersection of superspace skylines in parallel with the GPU computation. This can easily be done by modifying Algorithm 7, as we have done in Algorithm 10. In lines 6-8 we first compute the full skyline, and put the extended skyline in a variable Dnext. This is then copied to Dh in line 11, and used for the actual skyline computation in lines 18-20. However, before this, we start a separate CPU thread in lines 12-16, which calculates the input for the next run of the algorithm while the GPU computes the current cuboid. In case the GPU parallel run finishes before the CPU thread terminates, we join with it in line 21.
Algorithm 10: SkycubeCpuParallel
Input : D: A dataset
        V: The full dimensionality of D
        α: Amount of points to compare against in each iteration
Output: The skycube of the dataset
 1 begin
 2   Latsize ← 2^|V| − 1;
 3   Lat ← Latsize;
 4   transfer D to device;
 5   Dh ← D;
 6   Skyext ← ∅;
 7   Sky ← SubsetGGSExtended(Dh, V, α, Skyext);
 8   Lat ← (V, Sky, Skyext);
 9   Dnext ← Lat(V).Skyext ∪ Lat(V).Sky;
10   for ∀U ⊂ V so that U ≠ ∅, in a level-wise and top-down manner w.r.t. Lat do
11     Dh ← Dnext;
12     Z ← next subset of Lat in sorted order;
13     Cputhread(Z, Dnext):
14       Dnext ← Dh;
15       for ∀Y so that Z ⊂ Y ∧ |Y| − |Z| = 1 do
16         Dnext ← Dnext ∩ (Lat(Y).Skyext ∪ Lat(Y).Sky);
17     end thread;
18     Skyext ← ∅;
19     Sky ← SubsetGGSExtended(Dh, U, α, Skyext);
20     Lat ← (U, Sky, Skyext);
21     Join cpu;
22 end
3.2.6
Optimizing for incomparable points
While the usage of extended skylines can potentially speed up the computation of the skycube, it also significantly changes the ratio of skyline to non-skyline points in the data, both extended and not. The result is that as we reach the lower levels of the lattice, relatively few data points are pruned for each cuboid computation. Since all points in a skyline are either incomparable or equal to each other, computations can be saved if we are able to bypass dominance checks between points that are known to be incomparable. This observation is the basis of the work presented in [11], which we now review. The basic idea is to first choose a pivot point in the data. Next, each of the remaining data points is distributed into one of 2^d − 1 groups based on this pivot point. The distribution is done by assigning a binary vector to each of the data points. This vector is created by setting the ith bit to 1 if a given data point is strictly worse than the pivot point on the ith dimension; otherwise it is set to 0. This distribution provides several opportunities for reducing the amount of computation needed. First, all points that are assigned a vector of all 1's are strictly dominated by the pivot point, and can therefore be pruned safely. Secondly, if the binary vectors of two data points p and q have the property ∃i, j ∈ d so that p.Bi > q.Bi ∧ p.Bj < q.Bj, then p and q are incomparable and thus do not need to be tested for dominance. Thirdly, a data point p can only dominate another data point q if the binary vectors have the property ∀i ∈ d : p.Bi ≤ q.Bi. This is known as partial dominance and we write it as p.B ≼par q.B. These properties were proven in [11], and it is clear that such a mapping provides a powerful tool to bypass computations. The original paper presented two algorithms utilizing this principle. The first uses the distribution as a preprocessing step to a sort-based algorithm: once the binary vectors have been assigned, they can be used to bypass computations as the data is iterated to compute the skyline. This is done by comparing the binary vectors before each dominance test, since comparing the binary vectors is much cheaper. The other algorithm recursively partitions the data using the presented method (but with a new pivot point for each recursive call), until all groups contain only one element. These groups can then be efficiently merged to form the skyline. In fact, the current state-of-the-art skycube algorithm, QSkyCube [12], is heavily based on the partition-based approach from [11].
Since threads are not able to launch new threads on the GPU, it is not feasible to adapt the recursive partition-based algorithms. However, with some modifications, we can adapt the sort-based approach. We do this by first sorting the data w.r.t. the Manhattan norm as usual. Next, we use the first point in this sorted order as the pivot point, and assign a binary vector to each data point as described above. This can also be done in parallel. We can then use a stable sort to sort the indices w.r.t. the binary vectors. The difference between a stable sort and a normal sort is that the original order is maintained for data that is equal on the sorting criteria. We thus obtain a grouping w.r.t. the binary vectors, where each group internally remains sorted w.r.t. the Manhattan norm. We can then compute the skyline and extended skyline in two steps. First, each thread iterates its own group from the first point in the group until it reaches its own index in the group. If the associated point p has not been pruned yet, the thread then iterates over the other groups, and tests p against the skyline points of those groups. A thread skips a group if the group's binary vector is incomparable to, or does not partially dominate, its own. That is, before iterating a group, the thread tests its own binary vector against the binary vector of the group to see if the group may be skipped.
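A sketch of the binary-vector assignment and the partial-dominance test on two masks is given below. It assumes d ≤ 32 so the vector fits in an unsigned int, and the names are ours rather than the thesis code.

__global__ void assign_binary_vectors(const float *data, int n, int d,
                                      const int *sorted_idx, unsigned int *binary)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    const float *pivot = &data[sorted_idx[0] * d];   // first point in sorted order
    const float *p     = &data[sorted_idx[t] * d];
    unsigned int mask = 0;
    for (int i = 0; i < d; ++i)
        if (p[i] > pivot[i]) mask |= (1u << i);      // strictly worse than the pivot on i
    binary[t] = mask;                                // a point with all bits set is pruned
}

// p.B partially dominates q.B iff p's mask has no bit set where q's mask is 0,
// i.e. for all i: p.B_i <= q.B_i.
__host__ __device__ inline bool partially_dominates(unsigned int pB, unsigned int qB)
{
    return (pB & ~qB) == 0;
}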
We adapt this approach in Algorithm 11. In lines 2-5 we initialize the arrays we need for the computation. Next, we sort the array of indices according to the Manhattan norm and assign a binary vector to each point in lines 6-8. We then use a stable sort from the thrust library [1] to sort both Dh and Dbinary w.r.t. Dbinary on the device in line 9. Once the sort is done, we transfer Dbinary back to the host in line 10 and iterate it in line 11. In this iteration we record the first index of each binary value in Dstart and the size of each group in Dsize. We then transfer these arrays to the device in line 12, so that we can use them in the parallel run. We do a single parallel pruning run in lines 13-32. In lines 14-15 we initialize the variables needed for each point. Next, each thread iterates over its own group from the beginning until it reaches its own index. We only need to iterate over this range, since the points in the group remain sorted w.r.t. the Manhattan norm. As in GGS, we start at the point with the smallest Manhattan norm since it has the best pruning power. In line 17 we test that the next point has not already been pruned, and if the point p gets pruned we indicate it by setting its index to -1 in line 19. This way, we only compare against points that are not known to be pruned. If a thread has not terminated after iterating its own group, it moves on to iterate over the other groups in lines 21-28. It does this by iterating over the binary values, and testing each binary value for partial dominance in lines 21-23. For each binary value that partially dominates p's binary value we iterate over the corresponding group in lines 24-28. Again we iterate over the group beginning at the point with the smallest Manhattan norm, and set the index of p to -1 if it is pruned in line 27. If the thread has not terminated at line 29, then p belongs to the extended skyline, and the variable dominated indicates whether it also belongs to the skyline. This is recorded in lines 29-32. As the last step we transfer the skyline and extended skyline back to the host and return in lines 33-34.
We can then use this algorithm in a skycube computation by simply replacing the call to Algorithm 8 in line 19 of Algorithm 10 with a call to Algorithm 11. While the approach above may be more efficient w.r.t. the number of dominance tests performed, it does have some drawbacks when adapted to the GPU, since the number and sizes of the groups the data is distributed into can vary greatly. While this is not a problem for sequential algorithms, it presents a potential problem for the GPU. The reason for this is that some threads may end up doing far more computations than others. This becomes the case when some threads have to iterate over several large groups to verify that their data point is in the skyline, while others may only have to iterate over a few small groups to reach the same conclusion. This uneven distribution of the workload may lead to long execution times, since the algorithm needs to wait for a relatively small number of threads to finish. As such, this approach trades off a high computational throughput for fewer overall computations. Since the potential for saving computations is larger in large datasets, it is natural to assume that more speed-up will be gained from this approach when the dataset is large. In fact, this approach might be slower than GGS for small datasets.
We therefore suggest one last optimization. We introduce a threshold β so that if the dataset is larger than β we use the partition-based approach, while we use a simplified version of Algorithm 8 if the dataset size is smaller than β. The simplification consists of adjusting α to the dataset size and skipping the sort, so that all data points are compared to all data points. This way, we utilize the saving of computations for large datasets, while maintaining the higher computational throughput for smaller datasets. We test how the value of β affects the computation in Chapter 4.
Shared Memory usage and thread organization
Unlike Algorithm 8, Algorithm 11 cannot reuse the thread and shared memory organization presented for GGS, since it is a very different approach. We therefore present a strategy for how to organize the threads and memory for Algorithm 11 here. We first consider how to assign threads to each of the generated groups. Since the number and sizes of the groups can vary greatly, we must be able to support this. At the same time we still want to minimize branch divergence, since this can decrease performance substantially. We therefore assign whole thread blocks to the groups, until each data point in each group has been assigned a thread. While this might result in thread blocks where only few threads are running (due to small group sizes), it ensures that the threads in a given block always evaluate the same way when iterating over the binary vectors. Thus the threads in a block will all iterate the same data while they are active. Besides avoiding branch divergence, this also allows us to utilize the global memory cache: since all threads in a block always diverge in the same manner, they also always load the same data from global memory. Since the global memory is cached, many load requests will result in a cache hit and thus reduce the impact of the potentially uneven distribution of workload. We executed preliminary experiments and found that 128 is a good thread block size: it is large enough to allow efficient computation when all threads in a block are active, while it is also small enough not to lock too many resources when many threads are inactive because of small group sizes.
We decided to still load the point that each thread is responsible for into shared memory, since it will potentially be accessed many times. However, we were not able to construct an efficient way of loading the points that we compare against into shared memory. The reason is that threads terminate if they do not have a point assigned, or when the point they have assigned is pruned. This does not allow any cooperation in loading data, and so we simply load the data that each thread compares against from global memory, while attempting to utilize the global memory cache as described earlier.
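The block assignment can be illustrated with a small host-side helper that counts how many 128-thread blocks are needed when whole blocks are assigned per group. The function is our sketch, not the thesis code, with group sizes taken from Dsize.

/* Returns the total number of thread blocks to launch when each group
   is covered by whole blocks of block_dim threads (e.g. block_dim = 128). */
int blocks_for_groups(const int *group_size, int num_groups, int block_dim)
{
    int blocks = 0;
    for (int g = 0; g < num_groups; ++g)
        blocks += (group_size[g] + block_dim - 1) / block_dim;  // ceiling division
    return blocks;
}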
Algorithm 11: SpacePartitionCalculation
Input : D: A set of indices indicating active points
        U: the subset of dimensions for which to calculate the skyline
        Skyext: A set used to store the extended skyline
Output: The skyline of the dataset indicated by D on the dimension subset U
 1 begin
 2   Sky ← ∅;
 3   Dh ← D;
 4   Dstart ← ∅, Dsize ← ∅, Dbinary ← ∅;
 5   copy Dh, Dbinary, Sky and Skyext to device;
 6   sort Dh w.r.t. U;
 7   for ∀p ∈ Dh do in parallel
 8     Dbinary ← binary vector for p w.r.t. Dh[0];
 9   stable sort Dh and Dbinary w.r.t. Dbinary;
10   copy Dbinary to the host;
11   iterate Dbinary, recording first index and size for each group in Dstart and Dsize;
12   transfer Dstart and Dsize to the device;
13   for ∀p ∈ Dh do in parallel
14     dominated ← false;
15     pbinary ← Dbinary[p.index];
16     for qi ∈ {D | i ∈ Dstart[pbinary] ... p.index − 1} do
17       if Dh[i] ≠ −1 then
18         if StrictBranchFreeDominance(qi, p, dominated, U) then
19           Dh[p.index] ← −1;
20           return;
21     for ∀j ∈ Dstart do
22       nextbinary ← Dbinary[Dstart[j]];
23       if nextbinary ≼par pbinary ∧ nextbinary ≠ pbinary then
24         for qi ∈ {D | i ∈ Dstart[j] ... Dstart[j] + Dsize[j]} do
25           if Dh[i] ≠ −1 then
26             if StrictBranchFreeDominance(qi, p, dominated, U) then
27               Dh[p.index] ← −1;
28               return;
29     if dominated then
30       Skyext ← p;
31     else
32       Sky ← p;
33   transfer Sky and Skyext to host;
34   return Sky;
35 end
3.2.7
Utilizing CUDA streams
The last optimization we want to explore is based on the findings of preliminary experiments. We have observed that the reduction in dataset size obtained by intersecting multiple parents in some cases reduces the input size for the next computation to the extent that the GPU ends up being underutilized. That is, due to the small input dataset, we cannot parallelize the skyline computation enough to take full advantage of the GPU. To address this we utilize the concept of CUDA streams [14]. The idea of CUDA streams is that one can execute several programs concurrently on the GPU if the resources are available. The normal execution is done on the default stream, and all instructions, such as data copies or function executions, are executed in order. The same is true when using multiple streams: instructions sent to a given stream are executed in order. When multiple streams are used, the GPU has the option of executing tasks from several streams at once, if the resources needed are available. This provides the GPU with enough work, simply by calculating several cuboids at once. The usage of CUDA streams is implemented by starting a CPU thread per available CPU core. Each CPU thread is assigned a CUDA stream, and then executes as it normally would in Algorithm 8. The only restriction we put on this is that the CPU-parallel computation is restricted to one level in the lattice at a time. This is done to ensure that the intersection of parent skylines is still possible (i.e., we need the parent cuboid computations to be done before we can intersect them).
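A minimal host-side sketch of the stream setup is shown below; it is our illustration rather than the thesis code, and the kernel and memcpy calls that each worker would issue are only indicated in comments, since they depend on the cuboid being computed.

#include <cuda_runtime.h>

void compute_level_with_streams(int num_workers)
{
    cudaStream_t *streams = new cudaStream_t[num_workers];
    for (int i = 0; i < num_workers; ++i)
        cudaStreamCreate(&streams[i]);

    // Each worker (one per CPU core) would issue its work on its own stream:
    //   some_cuboid_kernel<<<grid, block, 0, streams[i]>>>(...);        // hypothetical kernel
    //   cudaMemcpyAsync(dst, src, bytes, cudaMemcpyDeviceToHost, streams[i]);
    // and synchronize only its own stream before the next lattice level:
    //   cudaStreamSynchronize(streams[i]);

    for (int i = 0; i < num_workers; ++i)
        cudaStreamDestroy(streams[i]);
    delete[] streams;
}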
3.2.8
Utilizing the constant memory
In all of our GPU-based skycube algorithms, we only transfer the data to the device once. As such, we need to know which dimensions are active for each cuboid calculation. For this purpose we use the constant memory. The constant memory is a cached part of global memory that is capable of broadcasting a value to all threads in a warp in a few clock cycles if all the threads ask for the same address. Before the parallel run of each cuboid calculation we transfer the dimension set to the constant memory. Since the active threads in a given warp all execute the same instructions, they will all ask for the same dimension at the same time, and thus get the value broadcast. It should be mentioned that the constant memory is only as efficient as reported if the requested data is already in the constant memory cache. However, the constant memory cache size is 8KB, so we would need extremely high-dimensional data for this to become an issue.
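A sketch of the constant-memory handoff could look as follows; MAX_DIMS and the symbol names are ours, and cudaMemcpyToSymbol copies the active dimension set before each cuboid kernel launch.

#include <cuda_runtime.h>

#define MAX_DIMS 16                          // assumed upper bound on dimensionality

__constant__ int c_active_dims[MAX_DIMS];    // indices of the dimensions in U
__constant__ int c_num_active;               // |U|

void set_active_dimensions(const int *dims, int count)
{
    cudaMemcpyToSymbol(c_active_dims, dims, count * sizeof(int));
    cudaMemcpyToSymbol(c_num_active, &count, sizeof(int));
}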
3.3
Summary of parallel skyline and skycube computation
In this chapter we have developed an efficient GPU-based skyline algorithm and used it as a basis for developing a skycube algorithm. We have presented several suggestions for optimizing this algorithm, all of which introduce some trade-off. In our naive algorithm we traded off coalesced reading of data to minimize the amount of data transferred to the device. Next we introduced the extended skyline. This allowed us to reduce the input data size for all but one cuboid computation. However, computing the extended skyline introduces an extra value comparison in the dominance check (the equality check), and thus adds extra computations. Furthermore, the extended skyline is larger than the skyline, and therefore it can be more expensive to compute. We also suggested intersecting the extended skylines of all parent cuboids to further reduce the dataset to be processed. This of course comes at the computational cost of intersecting many potentially large lists. To minimize the impact of this we proposed doing the intersection in a separate CPU thread, thus introducing parallel execution on both the CPU and the GPU.
As the next optimization we adopted data partitioning to save comparisons between incomparable data points. This saving comes at the cost of computational throughput, since the partitioning does not create groups of equal size. Recognizing this fact, we introduced a hybrid approach that, based on a size threshold, chooses whether the data partitioning should be done or not. The last approach we presented attempted to fully utilize the GPU by overlapping computations using streams, due to the observation that the GPU may become underutilized when the input to the cuboid calculations becomes small.
Next we will test the different approaches to evaluate how each of the trade-offs affects the computational performance.
Chapter 4
Experiments
In this chapter we evaluate our proposed solutions. We do this by first testing how varying the values of α and β affects the performance of the algorithms that utilize these parameters. Next, we compare our solutions to the current state-of-the-art skycube algorithm, QSkyCube [12], on both synthetic and real datasets. All experiments are performed on an Intel i7-2600 3.4 GHz quad-core processor with 8GB 1333 MHz RAM and an NVIDIA GTX 670 graphics card with 4GB of RAM. The operating system used is Ubuntu 12.04, with the CUDA driver and CUDA runtime system in version 5.0. The authors of [12] have been kind enough to provide the optimized C++ source code for their sequential algorithm, while our algorithms have been implemented in CUDA C. The GGS-based algorithms have been executed with a thread block size of 512, while the partition-based algorithms were executed with a thread block size of 128. These values were chosen since they showed the best performance on the given hardware in preliminary experiments. The data is organized as a one-dimensional array on the device, and all reported runtimes are averages of 10 runs unless otherwise specified.
4.1
Synthetic data
In the skyline research community it has become the de facto standard to test algorithms on synthetic data with three distributions called correlated, anti-correlated and independent. We describe each in more detail below, and give examples of each distribution from the toy dataset presented in Table 1.1.
Correlated If a data point in a correlated dataset is good on one dimension, it tends to also be good on the others. Similarly, if a point is bad on one dimension, it also tends to be bad on the others. That is, there is a relation between the dimensions, so that they score comparably in a skyline query. As such, skylines on correlated data typically contain only a few points, since a few very good points normally dominate a large portion of the data. An example of data that typically is correlated is kilometers per liter and engine size (called Vol(cc) in the table), although the age might also be a factor since older engines tend to be less efficient.
Independent An independent data distribution is a uniform distribution without any relation between the individual dimensions of a given data point. This kind of distribution typically produces a larger skyline than the correlated data, since there are no points that stand out as especially good. An example of independent data could be the number of doors and the age of the car. Indeed, the number of doors does not affect the age of the car, nor does the age affect the number of doors.
Anti-correlated Anti-correlated data is data where points that are good on one dimension tend to be bad on the others. This typically leads to large skylines (larger than those for the independent distribution), since a lot of data points will be incomparable when there are no points that stand out as good on several dimensions. An example of anti-correlated data is price and kilometers driven, since fewer kilometers typically means a higher price.
As can be seen from the list above, the three distributions have two major advantages. First, they replicate common data distributions found in the real world, which makes them very relevant for testing how well an algorithm performs in different cases. Secondly, the resulting skyline size varies greatly for the same dimensionality and cardinality. This is important since the skyline size greatly influences the execution time of any skyline or skycube algorithm. The reason for this is that a point cannot be verified as a skyline point unless it has been compared to all points that might dominate it. Despite our partitioning approach, this set is still significantly larger for large skylines.
All data is generated using a generator provided by the authors of [3].
4.2
Varying α
The first experiment we run is a test of how the value of α (the batch size) affects the execution time of the GGS-based algorithms. Since the algorithms that use the α value all use the same algorithm for computing the cuboids, we have chosen to only run these experiments on Algorithm 10. We test the algorithm on synthetic data generated with correlated, anti-correlated and independent distributions as described in Section 4.1. We vary both the cardinality and the dimensionality for all three distributions to investigate whether these affect the choice of α. The dimensionality is evaluated in the range of 4 to 10, while the cardinality is varied from 10K to 1M.
We have tested values of α ranging from 32 to 8192 for all datasets. Since we knew that this was an optimization problem, we decided the range by running preliminary experiments with values lower and higher than the specified range, until we were sure that we had captured the optimal range. Since most of the change happens at the beginning of the range, we decided to run experiments with smaller intervals for the first 1024 values. Finally, we note that the values tested are all multiples of 32, since the GPU executes threads in warps of 32. Since we copy the α points to a separate utility array, multiples of 32 provide the best thread utilization, since all threads in active warps will be working.
[Figure 4.1: Anticorrelated data — execution time (sec) as a function of α. (a) Cardinality (d = 10); (b) Dimensionality (n = 500K).]
As mentioned previously, the value of α determines how many data points are compared to the entire dataset in each iteration of the GGS-based algorithms. This is an optimization problem, since a high α means few iterations with many threads running, while a small α means more iterations, but potentially with fewer threads running. Since anti-correlated data typically produces a large skyline and correlated data typically produces a small skyline, we would expect these distributions to run faster with a large and a small α respectively, while the independent distribution should fall somewhere in between. We consider each of the data distributions individually.
Figure 4.1 plots the results for the anti-correlated data. Note that the sampling rate on the x-axis is higher in the beginning, as discussed above. We first look at Figure 4.1(a), which varies the cardinality with the dimensionality set at 10. Each line represents a different dataset, and the changes in execution time are caused only by the changes in the α values, which are plotted on the x-axis. It is clear that the value of α has an effect on the running time. Indeed, a small value of α yields a higher running time until we reach an α value of 1024. At this point the running times of all the datasets start to rise slowly. While the effect is less visible for the low dimensionality and cardinality, it is still there, and the best recorded value of α for all tests was 1024. While a tipping point was to be expected, it is very interesting to see that the data produces such a clear candidate for the α value. We also see that the choice of α is not affected by the cardinality of the data. All datasets follow the same general pattern, although the effect in terms of seconds gained or lost is higher for the larger datasets. In Figure 4.1(b) the dimensionality is varied with the cardinality set at 500K. The effect of the dimensionality on skycube algorithms is clearly shown in this figure. Remember that the skycube consists of all 2^d − 1 non-empty subspace skylines; i.e., the 8-dimensional dataset only has 255 subspaces, while the 10-dimensional dataset has 1023. Besides the large difference in execution time between the subspaces, we observe the same pattern.
1024 is also the value for α when we vary the dimensions, although the gain in seconds is higher for the high-dimensional dataset.
[Figure 4.2: Independent data — execution time (sec) as a function of α. (a) Cardinality (d = 10); (b) Dimensionality (n = 500K).]
In Figure 4.2 we look at the independently distributed data. Both plots show a pattern that is very similar to that of the anti-correlated data. The variation in cardinality and dimensionality does not affect the choice of α, which again is optimal at 1024. However, it is worth noticing that a higher α value has a much greater impact on the execution time for independent datasets than it had for the anti-correlated data. This makes sense, since the skyline size is smaller for this distribution; thus, a higher value of α will compare more points unnecessarily in the first iterations of each cuboid calculation.
Figure 4.3 plots the results for the correlated data. Interestingly, we see a different pattern than we did for the two other distributions. While both the independent and anti-correlated data ran fastest at a value of 1024, this distribution clearly runs faster for smaller values of α, both when varying cardinality and dimensionality. This might be expected, since the correlated distribution produces a small skyline, and therefore a lot of non-skyline points are pruned in the first iteration of each cuboid computation. That is, few points are able to prune most of the dataset, and so a small α value can reduce the active dataset very efficiently for this kind of distribution. Hence we confirm that a small value of α is more efficient for correlated data distributions.
[Figure 4.3: Correlated data — execution time (sec) as a function of α. (a) Cardinality (d = 10); (b) Dimensionality (n = 500K).]
It should be noted that the y-axis has been adjusted to the time interval that fits each of the distributions. Since the anti-correlated data produces larger skylines, more data must be compared, and thus the execution time is much higher than that of the independent and correlated data. In fact, while all of the correlated experiments finished in less than a second, regardless of the choice of α, the execution on the anti-correlated data took several hundred seconds for the largest dataset. Since we typically do not know the distribution of the data we will be working on, the goal is to optimize towards good performance on all kinds of data. We have observed that both independent and anti-correlated
distributions performed best with an α value of 1024. Furthermore we observe
that the performance for the correlated data is still very good for an α value of
1024, and that the algorithm runs much faster for this distribution. This leads
us to the conclusion that 1024 is the best choice for α. As such this is the value
we will use in the remainder of our experiments.
4.3
Varying β
Next we look at the effect of varying β, the value that determines when we switch pruning strategy in our hybrid algorithm. We introduced the hybrid algorithm as a consequence of the fact that partitioning is most effective on large input datasets. As such, the hybrid algorithm switches to a simplified version of GGS once the size of the input dataset becomes less than β. As in Section 4.2, we run our experiments on synthetic data with correlated, anti-correlated and independent data distributions while varying dimensionality and cardinality in the same ranges. We vary the value of β in the range between 32 and 8192, as we did in Section 4.2.
Since the correlated distribution is the one that produces the smallest skyline size, we expect this distribution to be the most affected by the hybrid approach, while the independent distribution should be less affected and the anti-correlated distribution should be affected the least. The last two distributions are expected to be less affected since most of the extended subspace skylines computed for these distributions are expected to be large enough for the partition-based strategy to still be efficient. Again we look at the distributions one by one.
First we look at the anti-correlated data in Figure 4.4. As we expected, this distribution is not much affected as we vary the value of β, neither in cardinality nor dimensionality. In fact, we did not register a change in runtime exceeding 1 second. The reason for this is that the partition-based approach remains efficient on most of the subspaces. Therefore we do not observe substantial changes when we introduce the alternative approach for smaller datasets.
[Figure 4.4: Anticorrelated data — execution time (sec) as a function of β. (a) Cardinality (d = 10); (b) Dimensionality (n = 500K).]
For the independent data in Figure 4.5 the picture is a bit different. Here we can actually see the simplified GGS approach affecting the running time. This is especially true for the small datasets when we vary cardinality, but we registered an improvement in the running time on all datasets for large values of β. The fact that the improvement is more visible on smaller datasets confirms our assumption that the partition-based pruning strategy is more efficient for larger datasets. We also observe that the same pattern emerges when varying dimensionality, which suggests that the choice of β is not affected by the dimensionality of the data.
The last distribution we consider is the correlated data in Figure 4.6. This is where we would expect the largest improvement, and indeed this is the case. All execution times drop drastically as the value of β is increased, and we clearly see that the hybrid algorithm improves the running time for the correlated data. This confirms that the partition-based pruning strategy is less efficient for small datasets, which will be discussed in more detail in Section 4.4.
While we have observed a high relative improvement in the execution time for the correlated dataset, the gain in whole seconds across all the distributions is less impressive. We have not observed a full second gained, which shows that the speedup gained from doing fewer computations more or less matches the extra computational throughput of GGS. An analysis of the raw data shows that the highest gain across all the distributions was found at β = 6144, so this will be our β value for the remainder of the evaluation.
Figure 4.5: Independent data. (a) Cardinality (d=10); (b) Dimensionality (n=500K).
Figure 4.6: Correlated data. (a) Cardinality (d=10); (b) Dimensionality (n=500K).
4.4 Comparing algorithms on synthetic data
Now we compare all of our algorithms and the QSkycube algorithm [12] with
each other. We chose to compare our solutions to the QSkycube algorithm since we believe that QSkycube is currently the state-of-the-art sequential
skycube algorithm. We have been able to recreate the reported results for the
algorithm, and to the best of our knowledge, no other skycube algorithm claims
to perform better than QSkycube.
Once again we execute the algorithms on anticorrelated, independent and correlated data. We vary the cardinality of the data from 10K to 1M points, and
the dimensionality from 4 to 12 dimensions. Since we have suggested several approaches, we list the algorithms below and provide a short description of each of them.
Naive is the approach where the data is transferred to the device once, but each cuboid is computed separately without sharing any information.
SingleParent computes both the extended and the normal skyline for each cuboid, and uses the extended skyline from a parent in the lattice as input for the computation of a child cuboid.
MultipleParent extends SingleParent by intersecting all parent extended skylines, rather than just using one parent, as input to the cuboid computation (a sketch of this intersection step is given after the list).
Presort extends MultipleParent by presorting the data on each dimension once, instead of resorting the data for each cuboid computation.
CPUParallel extends MultipleParent by doing the intersection of parent extended skylines in a separate CPU thread.
CudaStreams extends MultipleParent by executing several cuboid calculations in parallel, using a separate CPU thread and a separate CUDA stream for each cuboid.
SpacePartition first partitions the data into several groups according to a pivot point, and then prunes by having a thread for each point iterate over the relevant groups. It also uses the approach presented in MultipleParent to reduce the input data size.
Hybrid combines SpacePartition with a simplified version of Naive to better handle small input datasets.
QSkycube is the state-of-the-art sequential algorithm that we compare our solutions to [12].
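As a rough illustration of the MultipleParent idea, the sketch below intersects the sorted index lists of the parents' extended skylines. The container types and the function name are chosen for the example and do not reflect the exact representation in our implementation.
#include <algorithm>
#include <iterator>
#include <vector>

// Each parent's extended skyline is a sorted list of point indices. A point can
// only survive in the child cuboid if it is present in the extended skyline of
// every parent, so the intersection is a valid (and smaller) input set.
std::vector<int> intersectParentSkylines(const std::vector<std::vector<int>>& parents)
{
    if (parents.empty()) return {};
    std::vector<int> candidates = parents.front();
    for (std::size_t i = 1; i < parents.size(); ++i) {
        std::vector<int> tmp;
        std::set_intersection(candidates.begin(), candidates.end(),
                              parents[i].begin(), parents[i].end(),
                              std::back_inserter(tmp));
        candidates.swap(tmp);
    }
    return candidates;
}
CPUParallel simply runs this intersection in a separate host thread, so that it can overlap with the GPU work for the previous cuboid.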
In the following analysis we will be referring to Hybrid and SpacePartition as
the ’partition based algorithms’, while the term ’GGS based algorithms’ covers
the rest of the GPU based algorithms. We analyse the performance of the algorithms on each of the distributions separately.
Figure 4.7: Anticorrelated data. (a) Cardinality (d=10); (b) Dimensionality (n=500K).
First we look at varying the cardinality for the anticorrelated data in Figure 4.7(a). It is clear that all of our approaches scale better than the QSkycube approach. It is interesting to see that Presort scales worse than the other
GPU based algorithms. This indicates that the more efficient pruning, gained
from sorting the indices for each cuboid, has a larger impact than the time
saved by presorting the indices on each dimension. This is not surprising since
using a sort on a single dimension produces an order with less efficient pruning power, especially in anticorrelated data where points that are good in one
dimension typically are bad in the others. If we look at the other GGS based
algorithms, we see a very small variation in the execution times, and we even see
that Naive is a little bit faster as we scale on dimensionality. This indicates that the extra computations needed to produce the extended skyline have a larger impact on the running time than the reduction in input dataset size that the extended skyline facilitates. This effect might have been expected, since the
anti-correlated data distribution produces large skylines and thus reduces the
impact of using parent extended skylines as input for computations. The last
observation we make on varying cardinality is that partition based algorithms
clearly outperform the GGS based algorithms. This is due to the efficiency
of partitioning the data, which allows a lot of dominance tests to be skipped.
This effect is especially visible in the anticorrelated data, since the skyline size
for this type of distribution tends to be large. As we also observed in Section
4.3 there is almost no difference between SpacePartition and Hybrid, since the
partition based approach is efficient for most of the cuboid computations.
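As a reminder of the dominance tests referred to above, a minimal sketch of the two device functions involved is given below. The row-major float layout, the bitmask encoding of the subspace, and the convention that smaller values are preferred are assumptions made for this example, not a description of our exact kernels.
// p dominates q in the subspace given by 'mask' (bit d set => dimension d is in
// the cuboid): p is no worse in every subspace dimension and strictly better in one.
__device__ bool dominates(const float* p, const float* q, unsigned mask, int dims)
{
    bool strictlyBetter = false;
    for (int d = 0; d < dims; ++d) {
        if (!(mask & (1u << d))) continue;   // dimension not part of this cuboid
        if (p[d] > q[d]) return false;       // p is worse in d
        if (p[d] < q[d]) strictlyBetter = true;
    }
    return strictlyBetter;
}

// For the extended skyline, a point is only discarded if some other point is
// strictly better in every subspace dimension, which keeps duplicate-like points.
__device__ bool strictlyDominates(const float* p, const float* q, unsigned mask, int dims)
{
    for (int d = 0; d < dims; ++d) {
        if (!(mask & (1u << d))) continue;
        if (p[d] >= q[d]) return false;      // not strictly better in d
    }
    return true;
}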
Next we look at varying the dimensionality in Figure 4.7(b). Again our approaches scale much better than QSkycube as we increase the dimensionality.
This figure also demonstrates the impact of computing the 2^d − 1 cuboids very
well, and that the raw parallel computing power of the GPU is a good fit for
these cases. For the GPU based algorithms we see a pattern similar to what
we observed for varying cardinality, and for the same reasons. Presort is still
the slowest due to the inefficient sorting. Naive is still faster than the other GGS based algorithms due to fewer computations in the dominance test, and the partition based algorithms are still the most efficient due to the partitioning strategy. It is interesting to see that CPUParallel runs about 200 milliseconds faster than MultipleParent for all variations of both cardinality and dimensionality, which indicates that the impact of intersecting the extended skylines is not very high.
Figure 4.8: Independent data. (a) Cardinality (d=10); (b) Dimensionality (n=500K).
We now turn to the independent data distribution, for which the results are
plotted in Figure 4.8. We first look at varying the cardinality in Figure 4.8(a), where we again observe that the GPU based approach scales much better than QSkycube, with Hybrid outperforming it by a factor of 5 for the largest dataset.
For the GGS based algorithms it is interesting to see that Naive is much slower than the other approaches for the independent data, while it was slightly faster for the anticorrelated data. This indicates that the extra processing needed to compute the extended skyline and to reduce the input data set is well worth the effort for independent data. Again we observe that Presort is slower than the other GGS based algorithms, although the margin is smaller. The reason for this is that the order obtained by sorting on a single dimension is not as inefficient for independent data, where the value of one dimension does not have a specific relation to the others. However, we see that it is still faster to sort the data for each cuboid than to use the presorted lists. We also observe that MultipleParent runs consistently faster than SingleParent, which means that intersecting parent extended skylines indeed reduces the input dataset. The last observation we make is that the partition based algorithms are still the most efficient, although by a smaller margin. The reason for this is that the smaller input datasets provide less opportunity to bypass computations, which can also be seen from the fact that Hybrid is consistently faster than SpacePartition. Looking
at the varying dimensionality in Figure 4.8(b) we see a picture very similar to
what we saw for anticorrelated data, with the biggest difference being that the
naive approach is now slower than the majority of the GGS based algorithms
as discussed above. The last distribution we look at is the correlated one in
Figure 4.9. If we first look at Figures 4.9(a) and 4.9(b) we can clearly see that
Naive performs much worse than all the others. This is due to the fact that
it recomputes the skyline for each cuboid, while all the other algorithms use
previous results as input for the next computations. Since the skyline is small
for the correlated data, the effect of the latter strategy is seen very clearly. In
fact Naive makes it very hard to compare the remaining algorithms, so Figures
4.9(c) and 4.9(d) show the same results, but without Naive. We first look at
varying cardinality. The first observation we make is that these computations
run extremely fast, and all algorithms finish in less than a second, unlike the independent and anticorrelated data, which took several seconds, if not minutes, for the large datasets. We also notice that QSkycube runs faster than all our algorithms for all but the biggest dataset. One might expect this, since the correlated data creates very small skylines; for example, the size of the full skyline for the 1M records dataset was just 540, which means that the GPU ends up being under-utilized. This observation is backed up by the fact that CudaStreams for the first time runs noticeably faster than MultipleParent. We also observe that Presort outperforms all of the other GGS based algorithms. The reason for this is that the sort applied by Presort becomes efficient for this type of data distribution, since a data point that is good in one dimension will typically also be good in the others. This means that the sorted order of the presorted lists ends up resembling the order obtained by sorting on the Manhattan norm. Therefore Presort achieves a similar efficiency without having to resort the indices for each cuboid. The last observation we make on cardinality is that the partition based computations no longer have similar speeds. Indeed Hybrid is about a factor of 3 faster than SpacePartition. This confirms our
observations from Section 4.3 where we also observed that the biggest effect of
the Hybrid algorithm was on the correlated data.
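To illustrate the ordering referred to here, the small Thrust sketch below sorts point indices by increasing Manhattan norm on the device. The row-major data layout, the non-negative data assumption, and the function name are choices made for this example; it is not the code used by Presort.
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>

// Functor computing the Manhattan norm (sum of coordinates) of point i,
// assuming non-negative, row-major data with 'dims' values per point.
struct ManhattanNorm
{
    const float* data;
    int dims;
    __host__ __device__ float operator()(int i) const
    {
        float sum = 0.0f;
        for (int d = 0; d < dims; ++d) sum += data[i * dims + d];
        return sum;
    }
};

// Returns the point indices sorted by increasing Manhattan norm.
thrust::device_vector<int> sortByManhattanNorm(const thrust::device_vector<float>& points,
                                               int n, int dims)
{
    thrust::device_vector<float> norms(n);
    thrust::device_vector<int> indices(n);
    thrust::sequence(indices.begin(), indices.end());        // 0, 1, ..., n-1

    ManhattanNorm norm = { thrust::raw_pointer_cast(points.data()), dims };
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n),
                      norms.begin(), norm);

    thrust::sort_by_key(norms.begin(), norms.end(), indices.begin());
    return indices;
}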
If we look at the effect of changing the dimensionality in Figure 4.9(d), we see
that Hybrid and Presort scale much better than the other algorithms, especially as we reach 12 dimensions. We also see that CudaStreams scales better than QSkycube, SpacePartition, and the rest of the GGS based algorithms.
This is interesting, because it indicates that the other GGS based algorithms
underutilize the GPU, and so the difference in execution time is purely due to
the gain from intersecting multiple parents.
We end this section by concluding that Hybrid is the best approach for a GPU
based skycube algorithm, since it is the most versatile among the proposed
solutions, and it scales equally well on all distributions, both with respect to
dimensionality and cardinality.
Figure 4.9: Correlated data. (a) Cardinality (d=10); (b) Dimensionality (n=500K); (c) Cardinality (d=10), without Naive; (d) Dimensionality (n=500K), without Naive.
4.5 Comparing algorithms on real data
In Section 4.4, we clearly saw that our hybrid algorithm is the most efficient
among our proposed skycube algorithms, and we therefore only test Hybrid
against QSkycube [12] in this section.
To test the performance on real data we use three datasets. The first is the NBA dataset (http://www.nba.com), an 8-dimensional dataset with a cardinality of 17,264 that represents the yearly performance of professional basketball players in the U.S. The second is the Household dataset (http://www.ipums.org), a 6-dimensional dataset with a cardinality of 127,931 that represents the ratio of expenditure to income for American families. The last dataset, called ColorMoments (http://sci2s.ugr.es/keel), is a 10-dimensional dataset with a cardinality of 68,040 that represents image features extracted from an image collection. These datasets are typical in skyline research and have been used in the evaluation of several algorithms [12, 11, ?, 2]. As for the synthetic data, all reported execution times are averages of 10 runs and are reported in seconds. The results can be found in Table 4.1.
Dataset        Full skyline size   Ratio of skyline points   QSkycube   Hybrid
Household      5774                4.51%                     0.161      0.080
NBA            1796                10.40%                    0.171      0.084
ColorMoments   1533                2.25%                     0.266      0.185
Table 4.1: Run times (in seconds) and properties for the real datasets
We have specified the size of the full skyline since both algorithms use previous
results as input for the next computations, and there are no duplicate values in
the datasets. That is, all cuboid computations except for the first are done on
dataset sizes corresponding to the full skyline size or smaller. We also report the ratio of points in the full skyline compared to the dataset size, since this indicates how correlated the data are. As we can see, both the ColorMoments
dataset and the Household dataset appear correlated, while the NBA dataset
is closer to an independent distribution. If we look at the results, we clearly
see that our Hybrid algorithm is faster for all three datasets, and more than twice as fast for the Household and NBA datasets. This clearly shows that our approach is also superior on real world datasets, even though our results for correlated data in Section 4.4 suggest that QSkycube is faster on correlated data of these sizes. This illustrates why we test on both real and synthetic datasets. While the synthetic datasets are good for observing general tendencies, they do not fully reflect the types of data distributions we see in the real world,
since these typically have some degree of outliers, clusters or other distributional
properties that are not captured by the standard synthetic datasets. We would
also like to note that the NBA and Household datasets were kindly provided
by the authors of Qskycube [12]. This gave us an opportunity to test their
algorithm against the findings in the original paper. Since we run on faster hardware, we would expect the algorithm to run faster in our setup, and indeed it does: the run times reported for the two datasets in [12] were 0.231 and 0.276 seconds, respectively. This verifies that the comparison between QSkycube and Hybrid is fair.
Chapter 5
Conclusion and future work
In this thesis we have investigated the main challenges of doing skyline and
skycube computations on the GPU, and we have developed an efficient skyline
algorithm as well as several efficient skycube algorithms. We have shown that
the most efficient GPU based skycube algorithm uses a partition based pruning
strategy for large datasets, while it relies on a simpler, iteration based strategy
for small datasets. We have shown that our algorithms scale much better than
sequential state-of-the-art algorithms on large, high-dimensional datasets, and
that they significantly outperform state-of-the-art algorithms on computationally demanding data distributions.
5.1 Future work
The GPU memory easily supported the size of the datasets used in Chapter 4.
However, real world datasets can be very large and do not necessarily fit in the device memory. In such cases our current solution does not work, and different strategies must be considered. The CUDA framework supports a feature called Unified Virtual Address Space, which allows the GPU to address host memory directly, and thus to read data directly from the CPU memory. While the PCIe bus might be a bottleneck in such a computation, it would be very interesting to investigate to what extent, and whether it would be
possible to develop a strategy that places the most used data on the GPU while
the rest remains on the CPU. This could for example be done by placing the
data with the smallest Manhattan norm on the GPU while the remainder is
placed on the CPU, since data with a small Manhattan norm is more likely to
be in the skyline.
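As a rough sketch of the kind of setup we have in mind, mapped (zero-copy) host memory can be exposed to a kernel as follows. The kernel is only a placeholder and error handling is omitted; this is not part of our implementation.
#include <cuda_runtime.h>

// Copies data that physically resides in host memory; when 'data' is a mapped
// host pointer, each read travels over the PCIe bus instead of device memory.
__global__ void touchData(const float* data, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i];
}

int main()
{
    const int n = 1 << 20;
    float* hostData = nullptr;   // pinned, mapped host allocation
    float* devView  = nullptr;   // device-side view of the same memory
    float* devOut   = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    cudaHostAlloc(reinterpret_cast<void**>(&hostData), n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&devView), hostData, 0);
    cudaMalloc(reinterpret_cast<void**>(&devOut), n * sizeof(float));

    for (int i = 0; i < n; ++i) hostData[i] = static_cast<float>(i);

    touchData<<<(n + 255) / 256, 256>>>(devView, devOut, n);
    cudaDeviceSynchronize();

    cudaFree(devOut);
    cudaFreeHost(hostData);
    return 0;
}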
The strategy we have developed in this thesis has only been tested with one
GPU. However, the CUDA framework supports using multiple GPUs from the
same process. Since the cuboid computations can be done independently in
each layer of the lattice, it would be interesting to investigate if and how multiple GPUs can be used to further speed up the computation. Alternatively,
this could also be done by having each GPU compute cuboids in a depth first
manner in the lattice, since our experiments show that using multiple parents
for decreasing input dataset size did not yield a significant speed-up.
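A minimal sketch of how the cuboids of one lattice layer could be distributed round-robin over several GPUs from a single process is shown below; the per-cuboid computation is a placeholder stub, and the scheme is only one of the strategies discussed above.
#include <cuda_runtime.h>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the per-cuboid skyline computation on the currently selected
// device; a real implementation would launch the corresponding kernels here.
void computeCuboidOnCurrentDevice(unsigned /*dimensionMask*/) {}

// Distribute the cuboids of one lattice layer round-robin over the available GPUs,
// using one host thread per GPU; cudaSetDevice binds a device to the calling thread.
void computeLayer(const std::vector<unsigned>& cuboidMasks)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<std::thread> workers;
    for (int dev = 0; dev < deviceCount; ++dev) {
        workers.emplace_back([&cuboidMasks, deviceCount, dev] {
            cudaSetDevice(dev);                                // bind this host thread to GPU 'dev'
            for (std::size_t i = dev; i < cuboidMasks.size(); i += deviceCount)
                computeCuboidOnCurrentDevice(cuboidMasks[i]);  // cuboids in a layer are independent
        });
    }
    for (auto& t : workers) t.join();
}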
Bibliography
[1] Thrust parallel data structure library, 2012. http://docs.nvidia.com/cuda/thrust/.
[2] Ilaria Bartolini, Paolo Ciaccia, and Marco Patella. Salsa: computing the skyline without scanning the whole sky. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, CIKM '06, pages 405–414, New York, NY, USA, 2006. ACM. doi:10.1145/1183614.1183674.
[3] S. Borzsony, D. Kossmann, and K. Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421–430, 2001. doi:10.1109/ICDE.2001.914855.
[4] Wonik Choi, Ling Liu, and Boseon Yu. Multi-criteria decision making with skyline computation. In Proceedings of the 13th IEEE International Conference on Information Reuse and Integration (IRI), pages 316–323, 2012. doi:10.1109/IRI.2012.6303026.
[5] Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. Skyline with presorting: Theory and optimizations. In Intelligent Information Processing and Web Mining, volume 31 of Advances in Soft Computing, pages 595–604. Springer Berlin / Heidelberg, 2005. doi:10.1007/3-540-32392-9_72.
[6] Katerina Fotiadou and Evaggelia Pitoura. Bitpeer: continuous subspace skyline computation with distributed bitmap indexes. In Proceedings of the 2008 International Workshop on Data Management in Peer-to-Peer Systems, DaMaP '08, pages 35–42, New York, NY, USA, 2008. ACM. doi:10.1145/1379350.1379356.
[7] Parke Godfrey, Ryan Shipley, and Jarek Gryz. Maximal vector computation in large data sets. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 229–240. VLDB Endowment, 2005. http://dl.acm.org/citation.cfm?id=1083592.1083622.
[8] Kenneth S. Bøgh, Ira Assent, and Matteo Magnani. Efficient GPU-based skyline computation (to appear). In Proceedings of the Ninth International Workshop on Data Management on New Hardware, DaMoN '13, New York, NY, USA, 2013. ACM.
[9] Henning Köhler, Jing Yang, and Xiaofang Zhou. Efficient parallel skyline processing using hyperplane projections. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 85–94, New York, NY, USA, 2011. ACM. http://dl.acm.org/citation.cfm?id=1989323.1989333.
[10] Donald Kossmann, Frank Ramsak, and Steffen Rost. Shooting stars in the sky: an online algorithm for skyline queries. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 275–286. VLDB Endowment, 2002. http://dl.acm.org/citation.cfm?id=1287369.1287394.
[11] Jongwuk Lee and Seung-won Hwang. BSkyTree: scalable skyline computation using a balanced pivot selection. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 195–206, New York, NY, USA, 2010. ACM. doi:10.1145/1739041.1739067.
[12] Jongwuk Lee and Seung-won Hwang. QSkycube: efficient skycube computation using point-based space partitioning. Proceedings of the VLDB Endowment, 4(3):185–196, December 2010. http://dl.acm.org/citation.cfm?id=1929861.1929865.
[13] Ken C. K. Lee, Wang-Chien Lee, Baihua Zheng, Huajing Li, and Yuan Tian. Z-SKY: an efficient skyline query processing framework based on Z-order. The VLDB Journal, 19(3):333–362, 2010. doi:10.1007/s00778-009-0166-x.
[14] NVIDIA. NVIDIA CUDA C Programming Guide, 2013. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[15] OpenCL. Official OpenCL website, 2013. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[16] Dimitris Papadias, Yufei Tao, Greg Fu, and Bernhard Seeger. An optimal and progressive algorithm for skyline queries. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 467–478, New York, NY, USA, 2003. ACM. doi:10.1145/872757.872814.
[17] Sungwoo Park, Taekyung Kim, Jonghyun Park, Jinha Kim, and Hyeonseung Im. Parallel skyline computation on multicore architectures. In Proceedings of the 25th IEEE International Conference on Data Engineering, ICDE '09, pages 760–771, 2009. doi:10.1109/ICDE.2009.42.
[18] Jian Pei, Yidong Yuan, Xuemin Lin, Wen Jin, Martin Ester, Qing Liu, Wei Wang, Yufei Tao, Jeffrey Xu Yu, and Qing Zhang. Towards multidimensional subspace skyline analysis. ACM Transactions on Database Systems, 31(4):1335–1381, December 2006. doi:10.1145/1189769.1189774.
[19] Yufei Tao, Xiaokui Xiao, and Jian Pei. SUBSKY: Efficient computation of skylines in subspaces. In Proceedings of the 22nd International Conference on Data Engineering, ICDE '06, page 65, 2006. doi:10.1109/ICDE.2006.149.
[20] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel Programming. Pearson Education, Inc., first edition, 2005.
[21] A. Vlachou, C. Doulkeridis, Y. Kotidis, and M. Vazirgiannis. SKYPEER: Efficient subspace skyline computation over distributed data. In Proceedings of the 23rd IEEE International Conference on Data Engineering, ICDE 2007, pages 416–425, 2007. doi:10.1109/ICDE.2007.367887.
[22] Akrivi Vlachou, Christos Doulkeridis, and Yannis Kotidis. Angle-based space partitioning for efficient parallel skyline computation. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 227–238, New York, NY, USA, 2008. ACM. doi:10.1145/1376616.1376642.
[23] Ping Wu, Caijie Zhang, Ying Feng, Ben Zhao, Divyakant Agrawal, and Amr El Abbadi. Parallelizing skyline queries for scalable distribution. In Advances in Database Technology - EDBT 2006, volume 3896 of Lecture Notes in Computer Science, pages 112–130. Springer Berlin Heidelberg, 2006. doi:10.1007/11687238_10.
[24] Yidong Yuan, Xuemin Lin, Qing Liu, Wei Wang, Jeffrey Xu Yu, and Qing Zhang. Efficient computation of the skyline cube. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 241–252. VLDB Endowment, 2005. http://dl.acm.org/citation.cfm?id=1083592.1083623.
[25] Shiming Zhang, Nikos Mamoulis, and David W. Cheung. Scalable skyline computation using object-based space partitioning. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 483–494, New York, NY, USA, 2009. ACM. doi:10.1145/1559845.1559897.
Appendix A
GPU glossary
Below is a glossary of the terminology used in the GPU context, with a short description of each term.
SP: Streaming Processor, the name of the execution units (also known as cores) of the GPU.
SM: Streaming Multiprocessor, a collection of SPs which share resources such as memory, the register file and the scheduler.
Thread: The logical execution unit that runs on an SP and executes instructions.
Warp: A collection of 32 threads that execute in lockstep, that is, all threads in a warp execute the same instruction on different data.
Branch divergence: Occurs when the threads in a warp evaluate a branch such as an if-then-else differently, forcing the threads in the warp to be split according to the branching. Each subset of threads of the warp is then executed one by one, rendering SPs idle.
Block: A logical construction used to organize threads. It is organized as a 1D, 2D or 3D matrix, and is executed on a single SM. As such the size of the SM limits the size of the block.
Grid: A logical construction used to organize blocks. It is organized as a 1D, 2D or 3D matrix.
Global memory: The most abundant memory on the GPU, but also the slowest and with a large read/write latency. It is used when transferring data to the GPU from the CPU, and can be accessed by all threads.
Shared memory: 48 KB of memory located on the SM chip. A very fast, low latency memory that is shared by the threads in a block. Typically used for keeping data that the threads need to access more than once, to minimize global memory latency.
Registers: A very scarce resource. Memory that is assigned to, and can only be read/written by, a single thread.
Local memory: If a thread allocates more registers than are available, the remaining registers are allocated in a section of global memory, which is called local memory.
Constant memory: Cached, specialized read-only memory. Can only be written from the CPU. Has the ability to broadcast data to many threads if they request the same address.
Coalesced reading: Refers to accessing data elements in global memory that are adjacent address-wise. This is important since accesses to global memory are served in wide (up to 128-byte) transactions, so several reads can be satisfied by a single transaction.
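As a small illustration of the coalescing entry above, the two kernels below read the same array with a coalesced and a strided access pattern, respectively; they are illustrative only and not part of our implementation.
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 reads are served by a few wide memory transactions.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses 'stride' apart, so the warp's
// reads are scattered over many transactions and bandwidth is wasted.
__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(static_cast<long long>(i) * stride) % n];
}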