Framework for Providing Accuracy and Cost Tradeoffs in Cloud

Venkatram Ramanathan
 Motivation: Evolution of Multi-Core Machines and the Challenges
 Summary of Contributions
 Background: MapReduce and FREERIDE
 Wavelet Transform on FREERIDE
 Co-clustering on FREERIDE
 Conclusion

Performance Increase:
 Increased number of cores with lower clock frequencies
 Cost effective
 Scalability of performance

HPC Environments – Cluster of Multi-Cores

Multi-Level Parallelism
 Across cores within a node – Shared Memory Parallelism (Pthreads, OpenMP)
 Across nodes – Distributed Memory Parallelism (MPI)

Achieving both programmability and performance – a major challenge
 Possible solution: use higher-level / restricted APIs
 Reduction-based APIs
 Map-Reduce
 Higher-level API
 Program a cluster of multi-cores with one API
 Expressive power considered limited
 Expressing computations using reduction-based APIs
 Two algorithms
 Wavelet Transform
 Co-Clustering
 Expressed as reduction structures and parallelized on FREERIDE
 Speedup of 42 on 64 cores for Wavelet Transform
 Speedup of 21 on 32 cores for Co-clustering
 MapReduce
 Map(in_key, in_value) -> list(out_key, intermediate_value)
 Reduce(out_key, list(intermediate_value)) -> list(out_value)
 FREERIDE
 Users explicitly declare a Reduction Object and update it
 Map and Reduce steps combined
 Each data element is processed and reduced before the next element is processed (sketched below)
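To make the contrast concrete, the following is a minimal C++ sketch of the generalized-reduction pattern described above, with illustrative names (ReductionObject, accumulate); it is not the actual FREERIDE interface. The user declares the reduction object explicitly, each element updates it as soon as it is read, and a global combination merges the per-node or per-thread copies.

    // Illustrative C++ analogue of the generalized-reduction pattern
    // (hypothetical names, not the real FREERIDE API).
    #include <cstddef>
    #include <utility>
    #include <vector>

    // The user declares a reduction object explicitly ...
    struct ReductionObject {
        std::vector<double> buckets;
        explicit ReductionObject(std::size_t n) : buckets(n, 0.0) {}
        // ... and updates it in place: the map and reduce steps are fused.
        void accumulate(std::size_t key, double value) { buckets[key] += value; }
    };

    // Local processing: each element is reduced before the next one is read,
    // so no list of (key, intermediate_value) pairs is ever materialized.
    void local_reduction(ReductionObject& ro,
                         const std::vector<std::pair<std::size_t, double>>& chunk) {
        for (const auto& e : chunk) ro.accumulate(e.first, e.second);
    }

    // Global combination: per-node (or per-thread) copies are merged.
    void global_combine(ReductionObject& global, const ReductionObject& local) {
        for (std::size_t i = 0; i < global.buckets.size(); ++i)
            global.buckets[i] += local.buckets[i];
    }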

Wavelet Transform – an important tool in medical imaging
 fMRI – a probing mechanism for brain activation
 Seeks to study behavior across spatio-temporal data

Discrete Wavelet Transform
 Defined for an input of 2^n numbers
 Convolution along the time domain results in 2^n output values
 Steps:
 Pair up the input values
 Store the difference
 Pass the sum
 Repeat until there are 2^n – 1 differences and 1 sum
 Serial Wavelet Transform Algorithm (sketched below)
 Input: a1, a2, a3, a4, a5, a6, a7, a8
 Output: a1-a2, a3-a4, a5-a6, a7-a8,
 a1+a2-a3-a4, a5+a6-a7-a8,
 a1+a2+a3+a4-a5-a6-a7-a8,
 a1+a2+a3+a4+a5+a6+a7+a8
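A minimal C++ sketch of this pair-up / store-the-difference / pass-the-sum recursion (an unnormalized Haar-style transform; the function name serial_wavelet is illustrative):

    #include <cstddef>
    #include <vector>

    // At each level, adjacent pairs are combined: differences are emitted as
    // outputs and sums are passed to the next level, until 2^n - 1 differences
    // and a single total sum remain.
    std::vector<double> serial_wavelet(std::vector<double> a) {
        std::vector<double> out;
        while (a.size() > 1) {
            std::vector<double> sums;
            for (std::size_t i = 0; i + 1 < a.size(); i += 2) {
                out.push_back(a[i] - a[i + 1]);   // store the difference
                sums.push_back(a[i] + a[i + 1]);  // pass the sum
            }
            a.swap(sums);
        }
        out.push_back(a[0]);                      // the final total sum
        return out;
    }

For the 8-element input above this yields exactly the listed outputs: the level-1 differences, then the level-2 and level-3 differences, then the total sum.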



Time series length = T; number of nodes = P
 Time series per node = T/P
 If P is a power of 2:
 T/P – 1 final output values are calculated locally on each node
 So T – P final values in total are produced without communication
 The remaining P values require inter-process communication
 Allocate a reduction object of size P on each node
 Each node updates the reduction object with its contribution
 Global reduction
 The last P values can then be calculated
 Since the output is out of order, the index in the output where each final value needs to go can be calculated (see the sketch below)
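A sketch of this distributed step, assuming MPI as the distributed-memory layer and reusing the serial_wavelet helper above; the structure follows the description on this slide rather than the original FREERIDE code:

    #include <mpi.h>
    #include <vector>

    std::vector<double> serial_wavelet(std::vector<double>);  // from the sketch above

    std::vector<double> distributed_wavelet(const std::vector<double>& local_series,
                                            MPI_Comm comm) {
        int node_id = 0, num_nodes = 1;
        MPI_Comm_rank(comm, &node_id);
        MPI_Comm_size(comm, &num_nodes);

        // Local pass: T/P - 1 differences plus one local sum, no communication.
        std::vector<double> local_out = serial_wavelet(local_series);
        double local_sum = local_out.back();
        local_out.pop_back();

        // Reduction object of size P: each node contributes at its own index.
        std::vector<double> ro(num_nodes, 0.0);
        ro[node_id] = local_sum;

        // Global reduction merges the per-node contributions.
        std::vector<double> sums(num_nodes, 0.0);
        MPI_Allreduce(ro.data(), sums.data(), num_nodes,
                      MPI_DOUBLE, MPI_SUM, comm);

        // The last P output values come from transforming the P per-node sums.
        std::vector<double> last_p = serial_wavelet(sums);
        local_out.insert(local_out.end(), last_p.begin(), last_p.end());
        return local_out;  // values are then scattered to their final output indices
    }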
 Input data distributed among nodes
 Threads within a node share the data
 Size of reduction object: #Threads x #Nodes
 Each thread computes its local final values
 Each thread updates the reduction object at index ThreadID + (#Threads x NodeID), as in the sketch below
 Global combination
 Calculate the last #Threads x #Nodes values from the data in the reduction object
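A sketch of the shared-memory side of this scheme, using std::thread in place of the Pthreads/OpenMP layer of the original setting; names are illustrative, and the global combination across nodes is only indicated in a comment:

    #include <thread>
    #include <vector>

    std::vector<double> serial_wavelet(std::vector<double>);  // from the sketch above

    void hybrid_local_pass(const std::vector<std::vector<double>>& per_thread_chunks,
                           int node_id,
                           std::vector<double>& reduction_object /* size #Threads x #Nodes */) {
        const int num_threads = static_cast<int>(per_thread_chunks.size());
        std::vector<std::thread> workers;
        for (int tid = 0; tid < num_threads; ++tid) {
            workers.emplace_back([&, tid] {
                // Each thread transforms its own chunk of the time series ...
                std::vector<double> out = serial_wavelet(per_thread_chunks[tid]);
                // ... and deposits its chunk sum at ThreadID + #Threads x NodeID.
                reduction_object[tid + num_threads * node_id] = out.back();
                // (the locally final difference values go straight to the output array)
            });
        }
        for (auto& w : workers) w.join();
        // A global combination then merges reduction objects across nodes; the last
        // #Threads x #Nodes output values are computed from the merged array.
    }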
Computation of the last #Threads x #Nodes values is also parallelized
 Local reduction step
 Global reduction step – global array
Size of the reduction object
 Local reduction step: #Threads
 Global reduction step: #Nodes
Output index when iteration I = 0:

Output index when iteration I > 0:

 "term" is the local index of the value calculated in the current iteration
 ChunkID is ThreadID + (NodeID x #Threads)
 I is the current iteration

Experimental Setup:
 Cluster of multi-core machines
 Intel Xeon E5345 CPU – quad core
 Clock frequency 2.33 GHz
 Main memory 6 GB
Datasets
 Varying p, the dimension of the spatial cube, and s, the number of time-steps in the time series
 p = 10; s = 262144 (DS1)
 p = 32; s = 2048 (DS2)
 p = 32; s = 4096 (DS3)
 p = 32; s = 8192 (DS4)
 p = 39; s = 8192 (DS5)

Clustering – grouping together of “similar” objects
 Hard clustering – each object belongs to a single cluster
 Soft clustering – each object is probabilistically assigned to clusters

Co-clustering clusters both words and documents simultaneously
 Involves simultaneous clustering of rows into row clusters and columns into column clusters
 Maximizes mutual information
 Uses the Kullback-Leibler divergence:

KL(p, q) = Σ_x p(x) log( p(x) / q(x) )
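A small helper computing this divergence for discrete distributions stored as vectors (a sketch; names are illustrative):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // KL(p, q) = sum over x of p(x) * log(p(x) / q(x)); terms with p(x) == 0
    // contribute nothing by the usual convention.
    double kl_divergence(const std::vector<double>& p, const std::vector<double>& q) {
        double kl = 0.0;
        for (std::size_t x = 0; x < p.size(); ++x)
            if (p[x] > 0.0 && q[x] > 0.0)
                kl += p[x] * std::log(p[x] / q[x]);
        return kl;
    }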


Input matrix and its transpose are pre-computed
 Input matrix and transpose
 Divided into files
 Distributed among nodes
 Each node gets the same amount of row and column data
rowCL and colCL (the row and column cluster assignments) are replicated on all nodes
 Initial clustering assigned in round-robin fashion for consistency across nodes

In preprocessing,
 pX and pY are normalized by the total sum (sketched below)
 Normalization must wait until all nodes have processed their data
 Each node calculates pX and pY from its local data
 The reduction object is updated with the partial sum and the pX and pY values
 Accumulating the partial sums gives the total sum
 pX and pY are then normalized
 xnorm and ynorm are calculated in a second iteration, as they need the total sum
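A sketch of the local part of this preprocessing pass, assuming pX and pY are the row and column marginals of the data matrix; the structure and names (MarginalsRO, local_marginals) are illustrative, and the global reduction that sums the fields across nodes is only indicated in a comment:

    #include <cstddef>
    #include <vector>

    // Per-node partial marginals: row sums (pX), column sums (pY) and the
    // partial total, to be merged across nodes by a global reduction.
    struct MarginalsRO {
        std::vector<double> pX, pY;
        double total = 0.0;
        MarginalsRO(std::size_t rows, std::size_t cols) : pX(rows, 0.0), pY(cols, 0.0) {}
    };

    // Local pass over this node's rows of the matrix.
    void local_marginals(const std::vector<std::vector<double>>& local_rows,
                         const std::vector<std::size_t>& global_row_ids,
                         MarginalsRO& ro) {
        for (std::size_t i = 0; i < local_rows.size(); ++i)
            for (std::size_t j = 0; j < local_rows[i].size(); ++j) {
                const double v = local_rows[i][j];
                ro.pX[global_row_ids[i]] += v;   // partial row marginal
                ro.pY[j] += v;                   // partial column marginal
                ro.total += v;                   // partial total sum
            }
    }

    // After the global reduction has summed pX, pY and total across nodes,
    // every node normalizes by the now-complete total sum.
    void normalize(MarginalsRO& ro) {
        for (double& v : ro.pX) v /= ro.total;
        for (double& v : ro.pY) v /= ro.total;
    }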

Compressed matrix of size #rowclusters x #colclusters, calculated from local data
 Sum of the values of each row cluster across each column cluster (sketched below)
 Final compressed matrix is the sum of the local compressed matrices
 Local compressed matrices are updated in the reduction object
 Accumulation produces the final compressed matrix
 Cluster centroids are then calculated
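A sketch of computing the local compressed matrix from the local rows and the current cluster assignments; the flattened layout and the names (local_compressed, rowCL, colCL) are illustrative:

    #include <cstddef>
    #include <vector>

    // Entry (r, c) accumulates all local matrix cells whose row belongs to row
    // cluster r and whose column belongs to column cluster c. The matrix is
    // flattened row-major so it can serve directly as a reduction object that
    // is summed element-wise across nodes.
    std::vector<double> local_compressed(const std::vector<std::vector<double>>& local_rows,
                                         const std::vector<std::size_t>& global_row_ids,
                                         const std::vector<int>& rowCL,  // row -> row cluster
                                         const std::vector<int>& colCL,  // column -> column cluster
                                         int num_row_clusters, int num_col_clusters) {
        std::vector<double> comp(num_row_clusters * num_col_clusters, 0.0);
        for (std::size_t i = 0; i < local_rows.size(); ++i) {
            const int r = rowCL[global_row_ids[i]];
            for (std::size_t j = 0; j < local_rows[i].size(); ++j)
                comp[r * num_col_clusters + colCL[j]] += local_rows[i][j];
        }
        return comp;
    }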
 Row clustering
 Reassign clusters – determined by the Kullback-Leibler divergence (see the sketch below)
 Recompute the compressed matrix
 Update the reduction object
 Column clustering – similar
 Objective function – finalized
 Next iteration
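The slide does not spell out the reassignment rule, so the following is only one common form of a KL-based reassignment (as in information-theoretic co-clustering): each row moves to the row cluster whose column-cluster profile, derived from the compressed matrix, is closest in KL divergence to the row's own profile. Names are illustrative.

    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // KL(p, q) as in the earlier sketch.
    static double kl_div(const std::vector<double>& p, const std::vector<double>& q) {
        double kl = 0.0;
        for (std::size_t x = 0; x < p.size(); ++x)
            if (p[x] > 0.0 && q[x] > 0.0) kl += p[x] * std::log(p[x] / q[x]);
        return kl;
    }

    // Pick the row cluster minimizing the divergence between the row's own
    // column-cluster profile and each cluster's profile; column reassignment
    // is symmetric (rows and columns swapped).
    int best_row_cluster(const std::vector<double>& row_profile,
                         const std::vector<std::vector<double>>& cluster_profiles) {
        int best = 0;
        double best_div = std::numeric_limits<double>::max();
        for (std::size_t r = 0; r < cluster_profiles.size(); ++r) {
            const double d = kl_div(row_profile, cluster_profiles[r]);
            if (d < best_div) { best_div = d; best = static_cast<int>(r); }
        }
        return best;
    }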
Algorithm is the same for shared-memory, distributed-memory, and hybrid parallelization
 Experiments conducted on 2 clusters
 env1
 Intel Xeon E5345, quad core
 Clock frequency 2.33 GHz
 Main memory 6 GB
 8 nodes
 env2
 AMD Opteron 8350, 8 cores
 Main memory 16 GB
 4 nodes

2 Datasets
 1 GB dataset – matrix dimensions 16k x 16k
 4 GB dataset – matrix dimensions 32k x 32k

Datasets and their transposes
 Split into 32 files each (row partitioning)
 Distributed among nodes

Number of row and column clusters: 4
Preprocessing stage is a bottleneck for the smaller dataset – it is not compute intensive
 Speedup with preprocessing: 12.17
 Speedup without preprocessing: 18.75
 Preprocessing stage scales well for the larger dataset – more computation
 Speedup is the same with and without preprocessing
 Speedup for the larger dataset: 20.7

Parallelized two data-intensive applications, namely
 Wavelet Transform
 Co-clustering
 Represented the algorithms as generalized reduction structures
 Implemented them on FREERIDE
 Wavelet Transform – speedup of 42 on 64 cores
 Co-clustering – speedup of 21 on 32 cores