
Cassandra and the SIGMOD Contest
Cloud computing group
Haiping Wang
2009-12-19
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
SIGMOD contest 2009
SIGMOD contest 2010
Cassandra overview
• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable
• P2P
• Random reads and random writes
• Written in Java
Data Model
ColumnFamily1  Name: MailList   Type: Simple   Sort: Name
  KEY → columns (Name: tid1, Value: <Binary>, TimeStamp: t1)
              (Name: tid2, Value: <Binary>, TimeStamp: t2)
              (Name: tid3, Value: <Binary>, TimeStamp: t3)
              (Name: tid4, Value: <Binary>, TimeStamp: t4)
ColumnFamily2  Name: WordList   Type: Super    Sort: Time
  SuperColumn "aloha": columns C1, C2, C3, C4 with values V1-V4 and timestamps T1-T4
  SuperColumn "dude":  columns C2, C6 with values V2, V6 and timestamps T2, T6
ColumnFamily3  Name: System     Type: Super    Sort: Name
  SuperColumns hint1, hint2, hint3, hint4, each holding a <Column List>
Column Families are declared upfront; Columns and SuperColumns are added and modified dynamically.
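As a rough illustration of this model, the sketch below (hypothetical class and field names, not Cassandra's actual internal types) shows how a row maps column names to (value, timestamp) pairs, and how a super column family adds one more level of nesting.

// Hypothetical sketch of the data model (not Cassandra's real classes).
import java.util.SortedMap;
import java.util.TreeMap;

class Column {
    final byte[] value;      // opaque binary value
    final long timestamp;    // used to pick the newest version
    Column(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
}

// A simple column family: row key -> (column name -> column), columns sorted by name.
class SimpleColumnFamily {
    final SortedMap<String, SortedMap<String, Column>> rows = new TreeMap<>();

    void insert(String rowKey, String columnName, byte[] value, long timestamp) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(columnName, new Column(value, timestamp));
    }
}

// A super column family adds one more level: row key -> super column name -> columns.
class SuperColumnFamily {
    final SortedMap<String, SortedMap<String, SortedMap<String, Column>>> rows = new TreeMap<>();
}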
Cassandra Architecture
Cassandra API
• Data structures
• Exceptions
• Service API
  – ConsistencyLevel (4)
  – Retrieval methods (5)
  – Range query: returns matching keys (1)
  – Modification methods (3)
• Others
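The sketch below is a hypothetical, simplified client interface: the names get, insert, remove, getKeyRange and the ConsistencyLevel values are illustrative, not the exact Thrift signatures. It only shows the shape of the retrieval and modification methods and how a consistency level is passed with each call.

// Hypothetical, simplified view of the service API (illustrative only).
enum ConsistencyLevel { ZERO, ONE, QUORUM, ALL }   // the slide counts four levels

interface CassandraClientSketch {
    // Retrieval: read one column of a row at the requested consistency level.
    byte[] get(String columnFamily, String rowKey, String columnName, ConsistencyLevel cl);

    // Range query: return row keys in [startKey, endKey).
    java.util.List<String> getKeyRange(String columnFamily, String startKey, String endKey,
                                       int maxResults, ConsistencyLevel cl);

    // Modification: insert/update a column, or remove it.
    void insert(String columnFamily, String rowKey, String columnName,
                byte[] value, long timestamp, ConsistencyLevel cl);
    void remove(String columnFamily, String rowKey, String columnName,
                long timestamp, ConsistencyLevel cl);
}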
Cassandra commands
Partitioning and replication(1)
• Consistent hashing
• DHT
  – Balance
  – Monotonicity
  – Spread
  – Load
• Virtual nodes
• Coordinator
• Preference list
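A minimal sketch of consistent hashing with virtual nodes and an N-replica preference list, assuming MD5 as the ring hash and hypothetical class and method names:

// Minimal consistent-hashing sketch (hypothetical names; MD5 assumed as the ring hash).
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.*;

class ConsistentHashRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>(); // position -> physical node
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    void addNode(String node) {
        for (int v = 0; v < virtualNodes; v++)
            ring.put(hash(node + "#" + v), node);   // each node owns several ring positions
    }

    // Preference list: the first N distinct nodes clockwise from the key's position.
    List<String> preferenceList(String key, int n) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) return replicas;
        for (Map.Entry<BigInteger, String> e : ring.tailMap(hash(key), true).entrySet()) {
            if (!replicas.contains(e.getValue())) replicas.add(e.getValue());
            if (replicas.size() == n) return replicas;
        }
        for (Map.Entry<BigInteger, String> e : ring.entrySet()) {   // wrap around the ring
            if (!replicas.contains(e.getValue())) replicas.add(e.getValue());
            if (replicas.size() == n) return replicas;
        }
        return replicas;
    }

    private static BigInteger hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
            return new BigInteger(1, d);
        } catch (Exception e) { throw new RuntimeException(e); }
    }
}

The first node in a key's preference list acts as the coordinator for that key; with N = 3 the key is stored on the three distinct nodes found clockwise from its hash.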
Partitioning and replication(2)
[Ring figure: the hash space forms a circle; nodes A-F sit on the ring; h(key1) and h(key2) map to ring positions and are stored on the next N = 3 distinct nodes clockwise.]
Data Versioning
• Always writeable
• Multiple versions may coexist
  – put() may return before the update has reached all replicas
  – get() may return several versions of the same object
• Vector clocks capture causality between versions
• Divergent versions are reconciled during reads, by the client
Vector clock
• A list of (node, counter) pairs
  – E.g. [x,2][y,3] vs. [x,3][y,4][z,1]: the second dominates the first, so they form an ancestor/descendant pair
  – [x,1][y,3] vs. [z,1][y,3]: neither dominates, so the versions are concurrent and must be reconciled
• Each pair also carries a timestamp, e.g. D([x,1]:t1, [y,1]:t2)
• When the list reaches a threshold, the pair with the oldest timestamp is removed
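A small sketch of such a vector clock (hypothetical class, with increment, a dominance test, and truncation to a maximum size):

// Vector clock sketch: a list of (node, counter) pairs, stored here as a map.
import java.util.*;

class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    // A node increments its own counter when it coordinates a write.
    void increment(String node) { counters.merge(node, 1L, Long::sum); }

    // true if this clock is <= other for every node, i.e. 'other' descends from 'this'.
    boolean isDominatedBy(VectorClock other) {
        for (Map.Entry<String, Long> e : counters.entrySet())
            if (other.counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        return true;
    }

    // Two versions conflict when neither clock dominates the other.
    boolean concurrentWith(VectorClock other) {
        return !this.isDominatedBy(other) && !other.isDominatedBy(this);
    }

    // Truncation: if the clock grows past maxSize, drop one entry
    // (Dynamo drops the pair with the oldest timestamp; timestamps are omitted here).
    void truncate(int maxSize, String nodeToDrop) {
        if (counters.size() > maxSize) counters.remove(nodeToDrop);
    }
}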
Vector clock
[Figure: version history of an object; when the history branches, all objects at the leaves (e.g. D3 and D4, with clock entries over Sx, Sy, Sz) are returned to the client, which reconciles them into a single new version.]
Execution of operations
• Two strategies for routing a request
  – Through a generic load balancer that selects a node based on load information
    • Easy: the client does not have to link any Cassandra-specific code
  – Directly to the coordinator node for the key (partition-aware client)
    • Achieves lower latency
Put() operation
[Figure: put() flow — the client sends the request to the coordinator; the coordinator writes the object, tagged with its vector clock, to the other replicas P1, P2, ..., PN-1 and acknowledges the client after W-1 of them respond.]
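A rough sketch of that coordinator-side wait, assuming hypothetical replica stubs and a CountDownLatch to count the W-1 acknowledgements:

// Sketch of a coordinator waiting for W-1 replica acks (hypothetical Replica interface).
import java.util.List;
import java.util.concurrent.*;

interface Replica { void write(String key, byte[] value) throws Exception; }

class PutCoordinator {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Sends the write to the N-1 other replicas and returns once W-1 of them have acked.
    boolean put(String key, byte[] valueWithVectorClock, List<Replica> otherReplicas,
                int w, long timeoutMillis) throws InterruptedException {
        CountDownLatch acks = new CountDownLatch(w - 1);
        for (Replica r : otherReplicas) {
            pool.submit(() -> {
                try { r.write(key, valueWithVectorClock); acks.countDown(); }
                catch (Exception ignored) { /* failed replica: no ack */ }
            });
        }
        return acks.await(timeoutMillis, TimeUnit.MILLISECONDS);  // true if W-1 acks arrived
    }
}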
Cluster Membership
• Gossip protocol
• State is disseminated in O(log N) rounds
• Every T seconds, each node increments its heartbeat counter and sends its membership list to another node
• On receipt, the two membership lists are merged
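A compressed sketch of one gossip round under those rules (hypothetical names; a real implementation would add versioning and failure detection):

// One-round gossip sketch: bump own heartbeat, push the view to a random peer, merge on receipt.
import java.util.*;

class Gossiper {
    final String self;
    final Map<String, Long> heartbeats = new HashMap<>();   // node -> highest heartbeat seen
    final Random random = new Random();

    Gossiper(String self) { this.self = self; heartbeats.put(self, 0L); }

    // Called every T seconds.
    Map<String, Long> gossipRound() {
        heartbeats.merge(self, 1L, Long::sum);               // increase own heartbeat counter
        return new HashMap<>(heartbeats);                    // copy of the list sent to one peer
    }

    // Merge a received list: keep the larger heartbeat for every node.
    void onReceive(Map<String, Long> remote) {
        remote.forEach((node, hb) -> heartbeats.merge(node, hb, Math::max));
    }

    String pickRandomPeer(List<String> peers) {
        return peers.get(random.nextInt(peers.size()));
    }
}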
Failure
• Data center(s) failure
– Multiple data centers
• Temporary failure
• Permanent failure
– Merkle tree
Temporary failure
Merkle tree
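For the permanent-failure case, replicas compare Merkle trees built over their key ranges so that only differing ranges need to be transferred. A minimal sketch of building such a tree (hypothetical helper, SHA-1 assumed):

// Merkle tree sketch: hash leaves (key ranges), then hash pairs of children up to a root.
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

class MerkleTreeSketch {
    static byte[] rootHash(List<byte[]> leafHashes) throws Exception {
        if (leafHashes.isEmpty()) return new byte[0];
        List<byte[]> level = new ArrayList<>(leafHashes);
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                md.update(level.get(i));
                if (i + 1 < level.size()) md.update(level.get(i + 1));  // odd node carried up
                next.add(md.digest());
            }
            level = next;
        }
        return level.get(0);
    }
    // Two replicas exchange root hashes; equal roots mean the key range is in sync,
    // otherwise they descend into the children to locate the differing sub-ranges.
}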
Bloom filter
• A space-efficient probabilistic data structure
• Used to test whether an element is a member of a set
• May report false positives, but never false negatives
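A small sketch of such a filter (hypothetical parameters: a fixed bit array and k double-hashing probes):

// Bloom filter sketch: k probe positions derived from two hashes of the key.
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size, k;

    BloomFilterSketch(int size, int k) { this.bits = new BitSet(size); this.size = size; this.k = k; }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(probe(key, i));
    }

    // false -> the key is definitely absent; true -> probably present (false positives possible).
    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) if (!bits.get(probe(key, i))) return false;
        return true;
    }

    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;      // cheap second hash
        return Math.floorMod(h1 + i * h2, size);               // double hashing: h1 + i*h2 mod m
    }
}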
Compactions
[Figure: compaction — several on-disk data files, each sorted by key (e.g. {K1, K2, K3}, {K2, K4, K10}, {K5, K10 DELETED, K30}), are merge-sorted into one new sorted data file, dropping deleted keys. The resulting file comes with an index file mapping each key to its offset (K1, K5, K30; loaded in memory) and a Bloom filter.]
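The merge step itself is a k-way merge of sorted files; below, a simplified two-input version over in-memory sorted maps (hypothetical representation; tombstones modelled as null values):

// Compaction sketch: merge two sorted (key -> value) files, newest wins, tombstones dropped.
import java.util.TreeMap;

class CompactionSketch {
    // 'newer' entries override 'older' ones; a null value marks a deleted key (tombstone).
    static TreeMap<String, byte[]> compact(TreeMap<String, byte[]> older,
                                           TreeMap<String, byte[]> newer) {
        TreeMap<String, byte[]> merged = new TreeMap<>(older);
        merged.putAll(newer);                                  // newer version of a key wins
        merged.values().removeIf(v -> v == null);              // drop deleted keys
        return merged;                                         // still sorted by key
    }
}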
Write
[Figure: write path — a write for a key touches one or more column families (CF1, CF2, CF3). It is first appended to the commit log on a dedicated disk, then applied to the per-column-family memtable. When a memtable exceeds a threshold (data size, number of objects, or lifetime), it is binary-serialized and flushed to a data file on disk in the format <key name><size of key data><index of columns/supercolumns><serialized column family>, together with a block index (<key name>, offset entries such as K128, K256, K384), an in-memory index, and a Bloom filter.]
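A condensed sketch of that path (hypothetical types; the commit log is modelled as an append-only list and the flush threshold as a simple object count):

// Write-path sketch: append to the commit log, update the memtable, flush past a threshold.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();          // append-only, sequential
    private final TreeMap<String, byte[]> memtable = new TreeMap<>();  // in-memory, sorted by key
    private final List<TreeMap<String, byte[]>> dataFiles = new ArrayList<>();
    private final int flushThreshold;

    WritePathSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, byte[] value) {
        commitLog.add(key);                      // 1. durable append to the commit log
        memtable.put(key, value);                // 2. apply to the memtable
        if (memtable.size() >= flushThreshold)   // 3. flush when the threshold is crossed
            flush();
    }

    private void flush() {
        dataFiles.add(new TreeMap<>(memtable));  // written out already sorted by key
        memtable.clear();                        // (index and Bloom filter omitted in this sketch)
    }
}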
Read
[Figure: read path — the client sends a query to the Cassandra cluster; the full result is fetched from the closest replica (Replica A), while digest queries are sent to the other replicas (Replica B, Replica C). If the digest responses differ, a read repair is triggered; the result is then returned to the client.]
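Sketched in code, the digest comparison could look like this (hypothetical interfaces; a digest is just a hash of the value here):

// Read-path sketch: data from the closest replica, digests from the rest, repair on mismatch.
import java.util.Arrays;
import java.util.List;

interface ReadReplica {
    byte[] readData(String key);     // full value
    byte[] readDigest(String key);   // hash of the value
    void repair(String key, byte[] freshValue);
}

class ReadCoordinatorSketch {
    static byte[] read(String key, ReadReplica closest, List<ReadReplica> others,
                       java.security.MessageDigest md) {
        byte[] value = closest.readData(key);
        byte[] digest = md.digest(value);
        for (ReadReplica r : others) {
            if (!Arrays.equals(digest, r.readDigest(key))) {
                // Read repair: push this value (a real implementation reconciles by timestamp).
                r.repair(key, value);
            }
        }
        return value;                            // the result is returned to the client
    }
}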
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
SIGMOD contest 2009
SIGMOD contest 2010
SIGMOD contest 2009
• Task overview
• API
• Data structure
• Architecture
• Test
Task overview
• Index system for main-memory data
• Running on a multi-core machine
• Many threads operating on multiple indices
• Serializable execution of user-specified transactions
• Basic functions: exact-match queries, range queries, updates, inserts, deletes
API
[Figure: the two core structures.
Record: dataType key; int64_t hashKey; char *payload; pointer to the next record in its bucket chain.
HashTable: an array hashTab of size buckets (0 ... size-1), each bucket holding a chain of records; bookkeeping fields size, average (64), deviation, nbEl, domain, warpMode (bool).]
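A rough Java rendering of those structures (the original is C; the field names follow the figure, everything else is an assumption):

// Sketch of the contest hash table: buckets of chained records (field names from the figure).
class Record {
    long key;          // 'dataType key' in the original
    long hashKey;      // int64_t hash of the key
    byte[] payload;    // 'char *payload'
    Record next;       // next record in the same bucket chain
}

class HashTable {
    Record[] hashTab;  // buckets 0 .. size-1
    int size;          // number of buckets
    int nbEl;          // number of stored records

    HashTable(int size) { this.size = size; this.hashTab = new Record[size]; }

    void insert(long key, long hashKey, byte[] payload) {
        int bucket = (int) Math.floorMod(hashKey, (long) size);
        Record r = new Record();
        r.key = key; r.hashKey = hashKey; r.payload = payload;
        r.next = hashTab[bucket];          // push onto the front of the chain
        hashTab[bucket] = r;
        nbEl++;
    }

    Record find(long key, long hashKey) {
        for (Record r = hashTab[(int) Math.floorMod(hashKey, (long) size)]; r != null; r = r.next)
            if (r.key == key) return r;
        return null;
    }
}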
HashShared
[Figure: data of type int spread over the shared hash table's buckets — keys that are equal modulo the bucket count land in the same bucket, e.g. one bucket holds 0, 100, 200, ...; the next holds 1, 101, 201, ...; the next holds 2, 102, 202, ...; and so on.]
NameIndex
[Figure: nbNameIndex entries; each entry holds the index name as a '\0'-terminated string (str) and a pointer (idx) to a typed object of type IdxState.]
TxnState
[Figure: fields state, indexActive, indexToReset, nbIndex, th, iNbR; two of the fields are arrays indexed 0 ... 199.]
IdxState
• Keeps track of an index
• Created by openIndex()
• Destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
  – a hashtable
  – a FixedAllocator
  – an Allocator
  – an array with the type of action
Architecture
• DeadLockDetector
• IndexManager
• Transactor
• Allocator
IndexManager
[Figure: the IndexManager holds hs, nbIndexTab, and an array indexTab[0], indexTab[1], ..., indexTab[i], ... of per-index entries.]
DeadLockDetector
Transactor
• A HashOnlyGet object whose values are of type TxnState
[Figure: a hash table keyed by id whose entries hold id, mutex, iThread, nbElement, and a data pointer (pt) to the TxnState; buckets follow the same modulo-based distribution shown for HashShared.]
Allocator
• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized by payload length; the maximum payload length is 100
• Payloads of the same length are kept in the same list
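A small sketch of that idea in Java (hypothetical: one free list per payload length, lengths capped at 100):

// Pool allocator sketch: one free list of reusable buffers per payload length (0..100).
import java.util.ArrayDeque;
import java.util.Deque;

class PayloadPoolSketch {
    private static final int MAX_PAYLOAD = 100;
    @SuppressWarnings("unchecked")
    private final Deque<byte[]>[] freeLists = new Deque[MAX_PAYLOAD + 1];

    PayloadPoolSketch() {
        for (int len = 0; len <= MAX_PAYLOAD; len++) freeLists[len] = new ArrayDeque<>();
    }

    byte[] allocate(int length) {
        byte[] buf = freeLists[length].poll();          // reuse a buffer of the same length
        return buf != null ? buf : new byte[length];    // or grow the pool
    }

    void free(byte[] buf) {
        if (buf.length <= MAX_PAYLOAD)
            freeLists[buf.length].push(buf);            // return the buffer to its length's list
    }
}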
Unit Tests
• Three threads run over three indices
• The primary thread
  – creates the primary index
  – inserts, deletes and accesses data in the primary index
• The second thread
  – simultaneously runs some basic tests over a separate index
• The third thread
  – ensures the transactional guarantees
  – continuously queries the primary index
Outline
Cassandra
• Cassandra overview
• Data model
• Architecture
• Read and write
SIGMOD contest 2009
SIGMOD contest 2010
Task overview
• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the correct results
• Data are stored on disk; the indexes are all in memory
• The total time cost is measured
SQL query form
SELECT alias_name.field_name, ...
FROM table_name AS alias_name,…
WHERE condition1 AND ... AND conditionN
Condition
alias_name.field_name = fixed value
alias_name.field_name > fixed value
alias_name.field_name1 = alias_name.field_name2
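To make the three condition forms concrete, a hypothetical parsed representation might look like this (the names are illustrative, not the contest API):

// Hypothetical parsed form of a WHERE condition (illustrative only).
class Condition {
    enum Kind { EQUAL_CONSTANT, GREATER_CONSTANT, EQUAL_FIELDS }

    Kind kind;
    String leftAlias, leftField;     // alias_name.field_name on the left-hand side
    String rightAlias, rightField;   // set only for EQUAL_FIELDS (a join condition)
    String constant;                 // set only for the two fixed-value forms

    // "a.id = 42"      -> EQUAL_CONSTANT,   left = a.id,  constant = "42"
    // "a.age > 30"     -> GREATER_CONSTANT, left = a.age, constant = "30"
    // "a.id = b.owner" -> EQUAL_FIELDS,     left = a.id,  right = b.owner
}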
Initialization phase
Connection phase
Query phase
Closing phase
Tests
• An initial computation phase
• Run on synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD
Benchmarks (stage 1)
• Assume a partition always covers the entire table; the data is not replicated
• Unit tests
• Benchmarks
  – On a single node, selects with an equal condition on the primary key
  – On a single node, selects with an equal condition on an indexed field
  – On a single node, 2 to 5 joins on tables of different sizes
  – On a single node, 1 join and a "greater than" condition on an indexed field
  – On three nodes, one join on two tables of different sizes, the two tables being on two different nodes
Benchmarks (stage 2)
• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, with up to 50 simultaneous connections
• Benchmarks
  – Selects with an equal condition on the primary key, the values being uniformly distributed
  – Selects with an equal condition on the primary key, the values being non-uniformly distributed
  – Multiple joins on tables spread over different nodes
Important Dates
Thank you!!!