Cassandra and the SIGMOD Contest
Cloud Computing Group
Haiping Wang
2009-12-19

Outline
• Cassandra
  – Cassandra overview
  – Data model
  – Architecture
  – Read and write
• SIGMOD contest 2009
• SIGMOD contest 2010

Cassandra overview
• Highly scalable, distributed
• Eventually consistent
• Structured key-value store
• Dynamo + Bigtable
• P2P
• Random reads and random writes
• Written in Java

Data model
• ColumnFamily1: Name: MailList, Type: Simple, Sort: Name
  – Columns tid1 .. tid4, each with a binary value and a timestamp t1 .. t4
• ColumnFamily2: Name: WordList, Type: Super, Sort: Time
  – SuperColumns such as "aloha" and "dude", each holding columns (C1/V1/T1, C2/V2/T2, ...)
• ColumnFamily3: Name: System, Type: Super, Sort: Name
  – SuperColumns hint1 .. hint4, each with a column list
• Column families are declared upfront; columns and SuperColumns are added and modified dynamically

Cassandra architecture
[Figure: Cassandra architecture]

Cassandra API
• Data structures
• Exceptions
• Service API
  – ConsistencyLevel (4)
  – Retrieval methods (5)
  – Range query: returns matching keys (1)
  – Modification methods (3)
• Others: Cassandra commands

Partitioning and replication (1)
• Consistent hashing (a DHT), with the classic properties: balance, monotonicity, spread, load
• Virtual nodes
• Coordinator
• Preference list

Partitioning and replication (2)
[Figure: a consistent-hashing ring over [0, 1); h(key1) and h(key2) place keys on the ring among nodes A to F; with N = 3, each key is stored on the next three nodes clockwise from its position]

Data versioning
• Always writeable
• Multiple versions
  – put() returns before all replicas are updated
  – get() may see many versions
• Vector clocks
• Reconciliation during reads, by clients

Vector clock
• A list of (node, counter) pairs
  – E.g. [x,2][y,3] vs. [x,3][y,4][z,1]: the second clock dominates, so the second version supersedes the first
  – [x,1][y,3] vs. [z,1][y,3]: neither clock dominates, so the versions conflict
• Entries can additionally be tagged with timestamps, e.g. D([x,1]:t1, [y,1]:t2)
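The dominance test in the examples above can be sketched as follows (a minimal illustration; the helper names are mine, not Cassandra's):

```python
def descends(a, b):
    """True if clock a descends from clock b, i.e. a >= b entry-wise.

    Clocks are dicts mapping node name -> counter; a node missing
    from a clock counts as 0. (Illustrative helper only.)
    """
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflict(a, b):
    # Neither clock descends from the other: concurrent versions
    # that a client must reconcile at read time.
    return not descends(a, b) and not descends(b, a)

# The slide's examples:
assert descends({"x": 3, "y": 4, "z": 1}, {"x": 2, "y": 3})
assert conflict({"x": 1, "y": 3}, {"z": 1, "y": 3})
```

On a write, the coordinating node increments its own entry in the clock; on a read, all versions whose clocks mutually conflict are returned to the client.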
• The oldest entry is removed when the clock reaches a size threshold

Vector clock
[Figure: a read returns all the objects at the leaves of the version tree, e.g. D3,4 with clock ([Sx,2],[Sy,1],[Sz,1]); the client reconciles them into a single new version]

Execution of operations
• Two strategies for routing a request
  – A generic load balancer that picks a node based on load: easy, and the client does not have to link any node-specific code
  – Routing directly to the coordinator node for the key: achieves lower latency

put() operation
[Figure: the client sends the object, with its vector clock, to the coordinator; the coordinator forwards it to replicas P1 .. PN-1 and waits for W-1 responses]

Cluster membership
• Gossip protocol
• State is disseminated in O(log N) rounds
• Every T seconds a node increments its heartbeat counter and sends its membership list to another node
• The receiving node merges the lists

Failure handling
• Data center failure: survived by replicating across multiple data centers
• Temporary failure
• Permanent failure: repaired using Merkle trees

Temporary failure
[Figure]

Merkle tree
[Figure]

Bloom filter
• A space-efficient probabilistic data structure used to test whether an element is a member of a set
• May return false positives

Compactions
[Figure: several sorted data files (e.g. K1/K2/K3, K2/K4/K10 with K10 deleted, K5/K10/K30) are combined by a merge sort into one sorted data file; the merge also emits an index file of key offsets (K1, K5, K30), loaded in memory, and a Bloom filter]

Write path
• A write to a key spanning column families (CF1, CF2, CF3) is appended to the commit log, kept on a dedicated disk, and applied to one memtable per column family
• A memtable is flushed to disk based on its data size, its number of objects, or its lifetime
• A flush binary-serializes the memtable into a sorted data file: <key name><size of key data><index of columns/supercolumns><serialized column family>
• Each data file carries a block index (<key name> offset, ...), a sparse key index held in memory (K128 offset, K256 offset, K384 offset, ...) and a Bloom filter

Read path
• The client sends a query to the Cassandra cluster and receives the result from the closest replica
• The other replicas are sent a digest query and return digest responses
• Read repair is triggered if the digests differ
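A minimal sketch of this digest-compare-and-repair read, using a toy in-memory stand-in for the replicas (names and structures are mine; real Cassandra exchanges digests over its RPC layer):

```python
import hashlib

def digest(value: bytes) -> str:
    # A replica can return a hash of the value instead of the value itself.
    return hashlib.md5(value).hexdigest()

def read(replicas: dict, key: str) -> bytes:
    """replicas: node name -> {key: (value, timestamp)} (toy stand-in)."""
    nodes = list(replicas)
    closest, others = nodes[0], nodes[1:]
    value, ts = replicas[closest][key]          # full read from the closest replica
    # Compare the closest replica's data against the others' digests.
    stale = [n for n in others
             if digest(replicas[n][key][0]) != digest(value)]
    if stale:
        # Read repair: push the newest version to every replica.
        newest = max((replicas[n][key] for n in nodes), key=lambda vt: vt[1])
        for n in nodes:
            replicas[n][key] = newest
        value, ts = newest
    return value
```

Digests keep the common case cheap: only when a mismatch reveals a stale replica does the coordinator pay for shipping full values.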
[Figure: read path; replica A, the closest, returns the full result, while replicas B and C answer the digest query with digest responses]

SIGMOD Contest 2009
• Task overview
• API
• Data structures
• Architecture
• Tests

Task overview
• An index system for main-memory data
• Runs on a multi-core machine
• Many threads operate over multiple indices
• Execution of user-specified transactions is serialized
• Basic functions: exact-match queries, range queries, updates, inserts, deletes

API and data structures
• A record holds a key (dataType key; int64_t hashKey; char *payload) and a next pointer for chaining
• HashTable: hsize buckets (0 .. size-1), each bucket chaining keys; fields include size, hashTab, average (64), deviation, nbEl, domain, warpMode (bool)
• HashShared stores int-typed data; a NameIndex entry (ni, idx, str) maps an index name to an object of type IdxState
• TxnState: state, indexActive, indexToReset, nbIndex, th, iNbR (200-entry arrays)

IdxState
• Keeps track of one index
• Created by openIndex(), destroyed by closeIndex()
• Inherited by IdxStateType
• Contains pointers to
  – a hash table
  – a FixedAllocator
  – an Allocator
  – an array with the type of each action

Architecture
• Components: DeadLockDetector, IndexManager, Transactor, Allocator
• IndexManager: hs, nbIndexTab, and an indexTab array (indexTab[0], indexTab[1], ...)
• Transactor: a HashOnlyGet object of type TxnState; each entry carries an id, a mutex, an iThread and an nbElement field
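A much-simplified sketch of such a chained hash-table index (the contest entries are C/C++ and lock-protected; the bucket scheme and method names here are illustrative only):

```python
class HashIndex:
    """Chained hash table mapping int64 keys to payload strings."""

    def __init__(self, hsize=1024):
        self.hsize = hsize
        self.buckets = [[] for _ in range(hsize)]  # each bucket: list of (key, payload)

    def _bucket(self, key):
        return self.buckets[hash(key) % self.hsize]

    def insert(self, key, payload):
        self._bucket(key).append((key, payload))

    def get(self, key):
        # Exact-match query: walk the chain of this key's bucket.
        for k, p in self._bucket(key):
            if k == key:
                return p
        return None

    def delete(self, key):
        b = self._bucket(key)
        b[:] = [(k, p) for k, p in b if k != key]

    def range(self, lo, hi):
        # Range query: a plain hash table has no key order,
        # so it must scan every bucket.
        return sorted((k, p) for b in self.buckets for k, p in b if lo <= k <= hi)
```

The design trade-off the contest exposes: a hash table answers exact-match queries in expected O(1), but range queries degenerate to a full scan, and concurrent transactions additionally need the per-index locking handled by the Transactor and DeadLockDetector.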
Allocator
• Allocates the memory for the payloads
• Uses pools and linked lists
• Pools are sized up to the maximum payload length, 100
• Payloads of the same size are kept in the same list

Unit tests
• Three threads, running over three indices
• The primary thread
  – creates the primary index
  – inserts, deletes and accesses data in the primary index
• The second thread
  – simultaneously runs some basic tests over a separate index
• The third thread
  – continuously queries the primary index
  – ensures the transactional guarantees

SIGMOD Contest 2010

Task overview
• Implement a simple distributed query executor with the help of the in-memory index
• Given centralized query plans, translate them into distributed query plans
• Given a parsed SQL query, return the correct results
• Data is stored on disk; the indexes are all in memory
• The total time cost is measured

SQL query form
SELECT alias_name.field_name, ...
FROM table_name AS alias_name, ...
WHERE condition1 AND ... AND conditionN

Conditions
• alias_name.field_name = fixed value
• alias_name.field_name > fixed value
• alias_name.field_name1 = alias_name.field_name2

Phases
• Initialization phase
• Connection phase
• Query phase
• Closing phase

Tests
• An initial computation
• On synthetic and real-world datasets
• Tested on a single machine
• Tested on an ad-hoc cluster of peers
• Teams that pass a collection of unit tests are provided with an Amazon Web Services account worth 100 USD

Benchmarks (stage 1)
• Assume a partition always covers the entire table; the data is not replicated
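To make the supported query shapes concrete, here is a toy single-node evaluator for an equality select and one hash join (table and column names are invented for illustration):

```python
# Toy tables: each is a list of rows, each row a dict.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"uid": 1, "item": "pen"}, {"uid": 1, "item": "ink"}, {"uid": 3, "item": "cup"}]

def select_eq(table, field, value):
    # WHERE alias.field = fixed value
    return [row for row in table if row[field] == value]

def hash_join(left, lfield, right, rfield):
    # WHERE l.lfield = r.rfield, via a hash table built on one side
    index = {}
    for row in left:
        index.setdefault(row[lfield], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[rfield], [])]

# SELECT u.name, o.item FROM users AS u, orders AS o
# WHERE u.id = o.uid AND u.id = 1
joined = hash_join(users, "id", orders, "uid")
result = [(row["name"], row["item"]) for row in select_eq(joined, "id", 1)]
```

In the distributed setting the same operators apply, but the plan must decide which node builds the hash table and which node ships its rows, which is exactly what the translation from centralized to distributed query plans controls.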
• Unit tests
• Benchmarks
  – On a single node, selects with an equality condition on the primary key
  – On a single node, selects with an equality condition on an indexed field
  – On a single node, 2 to 5 joins on tables of different sizes
  – On a single node, 1 join and a "greater than" condition on an indexed field
  – On three nodes, one join on two tables of different sizes, the two tables residing on two different nodes

Benchmarks (stage 2)
• Tables are now stored on multiple nodes
• Part of a table, or the whole table, may be replicated on multiple nodes
• Queries will be sent in parallel, up to 50 simultaneous connections
• Benchmarks
  – Selects with an equality condition on the primary key, the values being uniformly distributed
  – Selects with an equality condition on the primary key, the values being non-uniformly distributed
  – Multiple joins on tables spread over different nodes

Important dates

Thank you!!!