Milestone #1 Report - Cs Team Site | courses.cs.tau.ac.il

Workshop in Information Security – Distributed Databases Project
By: Ilia Oshmiansky, Ainat Chervin and Yosi Barad
Milestone #1 Report
Milestone 1: Completing installations and running initial performance tests
The goals for this milestone were to install the system, which consists of the databases (Cassandra and
Accumulo) and the testing framework (YCSB++), on a single node. Once this was done, we wanted to run
initial tests of Cassandra before adding cell-level security, in order to produce a performance report that we will
use as a baseline for comparison in the following steps.
Overall, we completed the milestone as planned and, moreover, succeeded in extending the system to
two nodes.
This is a significant step given the difficulties we experienced with the installations, and it brings us
closer to the goal of milestone #2, which is running a system consisting of six nodes.
Progress Compared to Plan
Step Plan                                               Status
Install and run Cassandra                               Complete
Install and run YCSB++                                  Complete
Run some initial manual testing of Cassandra            Complete
Connect YCSB++ to Cassandra and run benchmark tests     Complete
Install Accumulo                                        Complete
Step Plan in detail:
Install and run Cassandra – Using the lab computers, we installed Cassandra according to the
documentation found on the Apache Cassandra website. At the moment, the Cassandra database is configured to
run in two different modes.
The first configuration is a single cluster consisting of one node, which manages all the keys and values in the database.
The second configuration is a single cluster consisting of two nodes that share the keys and values, with each node
managing and storing 50% of the database.
Install and run YCSB++ – We built the YCSB++ source code with Maven, following the
instructions in the build file supplied with the project code. We ran YCSB++ against the "basic" database, which is
supplied as one of the YCSB++ database bindings, in order to practice using the benchmark framework and its features.
Initial manual testing of Cassandra – We used the Cassandra client shell to create keyspaces and column families,
add a column within a family, and store and retrieve key names and values.
Cassandra reports statistics for these manual operations, so we could get an idea of how much time each operation
takes.
Connect YCSB++ to Cassandra and run benchmark tests – We used the Cassandra-10 client binding supplied with
YCSB++ to connect to the Cassandra database. We ran some of the core benchmark tests; the results
are detailed later in this document.
Install Accumulo – We installed, configured and ran Apache ZooKeeper and Apache Hadoop, as they are
prerequisites for the Accumulo database. We configured Accumulo according to its manual and
ran the Accumulo client shell in order to test its framework.
Exact science Faculty - Tel Aviv University
Initial Tests Description & Results
For the basic tests, we used the core workloads that are included with the YCSB installation and ran each of them
8 times, increasing the number of threads each time.
The core workloads consist of six different workloads:
Workload A: Update heavy workload - This workload has a mix of 50/50 reads and writes.
Workload B: Read mostly workload - This workload has a 95/5 read/write mix.
Workload C: Read only - This workload is 100% read.
Workload D: Read latest workload - In this workload, new records are inserted, and the most recently inserted
records are the most popular.
Workload E: Short ranges - In this workload, short ranges of records are queried, instead of individual records.
Workload F: Read-modify-write - In this workload, the client will read a record, modify it, and write back the
changes.
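The operation mixes above can be illustrated with a minimal Python sketch. It covers only workloads A, B and C, since these are the only ones whose ratios are stated in this report. The names `WORKLOAD_MIX` and `choose_operations` and the operation labels `read` and `update` are hypothetical and chosen for illustration; this is not the YCSB++ implementation.

```python
import random

# Operation mixes taken from the workload descriptions above.
# Only A, B and C have explicit ratios in this report; D, E and F are omitted.
# The labels "read" and "update" are illustrative assumptions.
WORKLOAD_MIX = {
    "A": {"read": 0.50, "update": 0.50},
    "B": {"read": 0.95, "update": 0.05},
    "C": {"read": 1.00},
}


def choose_operations(workload, count, seed=0):
    """Draw `count` operation names according to the workload's mix."""
    mix = WORKLOAD_MIX[workload]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=count)


if __name__ == "__main__":
    for name in sorted(WORKLOAD_MIX):
        ops = choose_operations(name, 1000)
        counts = {op: ops.count(op) for op in WORKLOAD_MIX[name]}
        print(name, counts)
```

With 1000 operations per run, the observed counts for A and B will be close to, but usually not exactly, the nominal ratios, because each operation is drawn at random.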
First we ran the tests from one client PC against a Cassandra server consisting of a single node; next we added another
Cassandra node and repeated the same tests.
We ran these tests for three reasons:
1) Establish a baseline against which future results (after the implementation of cell-level ACL) will be judged.
2) Establish the maximal throughput of Cassandra on a single node.
3) Compare the performance of Cassandra with one node to Cassandra with two nodes.
We increased the number of threads until we got a stable result. This was done in order to determine the actual
throughput of the server.
All the tests were run with 1000 operations (reads, writes, scans or a combination). Throughputs are given in ops/sec.
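The procedure of raising the thread count until the throughput stabilizes can be sketched as follows. This is a minimal illustration, assuming a 5% tolerance as the definition of "stable"; the helper names `throughput` and `find_plateau`, the tolerance and the sample numbers are all hypothetical and are not measurements from our tests.

```python
def throughput(ops, seconds):
    """Throughput in ops/sec for a run of `ops` operations."""
    return ops / seconds


def find_plateau(results, tolerance=0.05):
    """Find the point where adding threads stops helping.

    `results` is a list of (threads, ops_per_sec) pairs sorted by thread
    count. Returns the first pair whose successor improves the throughput
    by less than `tolerance` (relative), or None if it never levels off.
    """
    for (threads, current), (_, following) in zip(results, results[1:]):
        if current > 0 and (following - current) / current < tolerance:
            return threads, current
    return None


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    sample = [(1, 100.0), (2, 190.0), (4, 350.0),
              (8, 600.0), (16, 620.0), (32, 625.0)]
    print(find_plateau(sample))
```

If `find_plateau` returns None, the throughput is still growing at the highest thread count tested, so the server's limit has not been reached yet.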
The results we got are described in the following graphs:
a) Results with one Cassandra node:
From these results (Workload C) we can gather that the current maximal read throughput is around
3000 ops/sec.
b) Results with two Cassandra nodes compared to one node:
When running these tests with Cassandra on two nodes, we noticed a general degradation in performance, probably due to the
synchronization overhead between the two nodes. More work is needed in order to explain these results (see
the plans for the next stage below).
Plans Ahead
Having completed milestone one successfully, we plan to move on to the next stage.
This means we are going to advance in 3 different directions:
1) Get into the Cassandra code and start the cell-level ACL implementation
We have already started discussing implementation options with the advice of Alex from IBM.
There are two main options:
a) Using the Cloud Data Management Interface that provides an API to configure the NFSv4 ACL
(commonly used in Linux systems) using JSON strings. These strings are sent as part of the HTTP
requests; in our case they can be stored in Cassandra.
b) Adding simple strings such as "(Alice, rx) (Bob, rwxo) (Charlie, rx) ...", which we can store in Cassandra as is;
when Alice tries to read a file from Cassandra, we check that the ACL allows her to do so.
We will look into both options, but we feel that the first option requires greater knowledge of the ACL format and
of working with JSON strings, and that it may add unnecessary overhead, which we wish to avoid since it is less
important to our project. Therefore, we will first explore the second option and proceed from there.
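To make the second option more concrete, the following minimal Python sketch parses an ACL string of the form shown in option (b) and checks a single permission. The function names `parse_acl` and `is_allowed` are hypothetical, each permission letter is treated as an opaque flag, and the eventual implementation inside Cassandra may look quite different.

```python
import re

# Matches one "(user, flags)" entry, e.g. "(Alice, rx)".
ENTRY_PATTERN = re.compile(r"\(\s*([^,()]+?)\s*,\s*([a-z]+)\s*\)")


def parse_acl(acl_string):
    """Turn '(Alice, rx) (Bob, rwxo)' into {'Alice': {'r', 'x'}, ...}."""
    return {user: set(flags)
            for user, flags in ENTRY_PATTERN.findall(acl_string)}


def is_allowed(acl_string, user, permission):
    """Return True if `user` holds the single-letter `permission` flag."""
    return permission in parse_acl(acl_string).get(user, set())


if __name__ == "__main__":
    acl = "(Alice, rx) (Bob, rwxo) (Charlie, rx)"
    print(is_allowed(acl, "Alice", "r"))  # Alice's entry contains 'r'
    print(is_allowed(acl, "Alice", "w"))  # Alice's entry has no 'w'
```

In this sketch a user without an entry is denied every permission, which is one possible default; the default policy is still an open design decision.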
2) Extend our Accumulo and Cassandra setups to include several clusters
This stage is critical for obtaining meaningful test results and for finding security holes in the later
stages.
3) Improve our testing environment
This stage includes the following:
a) Write custom tests for our specific ACL implementations.
b) Improve the testing environment by running the tests from several PCs simultaneously to make
sure the test PC is not the limiting factor.
c) Edit the test configurations to support the tests we planned to use according to our test plan (see
section 'Testing scenarios' in the Project Plan document).
d) Run diverse tests to understand the limiting factor in each test (this might be the testing equipment,
CPU time, disk I/O, network limitations, synchronization overhead between nodes and
more) and, if possible, change the setup to eliminate this limiting factor.
e) Analyze the CPU and disk usage of the machines to understand the results better.