
Big Data
Tools Overview
Avi Freedman
ServerCentral
Technology Executives Club
November 13, 2013
What is Big Data?
Canonical definition:
• Volume: Billions or trillions of rows
• Variety: Different schemas
• Velocity: Hundreds of thousands of records/sec
Traditional systems have difficulty handling data in these dimensions, even with scale-up, partitioning, and sharding. Clustered/scale-out solutions are required to solve old problems in new ways.

But… that causes problems meeting traditional database integrity requirements.

Our focus will be on open-source and sub-$1 million technology stacks, not traditional BI or Teradata-style "make SQL work" solutions.
Big Data Search Trends
[Google Trends chart: relative search interest over time for "Big Data", "Data Mining", and "Semantics". Source: Google Trends]
Tech Background: Scale-up vs Scale-out
To scale up, you buy a bigger machine. But there are limits to how far you can go…
Scaling out with traditional software designed for single-machine architectures is typically done by:
• Making read replicas (doesn't help with volume or write-heavy workloads);
• Clustering with master/master architectures, which still doesn't help with volume and can increase latency;
• Or sharding and partitioning…
Tech Background: Sharding/Partitioning
When you shard a database, you split it by sets of the data, typically related to the key (so names starting with "A-C" go one place, "D-F" another, etc.). This can be difficult to do manually; see the sketch below.
Partitioning is usually implemented by slicing the database into separate tables (often all on the same machine) by time.
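
As a rough illustration, here is a minimal Python sketch of key-based sharding. The shard names, key ranges, and routing functions are hypothetical, not taken from any particular product:

```python
# Minimal sketch of key-based sharding. Shard names, ranges, and
# routing logic are illustrative only.
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

def range_shard(name):
    """Range sharding: names A-C go to shard0, D-F to shard1, the rest to shard2."""
    first = name[0].upper()
    if "A" <= first <= "C":
        return "shard0"
    if "D" <= first <= "F":
        return "shard1"
    return "shard2"

def hash_shard(key):
    """Hash sharding spreads keys evenly, but makes range scans harder."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(range_shard("Alice"))  # shard0
print(hash_shard("Alice"))   # deterministic for a given key
```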
Tech Background: ACID and CAP
Big Data solutions typically "relax" ACID and are subject to the CAP Theorem.

ACID:
• Atomicity (transactions are all or nothing)
• Consistency (checking the end results)
• Isolation (transactions don't affect each other)
• Durability (transactions, once committed, are forever)

CAP Theorem says you can't have all three of:
• Consistency (all nodes have the same data)
• Availability (every request gets a response)
• Partition tolerance (any part can fail)
Big Data Technologies
• Map/Reduce
  - Hadoop, HPCC
  - (emerging) Streaming databases
• NoSQL
  - Key/Value stores
  - Document/schema-free databases
  - Columnar (Dremel, Impala, Drill)
  - Graph databases (for social media)
• NewSQL
• Revival of classic SQL DBs
NoSQL Introduction
• Not ACID (a problem for… funds transfer, power failures, selling the last item twice)
• High volume
• Clustered
• Partial to full SQL
• No stored procedures
NoSQL: Map/Reduce

• Currently mainly used for batch processing, but streaming is being grafted on.
• Older versions had single points of failure, but newer versions have implemented system-wide redundancy.
• Not ACID, though there is some basic "check and set" functionality in underlying databases.
• There are SQL-like interfaces (Hive and Pig).
• Latency is typically VERY high (minutes) to get query results.
• And to be efficient, the map/reduce processes are usually written in Java, which is an obstacle for use in many environments; see the sketch below.
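
To make the programming model concrete, here is a toy, single-machine word count in Python that mimics what a Hadoop map/reduce job does across a cluster. The function names and data are ours, purely for illustration:

```python
# Toy map/reduce word count, run locally. Hadoop would distribute
# the map and reduce phases across a cluster; function names are ours.
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: group by key and sum, as the shuffle/reduce step would.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big tools", "data at velocity"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'tools': 1, 'at': 1, 'velocity': 1}
```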
NoSQL: Key/Value Store
One of the first examples of "NoSQL" software was the set of systems developed to deal with Key/Value lookups.
In this kind of system, you get to set, delete, or read a key (like "cloud services") and get one value (like "are fun"). Values can be lists or even more complex data structures.
Sample applications:
• Web cookies
• State for massively multiplayer online games
• Real-time ad placement
• Fraud and intrusion detection monitoring
NoSQL: Key/Value Store
Leading packages that implement Key/Value stores are:
• memcached (which clusters but isn't persistent)
• redis (clusters and is persistent to disk; example below)
• riak (clusters, is persistent to disk, and goes up the chain a bit, but is not as performant if disk I/O kicks in)
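
For a flavor of the key/value API, here is a minimal redis example using the redis-py client. The keys, values, and local server address are assumptions for illustration:

```python
# Minimal key/value usage against a redis server on localhost,
# via the redis-py client. Keys and values are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("cloud services", "are fun")           # write a key
print(r.get("cloud services"))               # b'are fun'

r.lpush("recent:user42", "page1", "page2")   # values can be lists, too
print(r.lrange("recent:user42", 0, -1))      # [b'page2', b'page1']

r.delete("cloud services")                   # remove the key
```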
NoSQL: Document DBs
• Related to Key/Value stores:
  - Typically a superset where you get a key, but the value can be a large structured set of data (a "document").
  - Usually have more sophisticated abilities to do pattern-matching lookups (see the sketch below).
• MongoDB is the thought, if not market, leader.
• Riak and Couchbase are second in the space.
• All are still evolving, not perfect, and require some tuning.
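
As a sketch of the document model, here is a minimal pymongo example. The database, collection, and field names are hypothetical, and it assumes a MongoDB instance on localhost:

```python
# Minimal document-database usage with pymongo against a local
# MongoDB instance. Database/collection/field names are made up.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
people = client.demo_db.people

# The "value" is a structured document, not a single scalar.
people.insert_one({
    "name": "Avi",
    "tags": ["big data", "networking"],
    "talks": [{"year": 2013, "topic": "Big Data Tools"}],
})

# Pattern-matching lookup inside the document structure:
# find everyone tagged "big data".
for doc in people.find({"tags": "big data"}):
    print(doc["name"])
```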
NoSQL: Columnar DBs
• Older-generation columnar databases like HBase (part of the Hadoop ecosystem) were clustered but not fast enough to 'move the needle'.
• Newer implementations inspired by Google's Dremel, like Apache Drill and Cloudera Impala, are orders of magnitude faster (in the seconds for some queries), also cluster, and can deal even better with large ingest volumes and variable schemas (see the toy comparison below).
• Apache Drill is just out in alpha, and Impala has yet to achieve the performance of Google's hosted Dremel service.
• But these systems may be the closest to threatening the typical Aster and even core Teradata use cases.
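
To show why the column-oriented layout helps analytic scans, here is a toy Python comparison of row vs column storage. It is purely illustrative; real engines like Impala add compression, vectorized execution, and distribution across a cluster:

```python
# Toy contrast of row-oriented vs column-oriented layout.
# An aggregate over one field only needs that field's array in a
# columnar store, instead of touching every full row.
rows = [{"ts": i, "bytes": i % 1500, "host": "h%d" % (i % 10)}
        for i in range(1_000_000)]

# Row store: every record is visited and fully materialized.
total_rows = sum(r["bytes"] for r in rows)

# Column store: each field lives in its own contiguous array.
columns = {
    "ts":    [r["ts"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
    "host":  [r["host"] for r in rows],
}
total_cols = sum(columns["bytes"])  # touches only the 'bytes' column

assert total_rows == total_cols
```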
NoSQL: Clustered SQL
• Cassandra offers SQL-like (CQL) access and clusters, but is not ACID; a minimal example follows below.
• Used by many web-scale companies.
• Also has a relatively steep learning curve, though there are commercial providers to assist.
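
Here is a minimal sketch of CQL access from Python using the DataStax cassandra-driver. The keyspace, table, and column names are hypothetical, and a single local node is assumed:

```python
# Minimal CQL usage via the DataStax cassandra-driver against a
# local node. Keyspace/table/column names are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.events (id int PRIMARY KEY, payload text)")

session.execute(
    "INSERT INTO demo.events (id, payload) VALUES (%s, %s)", (1, "hello"))

for row in session.execute("SELECT id, payload FROM demo.events"):
    print(row.id, row.payload)
```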
NoSQL: Graph DBs
• Systems like neo4j have evolved to deal with problems that arise in heavily connected data, where one is looking for instances of, or patterns in, the relationships between items (a toy traversal follows below).
• One key space where they are used is in social networks and for evil government projects.
• Very specialized, and we don't see many instances deployed in the enterprise.
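
To show the kind of query a graph database optimizes, here is a toy friend-of-a-friend lookup over a plain Python adjacency list. An engine like neo4j would express this declaratively and index the relationships; the names and edges are invented:

```python
# Toy friend-of-a-friend query over an adjacency list. A graph
# database indexes relationships so traversals like this stay
# fast as the graph grows. Names and edges are made up.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol"},
}

def friends_of_friends(person):
    direct = friends[person]
    fof = set()
    for f in direct:
        fof |= friends[f]
    # Exclude the person and their direct friends.
    return fof - direct - {person}

print(friends_of_friends("alice"))  # {'dave'}
```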
NewSQL
NewSQL was coined recently to describe databases that attempt to cluster (scale out) while maintaining ACID properties.
Two leaders are:
• FoundationDB (currently has a 96-core and 100 TB limit; does not require in-RAM presence)
• VoltDB (doing complex work requires Java skills, and it is costly because all data must fit in RAM across the nodes)
People are watching this space with interest, but many are dubious about how fast these will develop into truly scalable offerings.
Both offer easy access for download and testing in customer environments.
Revival of Classic SQL DBs

• The Microsoft, Oracle/MySQL, Postgres, and MariaDB projects and companies are all thinking about, and implementing, more scale-out functionality.
• Most of the initial approaches seem to be automating the process of sharding and partitioning databases.
• We see people trying this most in the MySQL community, but the vast majority are still sharding and replicating to deal with scale.
Gotchas to Watch For
ACID compliance?
• Do you get transactional 'correctness'?
• Or is the system 'eventually consistent'?
• At high constant volumes, eventual consistency may never catch up (see the toy model below).
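
A toy model of that last point: if writes arrive faster than a replica can apply them, the backlog only grows and the replica never converges. All of the rates and numbers here are invented for illustration:

```python
# Toy model of replication lag under constant load. When the write
# rate exceeds the apply rate, the replica falls further behind
# every second: "eventual" consistency never catches up.
# All rates are invented for illustration.
backlog = 0
for second in range(10):
    backlog += 5000   # writes/sec arriving at the primary
    backlog -= 3000   # writes/sec the replica can apply
    print("t=%ds: replica is %d writes behind" % (second, backlog))
```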
Ease of use
• Non-SQL systems (like Map/Reduce) can be difficult to learn and train for.
• And many Big Data systems can be difficult for DB admins to learn and install.
• Commercial solutions can address this, but can also cost 10x.
Gotchas to Watch For
What are your application's requirements of the data backends?
Application support
• If you don't code your applications yourself, will they support using a Big Data solution on the backend?
• It's sometimes possible to write 'adapters' underneath commercial applications, but this is dangerous, as the applications may change their schemas or methods without notifying users.
Questions?
Avi Freedman
ServerCentral
[email protected]
Technology Executives Club
November 13, 2013