DOAG Big Data Spark SQL

DOAG – Big Data
Big Data Spark SQL-Hadoop: How Do I Access my Data?
Jan Ott, Trivadis Business Intelligence
BASEL BERN BRUGG DÜSSELDORF
HAMBURG COPENHAGEN LAUSANNE
FRANKFURT A.M. FREIBURG I.BR. GENEVA
MUNICH STUTTGART VIENNA ZURICH
Who am I?
Jan Ott
Senior Consultant
Roles:
Oracle Business Intelligence
Data Scientist
Trainer
Trivadis AG
Sägereistrasse 29
CH-8152 Glattbrugg
Big Data Spark SQL-Hadoop
Agenda
1. Introduction
2. First Steps in the Spark World
3. Projects
4. Summary
Introduction
Introduction
A few words about Big Data
– Big Data
– Hadoop
– Why Spark, Spark SQL?
Spark SQL – my first steps
– Get some data into Hadoop
– Tables in Spark - Hive
– Use SQL
– Miscellaneous
Project – Twitter
Big Data: Introduction
Big Data
– Turning Data into Insights
Hadoop and its Zoo
– HDFS – MapReduce
– SQL – Impala, HBase, Hive, …
– Zookeeper
– Spark and Spark SQL
NoSQL Databases
Architecture
– LAMBDA
What is Spark
Apache Spark™ (headline on the Apache website)
– A fast and general engine for large-scale data processing
Spark (Wikipedia)
– Cluster computing framework
– Interface for programming entire clusters
– Implicit data parallelism
– Fault-tolerance
– An Apache Open Source Project
– Originally developed at UC Berkeley
Goal
– Lightning-fast cluster computing
Performance
– Up to 10x faster than Hadoop MapReduce on disk, up to 100x in memory
What is Spark (2)
Spark Parts
– Core
– SQL
– MLlib – machine learning
– Streaming
– GraphX
Spark SQL and HIVE
– Working with structured data
– SQL inside Spark programs
– HIVE metadata store
– JDBC/ODBC
What is Spark (3)
Runs everywhere
– Hadoop HDFS - YARN
– Mesos
– Cassandra
– HBase
– S3
– …
What is Spark (4)
Running on YARN
– Spark Driver
– Spark Application Master
– Spark Executor
What is Hadoop
A file system – HDFS
– Based on papers from Google
– Apache Open Source Project
Goal
– Fast
– Handles huge amounts of data
– Handles unstructured to fully structured data
– Horizontally scalable
– Reliable
First Steps in the Spark World
First Steps
Keep it simple
Get some data into Hadoop
Get some data into Spark - Hive
Java – keep it to a minimum
Keep the data small
Get an environment that is already set up
– Google Cloud – Big Data
– Pick one way to get the data into Spark - Hive
See SQL run on an HDFS system with Spark
Pre-Requisite – Environment
Google Cloud – Big Data
– Web Browser
– https://console.cloud.google.com/
Contains
– Hadoop
– Hive
– Spark
Google Cloud Platform – Big Data
The Steps – simple – focus
– HIVE table
– SQL query
Step 1 – Data – 2 files
emp.txt and dept.txt
– Comma delimited
– Flat file
– Format the date so it fits the standard date format
• YYYY-MM-DD HH24:MI:SS.XXXX
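The date reformatting step could be sketched as follows. This is only an illustration: the slides give the target layout but not how the raw dates looked, so the day-first input format and the helper name to_standard_timestamp are assumptions, and the fractional-seconds part is left out.

```python
from datetime import datetime

# Assumed input format; the slides do not say how the raw dates looked.
SOURCE_FORMAT = "%d.%m.%Y"
# Target layout from the slide: YYYY-MM-DD HH24:MI:SS (fraction omitted here).
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_standard_timestamp(raw: str) -> str:
    """Convert a day-first date string into the standard timestamp layout."""
    parsed = datetime.strptime(raw.strip(), SOURCE_FORMAT)
    return parsed.strftime(TARGET_FORMAT)


print(to_standard_timestamp("02.02.1968"))  # 1968-02-02 00:00:00
```

Dates that carry no time of day come out with 00:00:00, which matches the sample rows used in the demo.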
Step 1 – Data
dept.txt
1, IT Department, New York
2, Human Resource, Berlin
3, Development, Basel
4, Sales, London
5, Finanze, Paris
emp
1 Hans    Meier    3000 1968-02-02 00:00:00 2000-01-01 00:00:00 1
2 Stefan  Müller   5000 1970-10-15 00:00:00 2001-07-01 00:00:00 1
3 Susanne Kieser   3500 1972-03-14 00:00:00 2005-05-01 00:00:00 2
4 Paul    Steiner  4000 1960-07-28 00:00:00 2000-01-01 00:00:00 2
5 Monika  Hausmann 7000 1975-03-29 00:00:00 2000-01-01 00:00:00 3
...
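To get a feel for the kind of SQL query that can be run against these two tables, here is a small sketch. It does not use Spark: Python's built-in sqlite3 module stands in as the SQL engine so the example runs anywhere. The column names are assumptions (the slides show no headers), and only the five sample rows shown for each file are loaded.

```python
import sqlite3

# Rows copied from the slide; all column names below are assumptions.
DEPT_ROWS = [
    (1, "IT Department", "New York"),
    (2, "Human Resource", "Berlin"),
    (3, "Development", "Basel"),
    (4, "Sales", "London"),
    (5, "Finanze", "Paris"),
]
EMP_ROWS = [
    (1, "Hans", "Meier", 3000, "1968-02-02 00:00:00", "2000-01-01 00:00:00", 1),
    (2, "Stefan", "Müller", 5000, "1970-10-15 00:00:00", "2001-07-01 00:00:00", 1),
    (3, "Susanne", "Kieser", 3500, "1972-03-14 00:00:00", "2005-05-01 00:00:00", 2),
    (4, "Paul", "Steiner", 4000, "1960-07-28 00:00:00", "2000-01-01 00:00:00", 2),
    (5, "Monika", "Hausmann", 7000, "1975-03-29 00:00:00", "2000-01-01 00:00:00", 3),
]


def salary_per_department():
    """Join emp to dept and total the salaries per department."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Spark SQL
    conn.execute("CREATE TABLE dept (dept_id INTEGER, dept_name TEXT, city TEXT)")
    conn.execute(
        "CREATE TABLE emp (emp_id INTEGER, first_name TEXT, last_name TEXT, "
        "salary INTEGER, birth_date TEXT, hire_date TEXT, dept_id INTEGER)"
    )
    conn.executemany("INSERT INTO dept VALUES (?, ?, ?)", DEPT_ROWS)
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?, ?, ?, ?, ?)", EMP_ROWS)
    rows = conn.execute(
        """
        SELECT d.dept_name, COUNT(*) AS employees, SUM(e.salary) AS total_salary
        FROM emp e
        JOIN dept d ON e.dept_id = d.dept_id
        GROUP BY d.dept_name
        ORDER BY d.dept_name
        """
    ).fetchall()
    conn.close()
    return rows


for row in salary_per_department():
    print(row)
```

The SELECT itself is plain SQL with a join and an aggregate, so a query of this shape should carry over to the Spark SQL CLI once the two files are available as tables there.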
DEMO
Google Cloud Big Data
– Spark SQL – CLI
– Spark / HIVE / Hadoop
Projects
Project – Figures
400–500 million tweets per day
1 tweet contains
– Around 50 metadata pieces
• Geo-location
• Re-tweets
• Followers
– That is about 2 A4 pages
Twitter Sample Stream
– 1%
– 4–5 million tweets per day
– 50 tweets per second
20 other streams with defined keywords
HDFS
– 1 TB every 2 months including replication
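The sample-stream figures can be checked with a little arithmetic. The sketch below only restates the numbers from this slide; the helper name per_second is made up for the illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(tweets_per_day: int) -> float:
    """Average rate per second for a given daily volume."""
    return tweets_per_day / SECONDS_PER_DAY


# Full stream: 400-500 million tweets per day (figures from the slide).
full_stream = (400_000_000, 500_000_000)

# The sample stream is 1% of the full stream.
sample = tuple(n // 100 for n in full_stream)

# Average tweets per second in the sample stream.
rates = tuple(per_second(n) for n in sample)

print(sample)                     # (4000000, 5000000)
print([round(r) for r in rates])  # [46, 58]
```

So 1% of the full stream is 4 to 5 million tweets per day, or roughly 46 to 58 tweets per second on average, which is consistent with the rounded figure of 50 per second.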
The Lambda Architecture - adopted
Ingestion
– Twitter API, Java app, messaging (Kafka)
Batch layer
– All data (HDFS)
– Batch (re)compute (Hadoop, Spark)
– Pre-computed batch views: QFD1, QFD2, … QFDn
Speed layer
– Stream (Storm)
– Process real-time increment
– Incremented real-time views: QFD1, QFD2, … QFDn
Stores shown for the views
– Impala, Cassandra
Serving layer
– Query & merge
Consumer layer
– REST client
– Web app
QFD = Query Focused Data
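The query & merge step of the serving layer can be illustrated with a small sketch: a pre-computed batch view is combined with the increments collected by the speed layer since the last batch run. The function name, the hashtag keys and the counts are all made up for the illustration; the slides do not show how the project implemented this step.

```python
from collections import Counter


def query_and_merge(batch_view: dict, realtime_view: dict) -> dict:
    """Add the real-time increments on top of the pre-computed batch counts."""
    merged = Counter(batch_view)
    merged.update(realtime_view)  # Counter.update adds counts per key
    return dict(merged)


# Hypothetical views: tweet counts per hashtag.
batch_view = {"#spark": 1200, "#hadoop": 800}  # from the last batch run
realtime_view = {"#spark": 15, "#hive": 3}     # increments since then

print(query_and_merge(batch_view, realtime_view))
```

The consumer only ever sees the merged result, so the batch views can be recomputed from all data while the real-time views cover the gap in between.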
Summary
Summary
A new world
Spark, Hive, Hadoop and … it’s a zoo
– Oracle Big Data Lite VM – CDH 5.5.1 – Spark 1.5 – Spark SQL CLI does not run
– Cloudera VM – CDH 5.5.0.2 – Spark 1.5 – Spark SQL CLI not installed
– Installing it myself into these VMs … not a good idea
– Google – Version 1 – Spark 1.6 contains Spark SQL CLI
A lot can be done with an RDBMS
Start collecting data now
Why Spark - SQL
SQL
– Known
– Analysts can use it
JDBC
– Various tools can connect and use it
No programming needed
Speed !
– Ad hoc
– Batch
It is IN MEMORY – no limit – spills to disk
Sources
Spark
– https://spark.apache.org
Oracle VM – Big Data Lite
– http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite2104726.html
Books:
– Big Data – MEAP by Nathan Marz
– Spark Cookbook by Rishi Yadav
– Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau
Pictures
– Oracle.com
– Twitter.com
– Apache.org
– Cloudera.com
Jan Ott
Senior Consultant Zurich BI
Tel. +41 58 459 51 35