DOAG – Big Data
Big Data Spark SQL-Hadoop: How Do I Access my Data?
Jan Ott, Trivadis Business Intelligence
BASEL BERN BRUGG DÜSSELDORF HAMBURG COPENHAGEN LAUSANNE FRANKFURT A.M. FREIBURG I.BR. GENEVA MUNICH STUTTGART VIENNA ZURICH

Who am I?
Jan Ott
Senior Consultant
Roles:
– Oracle Business Intelligence
– Data Scientist
– Trainer
Trivadis AG, Sägereistrasse 29, CH-8152 Glattbrugg
[email protected]

Agenda
1. Introduction
2. First Steps in the Spark World
3. Projects
4. Summary

Introduction
A few words about Big Data
– Big Data
– Hadoop
– Why Spark, Spark SQL?
Spark SQL – my first steps
– Get some data into Hadoop
– Tables in Spark / Hive
– Use SQL
Various projects
– Twitter

Big Data: Introduction
Big Data
– Turning Data into Insights
Hadoop and its Zoo
– HDFS
– MapReduce
– SQL: Impala, HBase, Hive, …
– Zookeeper
– Spark and Spark SQL
NoSQL Databases
Architecture
– LAMBDA

What is Spark
Apache Spark™ (Apache website headline) is a fast and general engine for large-scale data processing.
Spark (Wiki)
– Cluster computing framework
Spark
– Interface for programming entire clusters
– Implicit data parallelism
– Fault tolerance
– An Apache Open Source Project
– Originally developed at UC Berkeley
Goal
– Lightning-fast cluster computing
Performance (compared with Hadoop MapReduce)
– Up to 10x faster on disk
– Up to 100x faster in memory

What is Spark (2)
Spark Parts
– Core
– SQL
– MLlib (machine learning)
– Streaming
– GraphX
Spark SQL and Hive
– Working with structured data
– SQL inside Spark programs
– Hive metadata store
– JDBC/ODBC

What is Spark (3)
Runs everywhere
– Hadoop HDFS / YARN
– Mesos
– Cassandra
– HBase
– S3
– …

What is Spark (4)
Running on YARN
– Spark Driver
– Spark Application Master
– Spark Executor

What is Hadoop
A file system
– HDFS
– Based on papers from Google
– Apache Open Source Project
Goal
– Fast
– Handles huge amounts of data
– Handles unstructured to fully structured data
– Horizontally scalable
– Reliable

First Steps in the Spark World

First Steps
Keep it simple
– Get some data into Hadoop
– Get some data into Spark / Hive
– Java: keep it to a minimum
– Data: small
Get an environment that is already set up
– Google Cloud – Big Data
– Pick one way to get the data into Spark / Hive
See SQL on an HDFS system with Spark

Prerequisite – Environment
Google Cloud – Big Data
– Web browser
– https://console.cloud.google.com/
Contains
– Hadoop
– Hive
– Spark

Google Cloud Platform Big Data
[Screenshot of the Google Cloud Platform console]

The Steps – simple – focus
[Diagram; recoverable labels: Hive Table, SQL Query]

Step 1 – Data
– 2 files: emp.txt and dept.txt
– Comma delimited
– Flat file
– Format the dates so they fit the standard date format
  • YYYY-MM-DD HH24:MI:SS.XXXX

Step 1 – Data
dept.txt
1, IT Department, New York
2, Human Resource, Berlin
3, Development, Basel
4, Sales, London
5, Finanze, Paris

emp
1  Hans     Meier     3000  1968-02-02 00:00:00  2000-01-01 00:00:00  1
2  Stefan   Müller    5000  1970-10-15 00:00:00  2001-07-01 00:00:00  1
3  Susanne  Kieser    3500  1972-03-14 00:00:00  2005-05-01 00:00:00  2
4  Paul     Steiner   4000  1960-07-28 00:00:00  2000-01-01 00:00:00  2
5  Monika   Hausmann  7000  1975-03-29 00:00:00  2000-01-01 00:00:00  3
...

DEMO
Google Cloud Big Data
– Spark SQL – CLI
– Spark / Hive / Hadoop

Projects

Project – Figures
400–500 million tweets per day
1 tweet contains
– Around 50 pieces of metadata
  • Geo-location
  • Re-tweets
  • Followers
– That is about 2 A4 pages
Twitter Sample Stream
– 1% of all tweets
– 4–5 million tweets per day
– 50 tweets per second
20 other streams with defined keywords
HDFS
– 1 TB every 2 months, including replication

The Lambda Architecture – adopted
[Architecture diagram. Layers: Batch layer, Speed layer, Serving layer, Consumer layer. Component labels: Twitter API, Java App, Messaging (Kafka), All Data (HDFS), Batch (re)compute, Hadoop, Pre-computed Views (Spark), Batch views QFD1, QFD2, … QFDn, Process Stream, Realtime Increment, Storm, Incremented Views, Realtime views QFD1, QFD2, … QFDn, Impala, Cassandra, …, Query & Merge, REST Client, Web App. QFD = Query Focused Data]

Summary

Summary
A new world
Spark, Hive, Hadoop and … it's a zoo
– VM Oracle Big Data Lite – CDH 5.5.1 – Spark 1.5 – Spark SQL CLI does not run
– VM Cloudera – CDH5.5.0.2 – Spark 1.5 – Spark SQL CLI not installed
– Installing it myself into these VMs … not a good idea
– Google – Version 1 – Spark 1.6 contains the Spark SQL CLI
Lots can be done with an RDBMS
Start to collect now

Why Spark SQL
SQL
– Known
– Analysts can use it
JDBC
– Various tools can connect and use it
No programming needed
Speed!
– Ad hoc
– Batch
It is IN MEMORY
– No limit: spills to disk

Sources
Spark
– https://spark.apache.org
Oracle VM – Big Data Lite
– http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite2104726.html
Books
– Big Data – MEAP by Nathan Marz
– Spark Cookbook by Rishi Yadav
– Learning Spark by Matei Zaharia, Patrick Wendell, Andy Konwinski, Holden Karau
Pictures
– Oracle.com
– Twitter.com
– Apache.com
– Cloudera.com

Jan Ott
Senior Consultant Zurich BI
Tel. +41 58 459 51 35
[email protected]
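
Appendix sketch for Step 1 – Data: the slide asks for comma-delimited flat files whose dates follow the standard YYYY-MM-DD HH24:MI:SS layout. The snippet below is a minimal sketch of that preparation step, not the script used in the demo. It assumes the raw dates arrive as DD.MM.YYYY (a hypothetical input format), it omits the optional fractional seconds, and the variable names for the columns are guesses, since the slide shows the emp data without headers.

```python
import csv
import io
from datetime import datetime

# Timestamp layout from the slide, without fractional seconds: YYYY-MM-DD HH:MM:SS
HIVE_TIMESTAMP = "%Y-%m-%d %H:%M:%S"


def to_hive_timestamp(value, source_format="%d.%m.%Y"):
    """Convert a date string from source_format (an assumed input layout)
    to the timestamp layout shown in the emp data."""
    return datetime.strptime(value, source_format).strftime(HIVE_TIMESTAMP)


def write_emp_rows(rows, out):
    """Write rows as a comma-delimited flat file, normalising both date columns."""
    writer = csv.writer(out, lineterminator="\n")
    # Column meanings (salary, birth date, hire date, department id) are guesses.
    for emp_id, first, last, salary, birth, hired, dept_id in rows:
        writer.writerow([emp_id, first, last, salary,
                         to_hive_timestamp(birth), to_hive_timestamp(hired), dept_id])


# Two employees from the slide; the DD.MM.YYYY input dates are hypothetical.
raw_rows = [
    (1, "Hans", "Meier", 3000, "02.02.1968", "01.01.2000", 1),
    (2, "Stefan", "Müller", 5000, "15.10.1970", "01.07.2001", 1),
]

buffer = io.StringIO()
write_emp_rows(raw_rows, buffer)
emp_txt = buffer.getvalue()
print(emp_txt, end="")
```

Writing to a real file instead of the in-memory buffer only requires passing an opened file (for example opened with newline="" and UTF-8 encoding) to write_emp_rows; the resulting emp.txt can then be copied into HDFS as described in the first steps.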