Enterprise Analytics Platform
Tiber Training – May 19, 2017
Scott Person
Training Agenda
• Data Lakes: Problem, Solution, & Benefits
• Demo
• Hands On
• Questions
• Review
Problem
Demand
• IT is an obstacle rather than a partner
• Business Leader – "Moving from concept to innovation is slow and expensive"
• Analyst – "Either throwing data away, not using it at all, or waiting to get access"
Supply
• IT struggles to consume, secure, and expose the data produced by the enterprise
• Current solutions are rigid and lack the ability to surge capacity
• Storage does not scale elastically
Specific examples:
• Repeatedly running out of storage, both for archived raw data and on our analytics platforms (virtual machines)
• Months to get an indicator exposed in the EDW (enterprise data warehouse)
Problem: Opportunity & Solution
"You can land your data in one place, then access it with various tools depending on need."
– Harsh Singh, Microsoft MVP
Solution (a minimal landing sketch follows this list):
• Storage – Amazon S3 or Azure Data Lake
• Metadata – Amazon Elasticsearch or Azure Data Catalog
• New Analytics – hosted Hadoop; MPP engines; Spark; Power BI
• Existing Analytics – Tableau, Business Objects, SAS, Excel
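The quote above captures the core pattern: land raw data once, then let each tool read it in place. As a minimal sketch of the landing step – assuming the AWS option, boto3 installed with credentials configured, and a hypothetical bucket named enterprise-data-lake:

import boto3

s3 = boto3.client("s3")

# Land a raw extract in the lake unchanged; no schema is imposed at write time.
s3.upload_file("flights_2011_09.csv",           # local raw extract (hypothetical)
               "enterprise-data-lake",          # bucket name (hypothetical)
               "raw/airline/flights_2011_09.csv")

# Any tool that speaks S3 (Hive, Spark, Redshift Spectrum, pandas, ...) can now
# read the same object; each consumer applies its own schema at read time.
obj = s3.get_object(Bucket="enterprise-data-lake",
                    Key="raw/airline/flights_2011_09.csv")
print(obj["Body"].readline())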
Solution Architecture
[Architecture diagram: on-premises sources (files, Oracle, email, documents, and sources yet unknown) are extracted and loaded (EL – Extract & Load) into cloud storage with an accompanying metadata layer; compute engines sit on top and read the data in place.]
Benefits
QUANTITATIVE
• 10-20x cheaper storage than traditional on-premises data solutions – Bill Schmarzo, CTO, Dell EMC Services
• Regular data lake storage runs $40/TB/month, decreasing to $6/TB/month for cold storage (AWS)
• Data is transformed and cleansed only when needed
  – Lower cost
  – Maintains data fidelity
Indicative pricing (a quick arithmetic check follows this slide):
• Storage: $0.04–$0.006/GB/month
• Bandwidth: charged only for traffic out of region
• Compute: $0.03/node/minute
• Catalog: $1/month/user
• Power BI: $10/month/user
QUALITATIVE
• Goal: provide more analytical power and flexibility than a traditional data warehouse, at lower cost than traditional on-premises raw storage
• Facilitates traditional data warehousing through persistent staging
• Opens the door to industry-leading analytical tools – Hadoop, Spark, Redshift, machine learning – while avoiding the complexities of running a cluster
USE CASES:
• Persistent staging
• Data analytics
  – Exploration
  – Ad-hoc analysis, e.g., log files
• No more silos
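A quick arithmetic check of the storage rates quoted above (a sketch only; real pricing varies by region, tier, and retrieval charges):

# Monthly storage cost at the quoted AWS rates.
GB_PER_TB = 1000       # cloud storage is priced per decimal GB
STANDARD = 0.04        # $/GB/month, regular data lake storage
COLD = 0.006           # $/GB/month, cold storage

for size_tb in (1, 10, 100):
    std = STANDARD * GB_PER_TB * size_tb
    cold = COLD * GB_PER_TB * size_tb
    print(f"{size_tb:>4} TB: standard ${std:,.0f}/month, cold ${cold:,.0f}/month")

# 1 TB -> $40 standard / $6 cold, matching the $40/TB and $6/TB figures above.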
Data Warehouse vs Data Lake
http://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
Demo
• MS Azure solution (much lower cost at rest than AWS)
  – Data Catalog (search for airline flight data)
  – Visual Studio – show code; make a connection with the file URL from the catalog
  – Two ways to submit jobs (a scripted sketch follows this list)
    ▪ Visual Studio
    ▪ Azure Portal
  – Azure Portal
    ▪ Explore content
    ▪ Jobs list
    ▪ Comparison job graph
      – Progress
      – Time
      – Read
      – Input/output files
• So what do you do with the output?
  – Power BI
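Beyond the two interactive paths above, submission can also be scripted. A minimal sketch, assuming Azure CLI 2.0 with the az dla (Data Lake Analytics) commands installed and logged in; the account and script names are hypothetical:

import subprocess

# Submit the demo U-SQL script to a Data Lake Analytics account via the Azure CLI.
script = open("airline_cancellations.usql").read()   # hypothetical script file
result = subprocess.run(
    ["az", "dla", "job", "submit",
     "--account", "tiberadla",                # ADLA account name (hypothetical)
     "--job-name", "airline-cancellations",
     "--script", script],
    capture_output=True, text=True, check=True)
print(result.stdout)   # JSON describing the submitted job (id, state, ...)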
Hands On
https://portal.azure.com
Username: nnnnnnnnnn
Password: nnnnnnnnnn
1. Click on “YouNameHere”
2. Click “Duplicate Script”
3. Edit Job Name – make it your name
4. Change the output filename in line 46 to be yourname.csv (line 46 is the OUTPUT … TO statement; compare the sample U-SQL script in the appendix)
5. Click Submit Job
Wrap Up
• Questions?
Review – Questions not for Jim/Greg/Matt/Alison
1. What is a data lake?
2. What do we use a data lake for?
3. How does a data lake work?
Supplemental Slides
APPENDIX
Sample U-SQL Script – Schema on Read on Azure
Ran in 1.2 minutes against 6M records in uncompressed flat files – no indexing or other
optimizations. Cost roughly $0.30.
@flights =
    EXTRACT Year string,
            Month string,
            Day_Of_Month string,
            Day_Of_Week string,
            Unique_Carrier string,
            // ... remaining columns elided on the slide
    FROM "/AirlineData/{*}.csv"            // schema-on-read over every CSV in the folder
    USING Extractors.Text(delimiter: ',');

@res =
    SELECT Convert.ToDecimal(Day_Of_Month) AS Day_Of_Month,
           COUNT(Cancelled) AS Cancelled_Count   // cancelled flights per day
    FROM @flights
    WHERE Convert.ToDecimal(Cancelled) > 0
          AND Convert.ToInt16(Month) == 9        // September only
    GROUP BY Day_Of_Month;

OUTPUT @res
TO "/Output/AirlineDataYear.csv"
ORDER BY Day_Of_Month
USING Outputters.Csv(outputHeader : true, quoting : false);
Business Model
• Provide basic functionality and services
  – Storage and metadata management
  – Processes for access approval
  – Documentation for analytical tool configuration
  – Eat our own dog food
• Analytics data source for the enterprise – and we can store all data more cheaply than any other solution
• Inevitability – someone will (or should) do this
Route to Market
1. Identify a specific business problem
2. Determine the appropriate cloud platform
3. Implement basic functionality
– Storage
– Metadata management
– Baseline analytics tools
4. Expand services (a sketch of the file API follows this list)
– API for storing and retrieving files
– Bring your own cluster and tools
– NT Domain integration
– Glacier archiving
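The file API in step 4 could begin as a thin wrapper over the storage layer. A minimal sketch, assuming the AWS option with boto3; the bucket name and key scheme are hypothetical, and access approval and auditing would wrap these calls:

import boto3

BUCKET = "enterprise-data-lake"   # hypothetical bucket name
s3 = boto3.client("s3")

def store(dataset: str, filename: str, local_path: str) -> str:
    """Store a file under a dataset prefix and return its key in the lake."""
    key = f"raw/{dataset}/{filename}"
    s3.upload_file(local_path, BUCKET, key)
    return key

def retrieve(key: str, local_path: str) -> None:
    """Fetch a file from the lake to a local path."""
    s3.download_file(BUCKET, key, local_path)

# Usage:
# key = store("airline", "flights_2011_09.csv", "/tmp/flights_2011_09.csv")
# retrieve(key, "/tmp/copy.csv")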
Catalog of Technologies
• AWS
– S3 – distributed storage service in AWS (near-infinite scale and extremely durable)
– DynamoDB – NoSQL database; in the data lake reference implementation it stores an audit trail for each file
– AWS Lambda – serverless code (no servers or containers to manage); in the reference implementation it runs the scripts that maintain the metadata (see the sketch after this list)
– Amazon Elasticsearch Service – provides the metadata search capability for the reference implementation catalog
– AWS Glue – emerging AWS service providing data integration and catalog functionality (possibly based on Herd from FINRA)
• Azure
– Azure Data Lake – data lake storage and query functionality based on Hadoop
– Azure Data Catalog
– Azure Data Factory – Azure's data integration service
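To illustrate how the Lambda and DynamoDB pieces above fit together: a sketch of a Lambda handler, triggered by S3 object creation, that appends one audit record per file. The table name and item fields are hypothetical, not the reference implementation's actual schema:

import boto3
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb")
audit_table = dynamodb.Table("data-lake-audit")   # hypothetical table name

def handler(event, context):
    # An S3 put event carries one or more records describing the new objects.
    for record in event["Records"]:
        audit_table.put_item(Item={
            "object_key": record["s3"]["object"]["key"],   # partition key (assumed)
            "event_time": datetime.now(timezone.utc).isoformat(),
            "bucket": record["s3"]["bucket"]["name"],
            "size_bytes": record["s3"]["object"].get("size", 0),
            "event_name": record["eventName"],             # e.g. "ObjectCreated:Put"
        })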
Sample Code Output Imported into Excel
[Excel column chart: "Cancelled Flights September 2011" – Cancelled_Count plotted by day of month (1–30); y-axis 0–20,000 cancellations.]
Sample Code
• Hive external table (accessible by Hive, Impala, and Presto)
  – CREATE EXTERNAL TABLE myTable (key STRING, value INT)
    LOCATION 's3n://mybucket/myDir';
• Copy data from S3 into a Redshift table
  – COPY <table_name>
    FROM 's3://<bucket_name>/<manifest_file>'
    <authorization>
    MANIFEST;
• Query S3 directly using Redshift Spectrum (not yet available in GovCloud)
  – CREATE EXTERNAL TABLE <table_name> (<field_list>)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3://mybucket/myDir/';
Challenges
• Security certifications
  – Azure
    ▪ Data Lake is not FedRAMP authorized (HDInsight authorization is in progress)
    ▪ Catalog is not FedRAMP authorized (it may not need to be – it contains no business data)
  – AWS
    ▪ Elasticsearch Service is not FedRAMP authorized and is not available in GovCloud (it may not need to be – it contains no business data and runs on EC2)
• Institutional direction – cloud storage is very low-hanging fruit; we need to prevent silos
• Perceived competition with other initiatives