- Big Data Science Training


Facebook, Twitter and Google generate petabytes of data every day.

The Large Hadron Collider project discards large amounts of data it cannot analyse, hoping that nothing valuable has been thrown away.
Interesting facts, but why is Big Data important?
Let's understand via an example.
[Slide: an insurance company and a bank asking what the optimal price is that maximises profit, traditionally answered with 3rd-party surveys and expert debates, and then with a Decision Support / Data Warehousing System.]
[Slide: the Decision Support System in detail: competitors' pricing, web activity, market trends and transaction statistics are collected into a repository (the data warehouse); statistical algorithms run over it to arrive at the optimal price and maximise profit.]
Volume, Variety, Velocity: the three Vs of Big Data.
[Slide: the same decision-support / data-warehouse diagram repeats, now annotated with the three Vs of the incoming data.]
Data is the fundamental block of a Digital Nervous System: organisations behaving like a biological nervous system, doing business at the speed of thought.
The loop: Sense, Interpret, Decide, Act (think Skynet or Avatar).
[Slide: the bank's repository of competitors' pricing, web activity, market trends and transaction statistics drives an immediate action, e.g. a mobile alert offering travel insurance at the optimal price.]
From Gartner:

4.4 Million IT Jobs Globally to Support Big Data By 2015.
International Data Corporation’s (IDC) 6th annual study:

From 2005 to 2020, the digital universe will grow by a factor of 300, from 130
exabytes to 40,000 exabytes, or 40 trillion gigabytes

More than 5,200 gigabytes for every man, woman, and child in 2020.

From now until 2020, the digital universe will about double every two years.

By 2020, 33% of digital data might be valuable if analysed, compared with 25% today.
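A quick sanity check (my arithmetic, not from the slides) shows the two IDC figures are roughly consistent:

\[
\frac{40{,}000\ \text{EB}}{130\ \text{EB}} \approx 308 \approx 2^{8.3},
\qquad
\frac{2020 - 2005}{8.3} \approx 1.8\ \text{years per doubling},
\]

i.e. growth by a factor of about 300 over 15 years corresponds to the digital universe doubling roughly every two years.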
Hadoop timeline:
1996-2000: the Big Data problem faced by all search engines (Dreadnaught).
2003-04: Google publishes the Google File System and MapReduce papers.
2005-06: Hadoop spawns off Nutch (Doug and Mike).
2010: Doug joins Cloudera; 0.xx releases of Hadoop.
2013: YARN / MapReduce 2 / Next Generation Hadoop.
Price Advantage:
1. Clusters use commodity hardware, cheaper than one expensive server.
2. The software license is free.
[Slide: Hadoop core architecture. The user submits a MapReduce job; HDFS (based on the Google File System) stores files across data nodes coordinated by a name node; MapReduce (based on Google's MapReduce) runs map tasks on the data nodes and a reduce phase combines their outputs.]
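To make the map and reduce phases concrete, below is a minimal word-count sketch against Hadoop's org.apache.hadoop.mapreduce API (my illustration, not from the slides); the input and output arguments are whatever HDFS paths you pass on the command line.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each map task processes one input split (typically one HDFS block)
    // and emits a (word, 1) pair for every token it sees.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce phase receives all counts for one word and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // an HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with something like: hadoop jar wordcount.jar WordCount <hdfs input dir> <hdfs output dir>. Each map task ideally runs on the data node that already holds its block, and the reduce phase sums the per-word counts.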
Hadoop ecosystem:
HDFS and MapReduce at the core.
Pig (from Yahoo) and Hive (from Facebook) on top of MapReduce; HBase on top of HDFS.
Sqoop/Flume to ingest data from structured stores; Chukwa for log collection; Kafka as a message broker.
Oozie for workflow scheduling and Storm for stream processing round out the stack.
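As a small, concrete taste of the structured-store layer, here is a minimal HBase put/get sketch using the standard Java client API (my illustration; the table name, row key and column names are hypothetical, and the cluster location comes from hbase-site.xml on the classpath).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("policy_prices"))) {  // hypothetical table

            // Write one cell: row key = a policy id, column family "d", qualifier "optimal_price".
            Put put = new Put(Bytes.toBytes("policy-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("optimal_price"), Bytes.toBytes("118.50"));
            table.put(put);

            // Read the same cell back; HBase serves it from its HDFS-backed storage.
            Result result = table.get(new Get(Bytes.toBytes("policy-42")));
            String price = Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("optimal_price")));
            System.out.println("optimal_price for policy-42 = " + price);
        }
    }
}

Pig and Hive express this kind of access declaratively and compile down to MapReduce jobs, which is why they sit above MapReduce in the stack.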
Complex algorithm on a small dataset vs. simple algorithm on a large dataset:
1. Complex algorithms need to be correctly sensitive to weak correlations.
2. Complex algorithms are thus difficult to code and design.
Data Engineer vs. Data Scientist
Role: a Data Engineer engineers software solutions; a Data Scientist solves business problems using data.
Skills: a Data Engineer needs more programming and technical skills and the ability to architect technical solutions; a Data Scientist needs strong mathematical skills and an understanding of statistical models.
Hadoop distributions:
-> Skeleton version.
-> All the ecosystem members need to be installed additionally.
-> Important ecosystem members included.
-> Proprietary Hadoop code written in C.
-> Integrated with Hadoop ecosystem members.
-> A few proprietary tools, like Enterprise Manager.
-> Based on Apache Hadoop.
-> Supports the .NET framework.
-> Launches its own Hadoop distribution: Pivotal HD.
Thank You!!!
Superstar Doug, a small fan (me), and the real Hadoop.