Big Data – Making Sense of the Hype: Turning Hype into Value

BIG DATA ANALYTICS–
WHAT IT MEANS TO THE AUDIT
COMMUNITY – PART 1
Markus Hardy, PMP, CISSP, CISA
As a Managing Principal of a new start-up venture designed to develop and deploy Next
Generation technologies, collaborative learning systems and workplaces, Markus brings a
wealth of knowledge and experience to enable companies to transform by deploying the Next
Generation of processes, skills and systems. He has over 20 years of private and public
advanced systems knowledge and experiences to add to his consulting work which includes
previously managing Defense Advanced Research Projects Agency (DARPA) related projects
while on Active Duty in the US Army. In his corporate endeavors, he has helped companies
and universities with technology transfer initiatives within their Innovation and Incubation
Centers. He has provided thought leadership for strategic initiatives in Health Care,
Education, Financial Services, Supply Chain Management, Process Improvement, Business
Transformation, Mergers & Acquisitions, and research for business development in emerging
markets.
He has conducted extensive research in order to provide value to key players engaged in
systems development efforts to ensure compliance with the Affordable Care Act
(ACA)/HealthCare. In addition, he is engaged in projects that create advanced Collaborative
WorkLife Systems to include the delivery of Virtual Knowledge, and Big Data Analytics. Prior
to this original research on the emerging Big Data space, he helped to develop an Analytics
Maturity Model (AMM) which was well received as a presentation to leaders at a major health
services company. He has been instrumental in leading an Initial Public Offering (IPO) for a
local financial services conglomerate and has provided investor relations services both inside
companies and on a professional consulting basis.
As an entrepreneur, he built and owned a software development company. He holds a BS
from Indiana University as a Distinguished Military Graduate (DMG) and a Master of Science
in Hospital Administration from Central Michigan. He is a Commissioned Army Officer with
over 17 years on Active Duty with command and staff assignments with US Army Europe, 2nd
Infantry Division/Korea and US Forces Command (FORSCOM).
CONFIDENTIAL - NOT FOR DISTRIBUTION
Agenda
• Big Data
  • What is it / Types / Stats / New forms and growth
  • Value it might bring
  • The four “V’s”
  • Concerns
• Analytics, Business Intelligence and Big Data Analytics
  • What are they and how are they different
  • Stacks and the different types
  • Hadoop Ecosystem
• Auditing Big Data
  • Planning, Scope, Considerations
  • Operational Considerations
What is Big Data?
“Big data is high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization.”
Laney, Douglas. “The Importance of 'Big Data': A Definition”. Gartner. Retrieved 21 June 2012.
“Elements of Big Data” include:
• The degree of complexity within the data set
• The amount of value that can be derived from innovative
  vs. non-innovative analysis techniques
• The use of longitudinal information supplements the analysis
“Mike2.0 Web Site”
Webopedia
What is Big Data?
“Big data is a buzzword, or catch-phrase, used to describe a
massive volume of both structured and unstructured data
that is so large that it's difficult to process using traditional
database and software techniques.”
Webopedia
Big Data Types and Stats
“5 billion - Mobile phones in use in 2010”
“30 billion - Pieces of content shared on Facebook every month”
“40% - Projected growth in global data per year vs. 5% growth in global IT spending”
“235 terabytes - Data collected by the US Library of Congress by April 2011”
“15 out of 17 - Market sectors with more data stored per company than the US Library of Congress”
“Big Data: The next frontier for innovation, competition and productivity”. McKinsey Global Institute (MGI), May 2011.
Big Data – New forms and growth
Over the next 3 to 6 years:
“> 1 billion – New smart phones will enter service”
“> 3 billion – More IP-enabled devices will enter cyberspace by 2015”
“> 4.9 million – More patients will use remote health monitoring devices”
“> 142 million – Additional healthcare and medical app downloads”
“Big Data: The next frontier for innovation, competition and productivity”. McKinsey Global Institute (MGI), May 2011.
“CSC Infographic”
Big Data – An example of explosive growth
Worldwide healthcare data is expected to grow 50 times:
2012: 500 petabytes
2020: 25,000 petabytes
“Big Data: The next frontier for innovation, competition and productivity”. McKinsey Global Institute (MGI), May 2011.
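The 500 and 25,000 petabyte figures can be turned into an implied yearly growth rate with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes smooth year-over-year compounding between 2012 and 2020, which the source does not state.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Return the constant yearly growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures from the slide: 500 PB in 2012 and 25,000 PB in 2020.
start_pb, end_pb = 500, 25_000
growth_multiple = end_pb / start_pb  # overall growth over the whole period
rate = compound_annual_growth_rate(start_pb, end_pb, 2020 - 2012)

print(f"Overall growth: {growth_multiple:.0f}x")
print(f"Implied annual growth: {rate:.1%}")
```

Under that compounding assumption, a 50-times increase over eight years works out to a little over 60% growth per year, well above the 40% global data growth figure quoted earlier.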
“CSC Infographic”
Value of Big Data?
“$300 billion – Potential annual value to US health care. More
than double the total annual health care spending in Spain”
“€250 billion – Potential annual value to Europe’s public sector
administration. This is more than the GDP of Greece”
“140,000 to 190,000 – More positions required for people with
deep analytical talent”
“60% – Potential increase in retailers’ operating margins with use
of Big Data”
“1.5 million – More data-savvy managers needed
who can take advantage of Big Data”
“Big Data: The next frontier for innovation, competition and productivity”. McKinsey Global Institute (MGI), May 2011.
Value of Big Data
“Big Data is important, but so is your existing RDBMS, your SAP
ERP System, and your Salesforce.com data. The new kid on the block
needs to play well with what you have already.”
“What good is Big Data if you can’t get to it readily, or integrate it
with your business?”
Vincent Lam, Marketing Director, Information Builders, “Best Practices Series, Data Management in the Era of Big Data”, TDWI Magazine, March 2012.
“CSC Infographic”
Volume – exceeds physical limits of vertical scalability
Velocity – decision window small compared to data change rate
Variety – many different formats make integration expensive
Variability – many options or variable interpretations confound analysis
[Figure: Traditional BI and Big Data regions plotted on Volume and Velocity axes]
Extremes of Volume or Velocity may be better handled by BI up to a point.
As data Variety and/or Variability increase, Big Data becomes more attractive.
“Promotional Webinar: Expand Your Digital Horizon with Big Data,” Brian Hopkins and Boris Evelson, Forrester, September 7, 2011.
Big Data solutions trade off consistency and integrity for speed and flexibility.
[Figure: steps to reach Value, along an axis of Increasing Time and Cost]
Big Data: Source Data → Transformation and Analysis → Value
Traditional BI: Source Data → Integration → Analysis → Value
“Promotional Webinar: Expand Your Digital Horizon with Big Data,” Brian Hopkins and Boris Evelson, Forrester, September 7, 2011.
Concerns
• Big data is so large it overburdens traditional data management solutions
• New enterprise trends & insights awaiting discovery may not be optimized
• Unstructured content is on pace to overwhelm traditional structured
  data and needs to be unlocked
• Imperative that companies manage Big Data in a holistic fashion, not
  as a silo, so they can provide contextual meaning of business performance
• Big Data technologies (Hadoop) can enable useful processing
  of data in batch mode. May have challenges with small data sets
• IT must now invest $$$ to provide Big Data extracts to avoid
  “spinning the hourglass”
What a Big Data challenge might seem like to some:
http://www.youtube.com/watch?NR=1&v=8NPzLBSBzPI&feature=fvwp
TDWI, Best Practices Series, Data Management in the Era of Big Data. Database Trends and Applications Magazine (DBTA), March 2011.
Analytics
Analytics is the discovery and communication of meaningful patterns in data.
Especially valuable in areas rich with recorded information, analytics
relies on the simultaneous application of statistics, computer programming
and operations research to quantify performance.
Analytics often favors data visualization to communicate insight.
“Google Infographic”
Business Intelligence
Business intelligence (BI) is a set of theories, methodologies, processes,
architectures, and technologies that transform raw data into meaningful
and useful information.
BI can handle large amounts of information to help identify and
develop new opportunities.
Making use of new opportunities and implementing an effective
strategy can provide a competitive market advantage and long-term
stability.
The BI Stack
Microsoft BI Stack
“Microsoft Corp Infographic”
Oracle BI Stack
“Oracle Corp Infographic”
Big Data Analytics Stack
Big Data Analytics is the process of analyzing a collection of data sets
so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing,
analysis, and visualization.
Preference is given to direct-attached storage (DAS) in its various forms,
from solid state disk (SSD) to high capacity SATA disks buried inside
parallel processing nodes.
“Cloudera Corp Infographic”
Hadoop Ecosystem (Expanded View)
[Diagram: Hadoop ecosystem, expanded view. Components by layer:]
Security
Access – Public | Hybrid | Private Cloud
UI | Portal – Self-Service via Web Browser
(Lab) – SDLC; Test User 1, 2, ...
System Applications – JINFONET/JReport, Birst, Pentaho, Eclipse/BIRT, RapidMiner, SpagoBI, Jaspersoft, R/Bioconductor
Customer Defined Applications
Programming \ Services (C++; C#; Python, R)
Hadoop Core – HDFS + MapReduce, with HBase/BigTable, Hive QL, Pig, Mahout, ZooKeeper
Support / Audit – Chukwa
MPP/Columnar DB / NoSQL DB
Evolutionary Schemas | Staging
Ingress | Egress | Bulk Data Loading
Infrastructure – Commodity Virtual Servers; Mounted Shared Storage Visible to all Hosts (iSCSI, Fibre Channel, NFS, SMB)
Hadoop Related Technologies
•Hadoop, Apache's free and open source implementation of MapReduce.
•Pentaho - Open source data integration (Kettle), analytics, reporting,
visualization and predictive analytics directly from Hadoop nodes
•Nutch - An effort to build an open source search engine based on Lucene
and Hadoop, also created by Doug Cutting
•Datameer Analytics Solution (DAS) - data source integration, storage,
analytics engine and visualization
•Apache Accumulo - Secure Big Table
•HBase - BigTable-model database
Hadoop Related Technologies
•Hypertable - HBase alternative
•Apache Cassandra - A column-oriented database that supports access from
Hadoop
•HPCC - LexisNexis Risk Solutions High Performance Computing Cluster
•Sector/Sphere - Open source distributed storage and processing
•Algorithmic skeleton - A high-level parallel programming model for parallel
and distributed computing
•MongoDB - A scalable, high-performance, open source NoSQL database
•MapReduce-MPI - MapReduce-MPI Library
Map Reduce 101
Map (Key 1, Value 1) → List (Key 2, Value 2)
[Diagram: a Client submits a Job Request to the Hadoop MapReduce Master Node,
which hands Job Parts to Map and Reduce tasks. Map tasks read the INPUT DATA;
Reduce tasks write the OUTPUT DATA.]
Hadoop Basic Tasks
Input reader: (1)
• Divides input into appropriately sized 'splits' (in practice typically 16 MB to 128
MB)
• Framework assigns one split to each Map function.
• Reads data from stable storage (typically a distributed file system) and
generates key/value pairs.
Example. Read a directory full of text files and return each line as a record.
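The directory example above can be sketched in a few lines of plain Python. This is not the Hadoop input reader API: `read_records` is a hypothetical helper, the (file name, line number) key is an assumption chosen for illustration, and the division of input into 16 MB to 128 MB splits is left out.

```python
import os
import tempfile


def read_records(directory):
    """Yield (key, value) pairs: key is (file name, line number), value is the line text."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                yield (name, line_number), line.rstrip("\n")


# Usage: build a small throwaway directory and read it back as records.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("big data\nbig audit\n")
    with open(os.path.join(tmp, "b.txt"), "w", encoding="utf-8") as f:
        f.write("data analytics\n")
    records = list(read_records(tmp))

for key, value in records:
    print(key, value)
```

Each record becomes one key/value pair that the framework would hand to a Map function.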
Hadoop Basic Tasks
Map function: (2)
• Takes series of key/value pairs, processes each, and generates zero or more
output key/value pairs.
• Input and output types of the map can be (and often are) different from each
other.
• If the application is doing a word count, the map function would break the line
into words and output a key/value pair for each word.
• Each output pair would contain the word as the key and the number of
instances of that word in the line as the value.
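A minimal sketch of the word count Map function described above, in plain Python rather than the Hadoop API. The name `map_word_count`, the lower-casing and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def map_word_count(key, line):
    """Emit one (word, count) pair per distinct word in the line.

    The input key (for example a file name and line number) is not needed
    for word counting, which shows that input and output types can differ.
    """
    counts = Counter(line.lower().split())
    return list(counts.items())


pairs = map_word_count(("a.txt", 1), "big data is big")
print(pairs)
```

An empty line produces no output pairs, which matches the "zero or more" wording above.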
Hadoop Basic Tasks
Partition function: (3)
• Each Map function output is allocated to a particular reducer by the
application's partition function for sharding purposes.
• The partition function is given the key and the number of reducers and returns
the index of the desired reducer.
• A typical default is to hash the key and use the hash value modulo the number
of reducers. It is important to pick a partition function that gives an
approximately uniform distribution of data per shard for load-balancing
purposes, otherwise the MapReduce operation can be held up waiting for
slow reducers (reducers assigned more than their share of data) to finish.
• Between the map and reduce stages, the data is shuffled (parallel-sorted /
exchanged between nodes) in order to move the data from the map node that
produced it to the shard in which it will be reduced. The shuffle can sometimes
take longer than the computation time depending on network bandwidth, CPU
speeds, data produced and time taken by map and reduce computations.
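The hash-modulo default and the shuffle step can be sketched as follows. This is a single-process illustration, not Hadoop's partitioner: `partition` and `shuffle` are hypothetical names, and CRC32 is an assumed stand-in for whatever hash a real framework uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Return the index of the reducer responsible for this key."""
    # CRC32 gives the same value on every run and every machine. Python's
    # built-in hash() is randomized per process for strings, so it would
    # not route a key consistently across nodes.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs by the reducer shard they are assigned to."""
    shards = defaultdict(list)
    for key, value in map_outputs:
        shards[partition(key, num_reducers)].append((key, value))
    return dict(shards)


map_outputs = [("big", 2), ("data", 1), ("big", 1), ("audit", 1)]
shards = shuffle(map_outputs, num_reducers=3)
for index in sorted(shards):
    print(index, shards[index])
```

Because the reducer index depends only on the key, every pair for the same word lands in the same shard. A skewed key distribution would still overload one shard, which is the load-balancing risk noted above.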
Hadoop Basic Tasks
Comparison function: (4)
• The input for each Reduce is pulled from the machine where the Map ran and
sorted using the application's comparison function.
Reduce function: (5)
• The framework calls the application's Reduce function once for each unique
key in the sorted order.
• Reduce can iterate through the values that are associated with that key and
produce zero or more outputs.
• In the word count example, the Reduce function takes the input values, sums
them and generates a single output of the word and the final sum.
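The sort and Reduce steps for the word count example can be sketched in plain Python. The names are hypothetical, and Python's default string ordering stands in for the application's comparison function.

```python
from itertools import groupby
from operator import itemgetter


def reduce_word_count(word, counts):
    """Sum the partial counts for one word and emit a single (word, total) pair."""
    return [(word, sum(counts))]


def run_reducer(shard_pairs):
    """Sort the shard by key, then call the Reduce function once per unique key."""
    results = []
    # Sorting by key stands in for the comparison function.
    ordered = sorted(shard_pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        results.extend(reduce_word_count(word, (count for _, count in group)))
    return results


totals = run_reducer([("data", 1), ("big", 2), ("big", 1), ("audit", 1)])
print(totals)
```

Sorting first is what lets `groupby` present all values for a key together, so the Reduce function runs exactly once per unique key.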
Output writer: (6)
• Writes the output of the Reduce to the stable storage, usually a distributed file
system.
Transition – Big Data to Big Data Analytics
Big Data Analytics
Big data analytics systems thrive on system performance, commodity
infrastructure, and low cost.
Real or near-real time information delivery is one of the defining
characteristics of big data analytics. Latency is therefore avoided
whenever and wherever possible.
Data in memory is good; data on spinning disk at the other end of a SAN
connection is not.
IBM’s View of Big Data
“IBM Corp Infographic”
IBM’s View of Big Data
“IBM Corp Infographic”