BIG DATA ANALYTICS – WHAT IT MEANS TO THE AUDIT COMMUNITY – PART 1
Markus Hardy, PMP, CISSP, CISA
CONFIDENTIAL - NOT FOR DISTRIBUTION

As Managing Principal of a new start-up venture designed to develop and deploy next-generation technologies, collaborative learning systems and workplaces, Markus brings a wealth of knowledge and experience to help companies transform by deploying the next generation of processes, skills and systems. He has over 20 years of private- and public-sector advanced systems experience to add to his consulting work, including managing Defense Advanced Research Projects Agency (DARPA) related projects while on active duty in the US Army. In his corporate endeavors he has helped companies and universities with technology transfer initiatives within their innovation and incubation centers, and he has provided thought leadership for strategic initiatives in health care, education, financial services, supply chain management, process improvement, business transformation, mergers and acquisitions, and business development research in emerging markets. He has conducted extensive research to provide value to key players engaged in systems development efforts to ensure compliance with the Affordable Care Act (ACA). In addition, he is engaged in projects that create advanced collaborative work-life systems, including the delivery of virtual knowledge and Big Data analytics. Prior to this original research on the emerging Big Data space, he helped develop an Analytics Maturity Model (AMM) that was well received when presented to leaders at a major health services company. He has been instrumental in leading an Initial Public Offering (IPO) for a local financial services conglomerate and has provided investor relations services both inside companies and on a professional consulting basis. As an entrepreneur, he built and owned a software development company. He holds a BS from Indiana University as a Distinguished Military Graduate (DMG) and a Master of Science in Hospital Administration from Central Michigan University. He is a commissioned Army officer with over 17 years on active duty, with command and staff assignments in US Army Europe, 2nd Infantry Division/Korea and US Forces Command (FORSCOM).

Agenda
• Big Data
  • What it is / types / stats / new forms and growth
  • The value it might bring
  • The four "V's"
  • Concerns
• Analytics, Business Intelligence and Big Data Analytics
  • What they are and how they differ
  • Stacks and the different types
  • Hadoop ecosystem
• Auditing Big Data
  • Planning, scope, considerations
  • Operational considerations

What is Big Data?
"Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Source: Laney, Douglas. "The Importance of 'Big Data': A Definition." Gartner. Retrieved 21 June 2012.

"Elements of Big Data" include:
• The degree of complexity within the data set
• The amount of value that can be derived from innovative vs. non-innovative analysis techniques
• The use of longitudinal information to supplement the analysis
Source: MIKE2.0 web site.
What is Big Data?
"Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques."
Source: Webopedia.

Big Data Types and Stats
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month
• 40% projected growth in global data per year, vs. 5% growth in global IT spending
• 235 terabytes of data collected by the US Library of Congress as of April 2011
• 15 out of 17 market sectors have more data stored per company than the Library of Congress
Source: "Big Data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute (MGI), May 2011.

Big Data – New Forms and Growth
Over the next three to six years:
• > 1 billion new smartphones will enter service
• > 3 billion more IP-enabled devices will enter cyberspace by 2015
• > 4.9 million more patients will use remote health monitoring devices
• > 142 million additional healthcare and medical app downloads
Source: McKinsey Global Institute (MGI), May 2011.

[CSC infographic]

Big Data – An Example of Explosive Growth
Worldwide healthcare data, about 500 petabytes in 2012, is expected to grow roughly 50 times, to 25,000 petabytes, by 2020.
Source: McKinsey Global Institute (MGI), May 2011.

[CSC infographic]

Value of Big Data?
• $300 billion – potential annual value of US health care data, more than double Spain's health care spending
• €250 billion – potential annual value to Europe's public sector administration, more than the GDP of Greece
• 140,000–190,000 – additional positions requiring deep analytical talent
• 60% – potential increase in retailers' operating margins through the use of Big Data
• 1.5 million – additional data-savvy managers needed who can take advantage of Big Data
Source: McKinsey Global Institute (MGI), May 2011.

Value of Big Data
"Big Data is important, but so is your existing RDBMS, your SAP ERP system, and your Salesforce.com data. The new kid on the block needs to play well with what you have already."
"What good is Big Data if you can't get to it readily, or integrate it with your business?"
Source: Vincent Lam, Marketing Director, Information Builders, in "Data Management in the Era of Big Data," TDWI Best Practices Series, March 2012.
[CSC infographic]

The Four "V's"
• Volume – exceeds the physical limits of vertical scalability
• Velocity – the decision window is small compared to the data change rate
• Variety – many different formats make integration expensive
• Variability – many options or variable interpretations confound analysis
Extremes of Volume or Velocity may be better handled by traditional BI up to a point; as data Variety and/or Variability increase, Big Data becomes more attractive.
Source: "Expand Your Digital Horizon With Big Data," Brian Hopkins and Boris Evelson, Forrester webinar, September 7, 2011.

Big Data solutions trade off consistency and integrity for speed and flexibility
[Diagram: a traditional BI pipeline moves from source data through integration and analysis to value, with increasing time and cost at each stage; a Big Data pipeline moves from source data through transformation and analysis directly to value.]
Source: "Expand Your Digital Horizon With Big Data," Brian Hopkins and Boris Evelson, Forrester webinar, September 7, 2011.

Concerns
• Big data is so large that it overburdens traditional data management solutions.
• New enterprise trends and insights awaiting discovery may go untapped.
• Unstructured content is on pace to overwhelm traditional structured data and needs to be unlocked.
• It is imperative that companies manage Big Data holistically, not as a silo, so it can provide contextual meaning for business performance.
• Big Data technologies (e.g., Hadoop) enable useful batch-mode processing of data but may have challenges with small data sets.
• IT must now invest significantly to provide Big Data extracts and avoid "spinning the hourglass."
What a Big Data challenge might seem like to some: http://www.youtube.com/watch?NR=1&v=8NPzLBSBzPI&feature=fvwp
Source: "Data Management in the Era of Big Data," TDWI Best Practices Series; Database Trends and Applications (DBTA), March 2011.

Analytics
Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.
[Google infographic]

Business Intelligence
Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information. BI can handle large amounts of information to help identify and develop new opportunities. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability.

The BI Stack
[Generic BI stack diagram]

Microsoft BI Stack
[Microsoft Corp infographic]

Oracle BI Stack
[Oracle Corp infographic]

Big Data Analytics Stack
Big Data analytics is the process of analyzing collections of data sets so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. Preference is given to direct-attached storage (DAS) in its various forms, from solid-state disk (SSD) to high-capacity SATA disks buried inside parallel processing nodes.
[Cloudera Corp infographic]

Hadoop Ecosystem (Expanded View)
[Diagram: the Hadoop core (HDFS + MapReduce) surrounded by its supporting layers – security; access (public | hybrid | private cloud); UI/portal; system and customer-defined applications; support/audit (Chukwa); infrastructure (commodity virtual servers with self-service via web browser, and mounted shared storage visible to all hosts over iSCSI, Fibre Channel, NFS or SMB); data stores (HBase/BigTable, MPP/columnar and NoSQL databases, evolutionary schemas and staging); ingress/egress and bulk data loading; processing and query tools (Hive QL, Pig, Mahout, ZooKeeper); reporting and analytics front ends (Pentaho, Eclipse/BIRT, RapidMiner, SpagoBI, Jaspersoft, JINFONET/JReport, Birst, R/Bioconductor); programming services (C++, C#, Python, R); and SDLC test environments and end users.]

Hadoop Related Technologies
• Hadoop - Apache's free and open source implementation of MapReduce
• Pentaho - open source data integration (Kettle), analytics, reporting, visualization and predictive analytics directly from Hadoop nodes
• Nutch - an effort to build an open source search engine based on Lucene and Hadoop, also created by Doug Cutting
• Datameer Analytics Solution (DAS) - data source integration, storage, analytics engine and visualization
• Apache Accumulo - secure BigTable implementation
• HBase - BigTable-model database
• Hypertable - HBase alternative
• Apache Cassandra - a column-oriented database that supports access from Hadoop
• HPCC - LexisNexis Risk Solutions' High Performance Computing Cluster
• Sector/Sphere - open source distributed storage and processing
• Algorithmic skeletons - a high-level parallel programming model for parallel and distributed computing
• MongoDB - a scalable, high-performance, open source NoSQL database
• MapReduce-MPI - the MapReduce-MPI library

Map Reduce 101
[Diagram: a client submits a job request to the Hadoop master node, which splits the job into parts and distributes them across the cluster; Map tasks read the input data and turn each (Key 1, Value 1) pair into a list of (Key 2, Value 2) pairs, and Reduce tasks combine the mapped output into the final output data.]
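To make the flow in the diagram above concrete, here is a minimal, self-contained Python sketch of the same map / shuffle / reduce pattern applied to the classic word-count job. It is an in-memory illustration only, not Hadoop code; the function names (map_fn, shuffle, reduce_fn) and the sample input are illustrative assumptions.

from collections import defaultdict

# Map: each (key1, value1) pair becomes a list of (key2, value2) pairs.
# Here key1 is a line offset (ignored) and value1 is a line of text.
def map_fn(_offset, line):
    return [(word, 1) for word in line.split()]

# Shuffle: group every intermediate value by its key - the step Hadoop
# performs between the map and reduce stages.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: collapse the list of values for one key into an output record.
def reduce_fn(word, counts):
    return (word, sum(counts))

if __name__ == "__main__":
    input_data = ["big data big analytics", "big data audit"]
    intermediate = []
    for offset, line in enumerate(input_data):
        intermediate.extend(map_fn(offset, line))
    grouped = shuffle(intermediate)
    results = [reduce_fn(word, counts) for word, counts in sorted(grouped.items())]
    print(results)  # [('analytics', 1), ('audit', 1), ('big', 3), ('data', 2)]

Hadoop performs the same three steps, but distributes the map and reduce calls across many nodes and handles the shuffle over the network, as the basic tasks below describe.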
Hadoop Basic Tasks

(1) Input reader
• Divides the input into appropriately sized "splits" (in practice, typically 16 MB to 128 MB).
• The framework assigns one split to each Map function.
• Reads data from stable storage (typically a distributed file system) and generates key/value pairs.
• Example: read a directory full of text files and return each line as a record.

(2) Map function
• Takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs.
• The input and output types of the map can be (and often are) different from each other.
• If the application is doing a word count, the map function breaks the line into words and outputs a key/value pair for each word.
• Each output pair contains the word as the key and the number of instances of that word in the line as the value.
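As an illustration of the map step just described: Hadoop Streaming, a standard Hadoop facility that lets any executable reading stdin and writing stdout act as the mapper or reducer, allows the word-count map function to be written as a small script. The sketch below is an illustrative assumption, not code from this presentation; the tab-separated key/value output follows the Hadoop Streaming convention, and the file name is hypothetical.

#!/usr/bin/env python
# wordcount_mapper.py - illustrative word-count mapper for Hadoop Streaming.
# Reads lines of text from stdin and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Key and value are separated by a tab, the Hadoop Streaming default.
        print(f"{word}\t1")

The framework, not the script, then takes care of partitioning and shuffling these pairs, as described next.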
(3) Partition function
• Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes.
• The partition function is given the key and the number of reducers, and returns the index of the desired reducer.
• A typical default is to hash the key and use the hash value modulo the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers (reducers assigned more than their share of data) to finish.
• Between the map and reduce stages, the data is shuffled (parallel-sorted and exchanged between nodes) to move it from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation itself, depending on network bandwidth, CPU speeds, the amount of data produced, and the time taken by the map and reduce computations.

(4) Comparison function
• The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

(5) Reduce function
• The framework calls the application's Reduce function once for each unique key, in sorted order.
• The Reduce can iterate through the values associated with that key and produce zero or more outputs.
• In the word-count example, the Reduce function takes the input values, sums them, and generates a single output of the word and the final sum.

(6) Output writer
• Writes the output of the Reduce to stable storage, usually a distributed file system.
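To round out the word-count illustration: the sketch below shows the hash-modulo idea behind a typical default partition function, together with a Hadoop Streaming style reducer that assumes, as described in (4) and (5) above, that its input arrives sorted by key. This is a hedged example; in a real Hadoop job the partitioner runs inside the framework (normally as a Java class), and NUM_REDUCERS, the file name and the function name are illustrative assumptions.

#!/usr/bin/env python
# wordcount_reducer.py - illustrative word-count reducer for Hadoop Streaming,
# preceded by a stand-alone sketch of the hash-modulo partition idea.
import sys

NUM_REDUCERS = 4  # illustrative; in Hadoop this comes from the job configuration

def default_partition(key, num_reducers=NUM_REDUCERS):
    # Typical default: hash the key and take it modulo the number of reducers.
    # Shown only to illustrate the idea; Hadoop applies this inside the framework.
    return hash(key) % num_reducers

current_word, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")  # emit the completed key
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")  # flush the final key

The output writer step then corresponds to Hadoop collecting whatever the reducer prints and writing it back to the distributed file system.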
Transition – Big Data to Big Data Analytics

Big Data Analytics
Big data analytics systems thrive on system performance, commodity infrastructure, and low cost. Real-time or near-real-time information delivery is one of the defining characteristics of big data analytics, so latency is avoided whenever and wherever possible: data in memory is good; data on a spinning disk at the other end of a SAN connection is not.

IBM's View of Big Data
[IBM Corp infographics]