
Technology Insight Paper
Converged, Real-time Analytics
Enabling Faster Decision Making and New Business Opportunities
By John Webster
February 2015
Executive Summary
An expanding use of analytics tools that can converge and extract value from multiple data sources has fueled a growing interest in real-time information delivery. Business executives are discovering that information gleaned from real-time data sources—both internal and external to the enterprise—can have more relevance and value in a competitive business context than information that essentially looks backward in time. In addition, because results arrive faster, more business can be generated on a daily basis than with traditional business intelligence system architectures.

The power of new in-memory data processing technologies becomes apparent when business users can be offered new real-time analytics services that guide them toward competitive advantage on an ongoing basis. This approach makes in-memory capability independent of specific vendor-centric solutions (SAP HANA, for example), alleviating the need to buy, train for, and support a different one from each vendor as needs arise. An example of this approach can be found in GridGain's In-Memory Data Fabric, a software-only implementation that is available under the open source model or in an enterprise edition. Here we review the GridGain Data Fabric architectural approach as a way to extract value from multiple data sources in real time.

The Value of In-Memory Computing to the CEO
Applications built on databases have traditionally used mechanical disk to perform most of a system's storage functions. Today, enterprise disk systems can store huge amounts of data at a relatively low cost. Because DRAM memory has traditionally been much more expensive than disk, it is used as very high-speed temporary storage for database applications, and data on disk is paged in and out of DRAM. But what if most, if not all, of the data needed for an application could be economically stored in DRAM? Business intelligence (BI) application users could see performance gains of multiple orders of magnitude because the BI system would no longer need to continually read and write data to and from disk. Rather than getting reports built from data gathered the previous day or week, they could get the information they need to make informed, data-based decisions immediately and in real time.

Storing large amounts of data in memory is now economically possible for two reasons:

1. The cost of DRAM is continually decreasing while the storage capacity and performance of DRAM modules are continually increasing. Simply stated, users can plan on buying more DRAM performance and capacity for less, both now and in the foreseeable future.

2. In-Memory Data Grid technology allows the DRAM modules of individual servers to be "wired" together to form memory fabrics that span server clusters. DRAM storage for a BI application can be contiguously scaled upward in capacity, making it possible to store entire databases in DRAM and eliminating the time-consuming task of continuously paging data in and out of disk. Data is immediately available to the BI application and the business application user (a minimal code sketch of such a distributed in-memory store follows below).
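To make the data grid concept concrete, the following is a minimal sketch using the Apache Ignite Java API (the open source release of the GridGain Data Fabric). It assumes Ignite is on the classpath and simply starts a node that joins the cluster and stores key-value data in a distributed, partitioned in-memory cache; the cache name and sample data are illustrative assumptions, not part of any GridGain deployment described in this paper.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class InMemoryStoreSketch {
        public static void main(String[] args) {
            // Start a node; it discovers and joins other nodes in the cluster,
            // so the combined DRAM of all nodes forms one logical memory fabric.
            try (Ignite ignite = Ignition.start()) {
                // "customers" is a hypothetical cache; PARTITIONED mode spreads
                // the data set across the memory of every node in the cluster.
                CacheConfiguration<Long, String> cfg =
                    new CacheConfiguration<>("customers");
                cfg.setCacheMode(CacheMode.PARTITIONED);

                IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);

                // Reads and writes are served from memory rather than paged
                // to and from disk.
                cache.put(1L, "Acme Corp");
                System.out.println(cache.get(1L));
            }
        }
    }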
In the financial services industry, for example, in-memory technologies have become a key enabler of real-time analysis and information delivery to business users. In fact, some financial services firms now consider these technologies to be mission critical. However, simply having the ability to run in-memory analytics processes is not enough. Another important factor is the ability to integrate diverse data streams in real time as well.

Enterprise CEOs have been well aware for years that, to varying degrees, data drives their business models. This realization also makes them aware of the multiplicity of data sources available to them now and in the future. As early as 2004, the CEO of a major teaching hospital envisioned the real-time convergence of RFID data with patient data to yield a system that would set off an alarm at a nursing station when the potential existed for a patient to be exposed to a drug that would cause an adverse reaction [1]. However, one of the roadblocks CEOs commonly see in leveraging these data sources is an inability to integrate them with data they already have. This was true of the hospital CEO in 2004 and it is still true today. They often see deficiencies in both the technology and the expertise of their IT departments, which inhibit data acquisition and the management of data once it is acquired. While aware of the data available from many social media sources (Facebook, Twitter, etc.), they encounter barriers to integrating these data streams with their own customer data.

The good news is that they generally believe access to multiple data sources is increasing—the Big Data phenomenon—where the new sources include mobile device usage on a massive scale, social media, and the Internet of Things, and that these data sources can be integrated into, and will add significant value to, their business models. They are increasing their investments in Big Data projects and/or establishing new ones at a growing rate. So for them, it is now simply a matter of developing the ability to tap into the new sources of data, integrating them with what they already have, and enabling decision makers to leverage them. Nevertheless, the early Big Data integration leaders are now expressing a concern that many of their competitors are doing what they were doing at the start of the phenomenon, and they now need to find new ways to secure a competitive advantage [2]. Therefore, leveraging multiple data sources to make decisions and support business users in real time will likely become the new business intelligence objective. In-memory computing technology can fill this need.
1. Source: "Inescapable Data – Harnessing the Power of Convergence"
2. Source: http://sloanreview.mit.edu/projects/analytics-mandate/
GridGain’s In-Memory Data Fabric
The GridGain In-Memory Data Fabric is comprehensive in-memory software that enables high-performance transaction processing, real-time streaming, and analytics in a single, highly scalable data access and processing layer. It is designed to support both existing and new applications in a distributed, massively parallel processing environment composed of commodity hardware.

Figure 1. GridGain Data Fabric (Source: GridGain Systems)

The GridGain In-Memory Data Fabric accesses and processes data from distributed enterprise and cloud-based data stores orders of magnitude faster than traditional BI systems. There are four major aspects of the Data Fabric:

Data Grid
The Data Grid allows GridGain to collocate computations with data, which reduces latency, and it stores all data both in memory and on disk rather than using disk alone as primary storage. The Data Grid employs a memory-first, disk-second approach in which memory is utilized as the primary storage for computation and disk as secondary storage for data protection and persistence. The Data Grid scales horizontally in capacity by adding nodes on demand without disruption; scaling to hundreds of nodes is possible.

The Data Grid supports local, replicated, and partitioned data sets and allows these data sets to be freely cross-queried using standard SQL syntax. No data movement is required, allowing IT administrators to assure business application users that they are basing decisions on a "single source of the truth."
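The sketch below illustrates the two Data Grid capabilities described above, again using the Apache Ignite Java API. The cache names, value classes, and the join itself are hypothetical; the intent is only to show collocated computation (sending work to the node that owns a key) and a standard SQL query spanning cached data sets.

    import java.util.List;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.SqlFieldsQuery;

    public class DataGridSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                IgniteCache<Long, String> customers = ignite.getOrCreateCache("customers");
                customers.put(42L, "Acme Corp");

                // Collocated computation: the closure runs on the node that holds
                // key 42, so the data is read from local memory instead of moving
                // across the network.
                ignite.compute().affinityRun("customers", 42L, () -> {
                    IgniteCache<Long, String> local = Ignition.localIgnite().cache("customers");
                    System.out.println("Processed locally: " + local.localPeek(42L));
                });

                // Cross-query with standard SQL. The tables and columns here are
                // hypothetical; they assume value classes annotated for SQL indexing
                // (e.g., with @QuerySqlField) were configured on the caches.
                SqlFieldsQuery qry = new SqlFieldsQuery(
                    "SELECT c.name, o.total FROM \"customers\".Customer c " +
                    "JOIN \"orders\".\"Order\" o ON c.id = o.customerId");
                List<List<?>> rows = customers.query(qry).getAll();
                rows.forEach(System.out::println);
            }
        }
    }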
Clustering
GridGain's clustering is based on technology that can connect and manage a heterogeneous set of computing devices. Data consistency across nodes is maintained for clusters scaling to hundreds and even thousands of nodes. A Zero Deployment feature removes the need to deploy GridGain software components individually to each node: all software, together with its resources, is deployed across the cluster automatically. The ability to automatically recover from a cluster node failure is also included.
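As a small illustration of working with the cluster programmatically, the following sketch (Apache Ignite Java API) inspects cluster topology and reacts to a node failure. The event types shown are standard Ignite constants, but enabling them in the configuration and printing alerts, as done here, is an assumption about how a deployment might choose to monitor the cluster rather than a required setup.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.events.DiscoveryEvent;
    import org.apache.ignite.events.EventType;

    public class ClusterSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Discovery events are disabled by default; record the ones we care about.
            cfg.setIncludeEventTypes(EventType.EVT_NODE_JOINED, EventType.EVT_NODE_LEFT,
                EventType.EVT_NODE_FAILED);

            try (Ignite ignite = Ignition.start(cfg)) {
                // Current topology: every node the grid currently knows about.
                System.out.println("Nodes in cluster: " + ignite.cluster().nodes().size());

                // React locally whenever a node drops out, e.g. to trigger an alert.
                ignite.events().localListen(evt -> {
                    DiscoveryEvent de = (DiscoveryEvent) evt;
                    System.out.println("Topology change: " + de.name() + " " + de.eventNode().id());
                    return true; // keep listening
                }, EventType.EVT_NODE_LEFT, EventType.EVT_NODE_FAILED);
            }
        }
    }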
Real-time, in-memory data stream processing provides both event workflow and Complex Event Processing (CEP) capabilities that are integrated with the Data Fabric. Data is queried in real time, as the cluster encounters it. GridGain implements sliding event processing windows (see Figure 2) that can be limited by size or by time. Event windows can also be defined by individual events or processed in batches, and they can be sorted and snapshotted for sharing and data protection.

Figure 2: The Sliding Event Window in GridGain's Real-time Streaming

To preserve data integrity for mission-critical information, survive node crashes, and ensure that all event-related data remains intact and consistent, GridGain allows streaming events to be stored in the Data Grid, avoiding disruptions in real-time processing.
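A minimal sketch of this streaming pattern, using the Apache Ignite Java API, is shown below. It approximates a time-bounded sliding window with a cache expiry policy and uses a continuous query as a simple CEP-style listener; the cache name, the ten-second window, the threshold, and the "sensor reading" events are all illustrative assumptions rather than a prescribed GridGain configuration.

    import java.util.concurrent.TimeUnit;
    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.IgniteDataStreamer;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.ContinuousQuery;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class StreamingSketch {
        public static void main(String[] args) {
            try (Ignite ignite = Ignition.start()) {
                // Entries expire ten seconds after creation, approximating a
                // time-based sliding window held entirely in memory.
                CacheConfiguration<Long, Double> cfg = new CacheConfiguration<>("readings");
                cfg.setExpiryPolicyFactory(
                    CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 10)));
                IgniteCache<Long, Double> window = ignite.getOrCreateCache(cfg);

                // CEP-style listener: react to each event as it enters the window.
                ContinuousQuery<Long, Double> qry = new ContinuousQuery<>();
                qry.setLocalListener(events -> events.forEach(e -> {
                    if (e.getValue() > 100.0)
                        System.out.println("Threshold exceeded: " + e.getValue());
                }));
                window.query(qry);

                // Ingest events into the window at high rate via a data streamer.
                try (IgniteDataStreamer<Long, Double> streamer = ignite.dataStreamer("readings")) {
                    streamer.allowOverwrite(true); // use normal puts so listeners fire
                    for (long i = 0; i < 1_000; i++)
                        streamer.addData(i, Math.random() * 120.0);
                }
            }
        }
    }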
The Hadoop acceleration included in the GridGain In-Memory Data Fabric features the GridGain In-Memory File System (GGFS). It has been designed to work in dual mode: either as a standalone primary file system in the Hadoop cluster, or in tandem with HDFS, serving as an intelligent caching layer with HDFS configured as the primary file system (a brief client-side sketch appears after the licensing options below).

GridGain's software can now be acquired in two ways:

1. GridGain's fully functional Data Fabric source code was recently released under the Apache 2.0 license and accepted into the Apache Incubator program under the name Apache Ignite.
2. An Enterprise version of the software, which offers increased resilience, security, and manageability compared with the Apache version, can be licensed from GridGain.
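To illustrate the caching-layer mode mentioned above, the sketch below shows a Hadoop client reading and writing through an Ignite/GGFS-backed file system using the standard Hadoop FileSystem API. The URI, file paths, and property and class names are assumptions for illustration only; the exact scheme and implementation class should be taken from the GridGain or Apache Ignite Hadoop Accelerator documentation for the version in use.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HadoopAccelerationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed property/class names for the in-memory file system driver;
            // verify against the accelerator documentation for your release.
            conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");

            // Existing MapReduce jobs keep using the FileSystem API unchanged;
            // only the URI (and core-site.xml in a real deployment) points at
            // the in-memory file system instead of hdfs://.
            FileSystem fs = FileSystem.get(URI.create("igfs://igfs@localhost/"), conf);

            Path path = new Path("/warehouse/events/part-0000");
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("sample event record"); // write lands in memory first
            }
            System.out.println("Exists in the in-memory file system: " + fs.exists(path));
        }
    }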
Business Impact

We have seen that, once a Big Data analytics application such as one enabled by the GridGain In-Memory Data Fabric successfully goes into production, others quickly follow. The reason is that when business users realize value from being able to converge and analyze large volumes of different types of data (database transactions, click streams from web sources, text from emails, GPS data from mobile devices, etc.), they want to apply this power in other ways to generate more revenue and/or solve other business problems.

We have seen retailers start with an application that reveals customer buying decision patterns through an analysis of Twitter and Facebook data converged with their own individual customer transaction data (a process commonly known as "Sentiment Analysis"). Success here leads to applications that optimize retail inventory to enhance customer satisfaction, increase net income through supply chain and pricing optimization, and decrease inventory shrinkage via video data analysis. Similarly, we have seen financial services firms begin with an application that dramatically reduces credit card fraud in real time by stopping a fraudulent transaction in progress, and then progress to applications that deliver personalized investment information to their customers via mobile devices in real time.

The ability to converge different data sources in real time allows business executives to advance their objectives in two ways. First, they can take automated information processes they already have and enhance them by adding more relevant data sources and accelerating the delivery of information to decision makers. Second, they can dream up and implement completely new business models and processes based on applications that would not be possible without real-time data convergence.

Evaluator Group Assessment:
As a once highly visible national leader quipped, "There are known knowns, known unknowns, and unknown unknowns." Many CEOs are aware that there are times when they feel they don't know what they don't know. It is not just a matter of getting answers to questions that are validated by hard data; rather, they don't know the right questions to ask of the data in the first place. The power of real-time, information-based decision making is realized when a business executive can analyze data in real time and see patterns emerge that weren't expected. For example, a financial services analyst may see new and unexpected patterns emerge from market data that could represent a risk to the firm's positions. Further querying can then lead to taking actions that avoid the risk in the shortest time possible. Similarly, a credit card services company could see potentially fraudulent card usage patterns emerge from transaction data as the transactions occur. Further analysis could then lead to the immediate implementation of preventative measures.
CEOs are generally aware that their business environments are increasingly data driven; in fact, they have been for years. What CEOs have generally lacked is the system support. Real-time data analytics enabled by an affordable in-memory computing technology such as GridGain's can now help them realize their visions. GridGain's In-Memory Data Fabric can be deployed on general-purpose commodity hardware and can integrate data from multiple internal and external sources. In addition, the ability to analyze streaming data in real time enables a degree of business control not possible with traditional business intelligence systems.

About Evaluator Group
Evaluator Group Inc. is dedicated to helping IT professionals and vendors create and implement strategies that make the most of the value of their storage and digital information. Evaluator Group services deliver in-depth, unbiased analysis on storage architectures, infrastructures and management for IT professionals. Since 1997 Evaluator Group has provided services for thousands of end users and vendor professionals through product and market evaluations, competitive analysis and education.

www.evaluatorgroup.com
Follow us on Twitter @evaluator_group

Copyright 2015 Evaluator Group, Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or stored in a database or retrieval system for any purpose without the express written consent of Evaluator Group Inc. The information contained in this document is subject to change without notice. Evaluator Group assumes no responsibility for errors or omissions. Evaluator Group makes no expressed or implied warranties in this document relating to the use or operation of the products described herein. In no event shall Evaluator Group be liable for any indirect, special, consequential or incidental damages arising out of or associated with any aspect of this publication, even if advised of the possibility of such damages. The Evaluator Series is a trademark of Evaluator Group, Inc. All other trademarks are the property of their respective companies.