TDWI CHECKLIST REPORT

Seven Steps to Faster Analytics Processing with Open Source
Realizing Business Value from the Hadoop Ecosystem

By David Stodder

DECEMBER 2015

Sponsored by Cloudera, DataTorrent, Platfora, and Talend
TABLE OF CONTENTS

FOREWORD

NUMBER ONE
Take advantage of open source innovations to push analytics to the next level

NUMBER TWO
Overcome limitations of slow, hard-to-develop batch processing

NUMBER THREE
Gain business value from open source stream processing and stream analytics

NUMBER FOUR
Choose the right strategy for supporting interactive access to Hadoop data

NUMBER FIVE
Evaluate technologies for improving data integration and preparation for big data analytics

NUMBER SIX
Accelerate the business impact of processing and analytics with open standard technologies

NUMBER SEVEN
Balance the value of unified architecture with the benefits of best-of-breed innovation

ABOUT OUR SPONSORS
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS
555 S Renton Village Place, Ste. 700
Renton, WA 98057-3295
T 425.277.9126
F 425.687.2842
E [email protected]
tdwi.org
© 2015 by TDWI (The Data Warehousing Institute™), a division of 1105 Media, Inc. All rights
reserved. Reproductions in whole or in part are prohibited except by written permission. E-mail
requests or feedback to [email protected]. Product and company names mentioned herein may be
trademarks and/or registered trademarks of their respective companies.
FOREWORD
Excellence in analytics is a competitive advantage in nearly all
industries. Technology trends are moving in a positive direction for
organizations seeking to expand the business impact of analytics.
New technologies can help organizations democratize the analytics
experience so that more managers, operational employees, and other
users can engage in faster data-driven decision making.
On the front end, visual analytics and data discovery tools are
enabling users to move beyond limited and static data views typical
of traditional business intelligence (BI) reporting and spreadsheet
applications. Nontechnical users and developers, data engineers,
and data scientists working with advanced tools and techniques are
pushing to get past traditional BI and data warehousing barriers
to access and to interact with a broader range of data, including
real-time data streams. They have little time to wait for long extract,
transform, and load (ETL) or other preparation processes to complete
before they can touch the data. Their urgency is driving innovation
toward faster, easier, and more flexible data integration, preparation,
and processing.
Open source projects are spawning many key innovations. The Hadoop
ecosystem, which developed out of a series of ongoing Apache
Software Foundation projects, is maturing and expanding. Hadoop
ecosystem technologies could supplement or supplant many traditional
BI and data warehousing technologies and practices that may have
worked for classic BI querying and reporting but struggle in the brave
new world of highly iterative analytics, where questions often lead to
follow-up questions as part of data discovery. Old ways also struggle
when business-critical analytics processes require ready access to big
data and need better performance and flexibility to support the variety
of analytic models.
Organizations should evaluate Hadoop ecosystem technologies for
how they can contribute to giving users easier, more interactive,
and more integrated experiences with data. They should examine
how open source technologies and frameworks can reduce delays
in preparing and processing data for users, developers, and data
scientists who are seeking to employ advanced analytics. This
Checklist will discuss seven key considerations to help organizations
focus their evaluation and develop a strategy for gaining value from
open source technologies to support faster, more powerful, and more
flexible analytics.
NUMBER ONE
TAKE ADVANTAGE OF OPEN SOURCE INNOVATIONS TO PUSH ANALYTICS TO THE NEXT LEVEL
Core Apache Hadoop, consisting of the Hadoop Distributed File System (HDFS) and MapReduce, was developed over a decade ago to enable analytics that was not possible—or at best, was cumbersome and expensive—with traditional relational data warehouses. Its developers needed software and data processing capabilities that would allow them to engage in deeper and more computationally intensive analytics with huge data sets drawn primarily from multistructured information on the Web and user search behavior. Hadoop has since been used by many companies to analyze data generated by customer behavior, to interpret sensor and machine data, and to address other use cases that could not easily be handled by traditional systems.
Hadoop was designed to take advantage of affordable, commodity
servers that could be clustered to support distributed, massively
parallel processing of large data sets. MapReduce, while not easy for
developers to use, enabled organizations to process data inside the
clusters themselves rather than dealing with the bottleneck of “moving
data to the code” (i.e., across networks to application servers or other
platforms for processing). HDFS could facilitate rapid data transfer
among the nodes in a cluster, ensure resiliency if a node failed, and
with MapReduce distribute the work across the nodes and assemble
(i.e., “reduce”) the results. A key benefit for analytics has been the
ability to run processes directly on integrated volumes of raw, highly
varied data, not just on a set of disparate samples or aggregates.
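To make this division of labor concrete, the following is a minimal sketch in plain Python of the map, shuffle, and reduce phases that Hadoop distributes across cluster nodes; the sample records and the word-count task are hypothetical stand-ins for the data blocks and logic that HDFS and MapReduce would handle at scale.

from collections import defaultdict

# Hypothetical records standing in for blocks that HDFS would
# spread across the nodes of a cluster.
records = ["error disk disk", "ok disk", "error cpu"]

# Map phase: each node emits (key, value) pairs from its local block.
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle phase: pairs are grouped by key across the cluster.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: each key's values are assembled into a final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'error': 2, 'disk': 3, 'ok': 1, 'cpu': 1}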
Starting with the initial related open source Apache projects, such
as Hive, HBase, and Zookeeper, a thriving Hadoop ecosystem has
grown up with tools and frameworks for different types of storage,
processing, data integration, resource management, security,
analytics, search, and data discovery. The ecosystem continues to
expand with the introduction of both new and revised technologies.
Hadoop 2.0’s YARN (Yet Another Resource Negotiator), for example,
provides a new layer between HDFS and MapReduce for better cluster
resource management and a flexible execution model for programming
frameworks other than MapReduce.
Organizations should ensure that their developers have current
knowledge about new and evolving open source concepts, frameworks,
and technologies that could contribute to objectives with analytics, BI,
and data discovery.
NUMBER TWO
OVERCOME LIMITATIONS OF SLOW, HARD-TO-DEVELOP BATCH PROCESSING
Many organizations today are focused on driving out latency—that is, slowness in gaining insight from analytics due to processing delays that have a detrimental impact on business performance. The ultimate zero latency is real-time processing; at the other end of the spectrum is batch processing. Traditionally, IT will schedule BI updates, analytic queries, and extract, transform, and load (ETL) jobs to run without manual intervention in off hours so that data- and compute-intensive processes are not competing with online applications. Batch is still often the best way of performing bulk updates, scans, ETL, and other activities against large data volumes where the aim is repeatable results drawn from unchanging data. However, batch windows are getting tighter.
Once batch processes are underway, developers and administrators
have to wait until they finish before they can give users access to
the resulting snapshot of historical data. No one can interact with
the data through iterative queries in a continuous fashion, which is
necessary for many kinds of analytics. Everyone has to wait for the
results of batch processes, and if they have further queries, wait
again for subsequent batch processes to finish.
HDFS and MapReduce provided new means of storing and
processing data but did not offer an alternative to batch
processing. Likewise, the Hive SQL interface to Hadoop also works
in batch (and until recently, only on MapReduce) to perform reads
and writes of historical, disk-based data. Major areas of focus
for the next generation of the Hadoop ecosystem have been to
enable organizations to perform faster batch processing, use
in-memory computing to allow more continuous data interaction
despite batch cycles, and support multi-step processing and
multi-pass computations and algorithms that do more work faster.
Apache Spark, Apache Apex, and other technologies support these
innovations and are gaining strength as flexible alternatives to
MapReduce that deliver more efficient batch processes.
Organizations should improve batch processing to increase end-user data access and satisfy analytic workloads. Developers should be encouraged to use newer techniques such as pipeline processing, which can speed data engineering because downstream stages can run in parallel on intermediate results rather than waiting for entire upstream processes to complete before their work begins.
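To illustrate, here is a minimal PySpark sketch of a multi-step batch job, assuming a working Spark installation; the HDFS path and log format are hypothetical. Because Spark plans the chained steps as a single job graph and can cache intermediate results in memory, the second pass does not re-read the data from disk as a second MapReduce job would.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-pipeline").getOrCreate()
sc = spark.sparkContext

# Hypothetical HDFS path; any line-oriented log would do.
lines = sc.textFile("hdfs:///logs/events.txt")

# Transformations only declare stages; nothing runs until an action.
errors = lines.filter(lambda line: "ERROR" in line)
errors.cache()  # keep the filtered set in memory for reuse

total = errors.count()  # first action triggers the pipeline
by_code = (errors.map(lambda line: (line.split()[1], 1))  # assumes a code in field 2
                 .reduceByKey(lambda a, b: a + b)
                 .collect())  # second action reuses the cached data

print(total, by_code)
spark.stop()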
NUMBER THREE
GAIN BUSINESS VALUE FROM OPEN SOURCE STREAM PROCESSING AND STREAM ANALYTICS

Many firms want to tap real-time data streams generated by such sources as e-commerce, social media, mobile devices, and Internet of Things (IoT) sources such as sensors and machine data. In many cases, the requirement is to stream data as it is generated straight into Hadoop files. However, other firms want to apply real-time "stream" analytics to "data in motion" before—or instead of—landing it in Hadoop or other system files so they can analyze and act on it automatically in real time.
Stream analytics is about applying statistical models, algorithms, or
other analytics practices to data that arrives continuously, often as
an unbounded sequence of instances. By running predictive models
on these flows, organizations can monitor events and processes and
become more situationally aware. Organizations can be proactive in
spotting trends, patterns, correlations, and anomalous events and
respond as they are happening. They can filter data for relevance
and enrich the quality of data flowing in real time with information
from their other sources.
The Hadoop ecosystem has been generating open source projects
relevant to stream processing and stream analytics. The Apache
Spark Streaming module, Apache Storm, and Apache Apex are aimed
at processing streams of real-time data for analytics. Apache Spark
is a general-purpose data processing engine with an extensible
application programming interface for different workloads, including
Spark Streaming for stream processing. Storm, though more mature,
may eventually give way to Heron, which Storm’s original developer,
Twitter, is implementing as a replacement. Apache Apex, developed
by DataTorrent and accepted in September 2015 as a project by
the Apache Incubator, unifies stream and batch processing. These
technologies can be integrated with Apache Kafka, a publish-and-subscribe messaging system that can serve data to streaming systems. Kafka is growing in popularity for meeting high-throughput and horizontal-scalability requirements.
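As a brief illustration, the sketch below consumes a Kafka topic with Spark Streaming's micro-batch (DStream) API, which is what the Spark releases current at this writing provide; the broker address, topic name, and anomaly filter are hypothetical, and the spark-streaming-kafka integration package is assumed to be on the classpath.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="stream-analytics")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Hypothetical Kafka broker and topic.
stream = KafkaUtils.createDirectStream(
    ssc, ["sensor-events"], {"metadata.broker.list": "broker:9092"})

# Filter data in motion for relevance and count anomalies per batch,
# acting on the stream before it ever lands in HDFS.
alerts = (stream.map(lambda kv: kv[1])          # keep the message value
                .filter(lambda msg: "ANOMALY" in msg)
                .count())
alerts.pprint()

ssc.start()
ssc.awaitTermination()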
Organizations should evaluate open source technologies that support
stream processing and analytics. Organizations engaged in tracking
and monitoring activities—or seeking to perform analytics on “fast”
data to react in real time—should make establishing a technology
strategy for stream processing and stream analytics a priority.
NUMBER FOUR
CHOOSE THE RIGHT STRATEGY FOR SUPPORTING
INTERACTIVE ACCESS TO HADOOP DATA
Armed with self-service BI and visual analytics tools, nontechnical
users are ready to join data scientists in reaching beyond data
warehousing systems to interact with data stored in Hadoop files.
Technology trends have begun to make this easier by offering a
variety of means to interact with Hadoop data, including directly
rather than by first loading the data into a relational database.
Interactivity typically includes the ability to send ad hoc SQL queries to the data via front-end tools, either through execution engines and intermediate databases or directly against Hadoop files.
Response time for interactive queries should be significantly
faster than with batch processes, keeping in mind that interactive
querying of Hadoop data is primarily for supporting discovery.
Users will most likely not be interacting with true real-time data;
thus, it is important for users to know the freshness (and quality)
of the data. A key issue to also examine is the degree to which the
technology solution supports standard ANSI SQL and how extensions
or customized functions are supported.
For developers, Hive was the first generation, offering functionality
that converted SQL into MapReduce jobs that would run in batch.
The Apache Tez project, built on YARN, is aimed at offering a better
execution engine for Hive (or Pig) than MapReduce offered. However,
an alternative trend is to integrate Hive (or other frameworks such
as Pig and Crunch) with Apache Spark. Existing Hive jobs can then
run on Spark, which handles the processing of data; developers are
able to choose their preferred execution engine. Another alternative
is the Spark SQL module, which exposes Spark-only data sets via
JDBC to BI and visual analytics tools.
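As a simple sketch of this kind of interactivity, the following PySpark snippet registers Hadoop-resident Parquet files as a table and runs an ad hoc SQL query against them; the path, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("interactive-sql").getOrCreate()

# Hypothetical Parquet files in HDFS; ORC or JSON would work similarly.
events = spark.read.parquet("hdfs:///warehouse/events")
events.createOrReplaceTempView("events")

# Ad hoc ANSI-style SQL runs directly against the Hadoop data,
# with no load into an intermediate relational database.
spark.sql("""
    SELECT product_id, COUNT(*) AS views
    FROM events
    WHERE event_type = 'view'
    GROUP BY product_id
    ORDER BY views DESC
    LIMIT 10
""").show()
spark.stop()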
SQL-on-Hadoop is a different option for faster, near-real-time
interactivity and for situations that demand high concurrency. This
approach allows developers—or users through front-end tools—to
query Hadoop data directly through the SQL-on-Hadoop software’s
own parallel query execution engine. Cloudera Impala, Apache Drill,
Presto, and other technologies are implementing this approach.
Open source technologies are maturing to enable interactive
querying and iterative data exploration from BI and visual analytics
tools. Organizations should evaluate options given their current and
anticipated user demand.
NUMBER FIVE
EVALUATE TECHNOLOGIES FOR IMPROVING DATA
INTEGRATION AND PREPARATION FOR BIG DATA
ANALYTICS
Data integration, ETL, and other data preparation steps can be time-consuming parts of BI and analytics projects. Often because of poor
data quality, business analysts and data scientists have to spend
the majority of their time manually preparing data. Nontechnical
users of visual analytics and data discovery tools are also frustrated
because they do not have the technology or knowledge to complete
data preparation steps on their own. The problem is growing acute
in the age of big data as organizations amass volumes of multistructured data, often in Hadoop files, and want to blend it with
traditional data to gain comprehensive views.
Open source technologies are generating new ways of addressing
data integration, ETL, and data preparation. With so much data
being stored and processed in Hadoop files—now joined by data
sets processed by Spark, Apex, and other technologies—it follows
that many organizations would seek to move tasks close to where
the data is stored and processed. Being able to run processing and
analytics on the same data in the same platform can eliminate data
movement, cut down metadata confusion, and help lower overall
costs. Talend, for example, can run natively in Hadoop and has
introduced a data integration platform built on Spark. Technologies
such as Platfora are able to use Spark to drive and consolidate data
preparation and transformation in support of a range of analytics.
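A minimal sketch of preparation running where the data lives, written in PySpark with hypothetical paths and column names (commercial tools such as those above generate or manage comparable logic rather than requiring it to be hand coded):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data-prep").getOrCreate()

# Hypothetical sources: raw clickstream JSON in HDFS and a customer
# dimension exported from a traditional warehouse.
clicks = spark.read.json("hdfs:///raw/clickstream")
customers = spark.read.parquet("hdfs:///warehouse/customers")

# Clean, standardize, and blend in place, with no movement of the
# raw data off the cluster.
prepared = (clicks
            .filter(F.col("customer_id").isNotNull())    # drop malformed rows
            .withColumn("event_date", F.to_date("event_ts"))
            .join(customers, "customer_id", "left"))

prepared.write.mode("overwrite").parquet("hdfs:///prepared/clicks_enriched")
spark.stop()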
As organizations bring in big data from heterogeneous and
sometimes unstable edge sites, they must ensure a continuous
data flow. The incubating Apache project NiFi, originally developed
by the National Security Agency, is focused on automating
the flow of data between external devices and the data center. NiFi
can enable firms to track pieces of data from entry until landing in
the data center. This can be important for data governance and for
understanding the data lineage behind analytics.
Newer technologies are beginning to embed advanced analytics for faster integration, transformation, and enrichment of data from
Hadoop files as well as from traditional sources. Organizations
should evaluate how these technologies can help them reduce errors
and waste in data integration, transformation, and preparation
processes on big data.
NUMBER SIX
ACCELERATE THE BUSINESS IMPACT OF PROCESSING AND ANALYTICS WITH OPEN STANDARD TECHNOLOGIES
Open source technologies that follow open, accepted standards
are helping spread innovation in analytics. Development efforts
can be repeatable, reusable, and optimized with less custom work.
Standards are also important in easing integration with applications
and systems developed by IT with commercial technologies.
As advanced analytics spreads, it will be important to enable
developers and data scientists to work with flexible tools and
frameworks as well as those that follow open standards.
Advances in open source for stream analytics were discussed
previously. Another type of analytics spreading fast due to standards
is machine learning—a term coined in 1959 by Arthur Samuel, a
pioneer in artificial intelligence, to mean “a field of study that gives
computers the ability to learn without being explicitly programmed.”
Provided with data in either a supervised (by a human) or
unsupervised way, a machine can learn from examples. Machine
learning algorithms are a growing presence for discovering patterns,
predictive analytics, and more. Machines can use collaborative
filtering of customers’ past behavior, for example, to create
actionable predictive insights for automating the order in which
recommended products are presented.
There are several open source projects underway focused on
machine learning, some of which have been in existence for
some time, such as Shogun, which was created in 1999. Apache
Mahout, first released in 2008, is aimed at giving developers the
ability to create scalable machine learning applications primarily
for implementation on top of Hadoop and MapReduce but more
recently on Spark, which has a module called MLlib. Each has strengths as well as gaps in the types of use cases it can support and in how it works with R, Python, and other programming languages for analytics.
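For instance, the collaborative-filtering scenario described above can be sketched with MLlib's alternating least squares (ALS) algorithm; the ratings data below is hypothetical.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendations").getOrCreate()

# Hypothetical past behavior: (user, product, rating) triples.
ratings = spark.createDataFrame(
    [(1, 10, 5.0), (1, 20, 1.0), (2, 10, 4.0), (2, 30, 5.0), (3, 20, 2.0)],
    ["user", "product", "rating"])

# Learn latent factors from the examples without explicit rules.
als = ALS(userCol="user", itemCol="product", ratingCol="rating",
          rank=5, maxIter=10)
model = als.fit(ratings)

# Rank the top recommended products for each user.
model.recommendForAllUsers(3).show(truncate=False)
spark.stop()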
Online analytical processing (OLAP) could also benefit from open
source. Though not yet mature enough to be considered standards,
prominent projects include Druid, which can ingest data and events
in real time into a data store for exploration, time-series analysis,
and other OLAP processes. Apache incubator project Kylin, originally
developed at eBay, is aimed at providing a distributed analytics
engine that supports SQL and multidimensional OLAP on Hadoop.
Organizations should evaluate emerging advanced analytics and
OLAP technologies for their potential to give business functions a
competitive edge.
NUMBER SEVEN
BALANCE THE VALUE OF UNIFIED ARCHITECTURE WITH THE BENEFITS OF BEST-OF-BREED INNOVATION

Having an integrated and well-governed data architecture and set of technologies is important. The architecture must have a wide scope to include both on-premises and cloud systems, and it must knit together traditional BI and data warehouses with the Hadoop ecosystem. A unified architecture must also be integrated with business processes so that analytics can directly improve processes. However, as important as integration and unification are, these efforts must be balanced with the need to let developers, data scientists, and business users experiment with innovative technologies and choose the right tool.
Industry analysts describe unified environments as “logical” or
“hybrid” architectures. The notion is to have a cohesive platform
that enables multiple frameworks and workloads for users with
different skills as well as the range of potential use cases. Of
course, few if any organizations have reached such a level of unity
and have still been able to balance that unification with flexibility.
More typically, IT management establishes a centralized architecture
and then has to grapple with “shadow” systems run by business
units that threaten the unity. Some organizations shut out open
source technologies, particularly if used by shadow systems, out of
fear that they will exacerbate disunity.
Historically, the Hadoop ecosystem has indeed been a collection
of disparate tools and technologies. Hadoop 2.0 technologies are
beginning to steer the ecosystem toward greater integration. YARN,
for example, supports multiple types of processing from multiple
sources and can therefore enable best-of-breed technologies to
work within a unified framework. Organizations can choose different
analytics, data integration and preparation, data processing, and
storage technologies but still plug them into the ecosystem thanks
to YARN. However, most organizations still need to work with
vendors’ platforms to gain fuller integration and management of
open-source-based technologies.
Integration and unification are vital, in particular to support users
of self-service BI and visual analytics tools who are not seasoned
developers and do not have the knowledge, time, or interest in
getting their hands dirty with integrating technologies. Organizations
should evaluate vendors’ multifunction platforms and systems to
ensure they not only match requirements but that they also can be
updated as new technologies emerge.
ABOUT OUR SPONSORS
cloudera.com
Cloudera is revolutionizing enterprise data management by offering
the first unified Platform for Big Data, an enterprise data hub built
on Apache Hadoop. Cloudera offers enterprises one place to store,
process, and analyze all their data, empowering them to extend the
value of existing investments while enabling fundamental new ways
to derive value from their data.
Only Cloudera offers everything needed on a journey to an
enterprise data hub, including software for business-critical
data challenges such as storage, access, management, analysis,
security, and search. As the leading educator of Hadoop
professionals, Cloudera has trained over 22,000 individuals
worldwide. Over 1,000 partners and a seasoned professional
services team help deliver greater time to value.
Finally, only Cloudera provides proactive and predictive support to
run an enterprise data hub with confidence. Leading organizations
in every industry plus top public sector organizations globally run
Cloudera in production.
www.datatorrent.com

DataTorrent is the leader in real-time big data analytics and the creator and sponsor of Apache Apex, an open source, enterprise-grade native YARN big data platform that unifies stream processing and batch processing. Apex processes big data in motion in a highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable way. Apex provides a simple API that enables users to write or reuse generic Java code, thereby lowering the expertise needed to write big data applications. Apache Apex is currently in incubation status.
DataTorrent RTS, built on the foundation of Apache Apex, is the
industry’s only enterprise-grade open source, unified stream and
batch processing platform.
DataTorrent dtIngest simplifies the collection, aggregation, and
movement of large amounts of data to and from Hadoop for a more
efficient data processing pipeline and is available to organizations
for unlimited use at no cost.
DataTorrent is proven in production environments to reduce time
to market, development costs, and operational expenditures for
Fortune 100 and leading Internet companies. Based in Santa Clara,
California, DataTorrent is backed by leading investors including
August Capital, GE Ventures, Singtel Innov8, Morado Ventures, and
Yahoo co-founder Jerry Yang. For more information, visit our website
(www.datatorrent.com) or follow us on Twitter (www.twitter.com/
datatorrent).
ABOUT OUR SPONSORS
www.platfora.com
Platfora is the fastest, most powerful, flexible, and complete Big
Data Discovery platform built natively on Apache Hadoop and Spark.
Platfora enables business users and data scientists alike to visually
interact with petabyte-scale data in seconds, allowing them to
work with even the rawest forms of data. The latest update to the
platform provides expanded support for SQL, Excel, and Apache
Spark, creating a more open workflow that lets users seamlessly
connect to the most popular data tools. Platfora’s next-generation
data prep provides instant statistics and sample data to better
guide users toward smart, customized data-driven decisions and
facilitates more intelligent, iterative investigations. With Platfora,
data insights can be shared across the organization without silos,
driving collaboration on even the most complex data queries.
Platfora is transforming the way businesses unlock insights from
big data to achieve more meaningful outcomes through the use of
its industry-defining Customer Analytics, Security Analytics, and
Internet of Things solutions. Data-driven organizations use Platfora
Big Data Discovery to tightly integrate analytics workflows with the
most in-demand features, including advanced visualizations, self-service data preparation, UI and data transforms, drag-and-drop
data sets, and machine learning.
Talend.com

At Talend, it's our mission to connect the data-driven enterprise, so our customers can operate in real time with new insight about their customers, markets, and business.
Founded in 2006, our global team of integration experts builds
on open source innovation to create enterprise-ready solutions
that help unlock business value more quickly. By design, Talend
integration software simplifies the development process, reduces
the learning curve, and decreases total cost of ownership with a
unified, open, and predictable platform. Through native support
of modern big data platforms, Talend takes the complexity out of
integration efforts.
More than 1,700 enterprise customers worldwide rely on Talend’s
solutions and services. Privately held and headquartered in Redwood
City, California, the company has offices in North America, Europe,
and Asia, along with a global network of technical and services
partners. For more information, please visit www.talend.com and
follow us on Twitter: @Talend.
ABOUT THE AUTHOR
David Stodder is director of TDWI Research for business intelligence.
He focuses on providing research-based insight and best practices
for organizations implementing business intelligence (BI), analytics,
performance management, data discovery, data visualization, and
related technologies and methods. He is the author of TDWI Best
Practices Reports on mobile BI and customer analytics in the age of
social media, as well as TDWI Checklist Reports on data discovery
and information management. He has chaired TDWI conferences on BI
agility and big data analytics. Stodder has provided thought leadership
on BI, information management, and IT management for over two
decades. He has served as vice president and research director with
Ventana Research, and he was the founding chief editor of Intelligent
Enterprise, where he served as editorial director for nine years. You can
reach him by e-mail ([email protected]), on Twitter (www.twitter.com/
dbstodder), and on LinkedIn (www.linkedin.com/in/davidstodder).
ABOUT TDWI RESEARCH
TDWI Research provides research and advice for data professionals
worldwide. TDWI Research focuses exclusively on business
intelligence, data warehousing, and analytics issues and teams
up with industry thought leaders and practitioners to deliver both
broad and deep understanding of the business and technical
challenges surrounding the deployment and use of business
intelligence, data warehousing, and analytics solutions. TDWI
Research offers in-depth research reports, commentary, inquiry
services, and topical conferences as well as strategic planning
services to user and vendor organizations.
ABOUT TDWI CHECKLIST REPORTS
TDWI Checklist Reports provide an overview of success factors for
a specific project in business intelligence, data warehousing, or
a related data management discipline. Companies may use this
overview to get organized before beginning a project or to identify
goals and areas of improvement for current projects.