IBM - Dama-NY

Cliff Candiotti – [email protected]
May 2017
The Hybrid Data Integration Platform
Integrating Structured and Unstructured Data
© 2017 IBM Corporation
Disclaimer:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal
without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product
direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our
products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a
controlled environment. The actual throughput or performance that any user will experience will
vary depending upon many factors, including considerations such as the amount of
multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and
the workload processed. Therefore, no assurance can be given that an individual user will achieve
results similar to those stated here.
Legal Disclaimer
• © IBM Corporation 2017. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment
to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs
and performance characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM
Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server).
Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your
presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in
your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International
Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta
Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.
Users Need Data that is Accurate, Reliable and Trustworthy
[Figure: data users (Data Engineer, Data Scientist, Developer, Business Analyst, Data Steward) drawing on data sources (Public, Enterprise Data, Web/Mobile Data, Social, IoT)]
Businesses are challenged to answer key questions about the
integrity of their data:
• Where is my data?
• Who has access to this data?
• Am I protecting sensitive data?
• Do I have the right data and context?
• How do I move and transform complex datasets?
• Do I understand the risk of using this data “as is”?
• How do I make data more readily available to my consumers?
Data Integration And Governance Are Key
• Integrate: Discover, integrate and transform data from all types of sources
• Trust: Establish an accurate single view of data from various systems
• Govern: Put data in context and mitigate risk with unified governance capabilities
[Figure: roles served: Data Steward, Business Analyst, Data Scientist, Data Engineer]
Information Empowerment for your Data Ecosystem
.. powered by Information Server
InfoSphere Information Server
Integrating and transforming data and content to deliver
accurate, consistent, timely and complete information on a
single platform unified by a common metadata layer
Data Governance: Understand & Collaborate
• Catalog technical metadata & align w/ business language
• Manage (big) data lineage
• Compliance reporting
Data Quality: Cleanse & Monitor
• Analyze & validate with enhanced classification
• Cleanse & standardize
• Define, manage & monitor data rules + exceptions
Data Integration: Transform & Deliver
• Massive scalability
• Power for any complexity
• Deliver in batch and/or real time with change capture
• common connectivity • shared metadata • security (data privacy functions included)
• common execution engine with flexible deployments (native on YARN)
IBM’s Hybrid Data Integration Platform - Scalability
Performance
Runtime engine providing unlimited scalability across all objects and tasks
in batch/real-time, ETL/ELT/DV/SOA
• Maximize resource utilization with “Anywhere” execution
• Optimize your integration/transformation and data quality workload based on
data locality and resource availability
• Design your integration, data preparation or cleansing once and run it on your
Hadoop cluster, on your traditional engine, or optimized to run in your
database
• Massive scalability needs an MPP shared-nothing architecture
  • Dynamic: instantly get better performance as hardware resources are added
  • Extendable: add new compute nodes to dynamically scale out
  • Partitioned: no contention or upper limitation on throughput
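The partitioning idea behind a shared-nothing design can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the Information Server engine is implemented: threads stand in for compute nodes, and the function names (`partition`, `aggregate_partition`, `parallel_totals`) and the sample rows are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition(rows, key, n_partitions):
    """Hash-partition rows so each worker owns a disjoint set of key values."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n_partitions
        parts[idx].append(row)
    return parts


def aggregate_partition(rows, key, value):
    """Runs independently per partition: no shared state and no locks."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def parallel_totals(rows, key, value, n_partitions):
    """Partition, aggregate each partition in its own worker, then merge."""
    parts = partition(rows, key, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(lambda p: aggregate_partition(p, key, value), parts))
    merged = {}
    for partial in results:
        # A key lives in exactly one partition, so results never overlap
        merged.update(partial)
    return merged


# Hypothetical sample data
rows = [
    {"account": "A", "amount": 10},
    {"account": "B", "amount": 7},
    {"account": "A", "amount": 5},
    {"account": "C", "amount": 1},
]
totals = parallel_totals(rows, "account", "amount", 4)
```

Because every key hashes to exactly one partition, changing `n_partitions` changes only how the work is spread, not the result, which is the property that lets a partitioned engine scale out by adding nodes.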
• Proven to scale to large volumes
Customer examples (Global Bank, Data Services Co, Global Retailer, Health Care):
• Process 500,000 tps with complex transformation and guaranteed delivery
• Information Server powered grid processing 80+ trillion records each month
• Daily processing of one trillion rows of data
• 200,000 programs built in Information Server on a grid/cluster of low-cost commodity hardware
IBM’s Hybrid Data Integration Platform – Capabilities
Transformation
Extensive set of pre-built objects that act on data to satisfy both
simple & complex data integration tasks
• Transformation Features for Big Data
  • Integration of Quality & Transformation components
  • Leverages the power of Hadoop or a parallel DBMS engine for any transformation execution
  • Easily extensible library of pre-built logic constructs for:
    • Simple & complex integration tasks
    • Hierarchical and relational transformations
    • Warehouse-specific features
IBM’s Hybrid Data Integration Platform – Extensibility
Connectivity
Native access to common industry databases and applications
for both structured and unstructured sources and targets
• Connectivity Features for Big Data
  • Scalable Hadoop File System connectivity to read and write to Hadoop in parallel
  • Direct feed to InfoSphere Streams for real-time analytical processing
  • Open Source Accelerators for other Big Data/NoSQL stores… Hive, MongoDB, Cassandra, Cloudant, CouchDB
• Sources & Targets for the entire Enterprise
  • High-performance native support for DBMS (DB2, NZ, Oracle, Teradata, SQL Server, etc…)
  • Access to z/OS sources (DB2 z/OS, VSAM, ISAM, etc…)
  • Specific ERP connectors
  • Flexible connectivity via ODBC & JDBC
  • Flat & hierarchical formats (Flat, COBOL, XML, native Excel)
  • Real time: SOA, Stream, CDC
  • Out-of-the-box connectivity to ERP systems
  • Extensible: Cloud, Java, External Source/Targets, etc…
IBM’s Hybrid Data Integration Platform - Accountability
Built-in Governance
Maximizes business & IT collaboration, providing business terms, policies,
end-to-end data lineage, advanced impact analysis, etc.
• Information Server is the only platform with true built-in enterprise-level governance
• Common metadata paradigm for all enterprise data
  • Comprehensive and extensible
  • Data lineage and impact analysis for any activity (including Hadoop workloads)
  • Metadata support for HDFS and NoSQL data stores
  • Connected & linked semantics
  • Controlled & auditable from any source
  • Rich representation (business, design, technical, operational metadata)
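Impact analysis over lineage metadata is, at its core, a graph walk. The sketch below is a minimal illustration under stated assumptions and does not use any Information Server or Information Governance Catalog API; the `LINEAGE` map, the asset names, and the `impacted_assets` function are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets that read from it.
LINEAGE = {
    "crm.customers": ["stage.customers_clean"],
    "stage.customers_clean": ["dw.dim_customer"],
    "hdfs:/landing/orders": ["dw.fact_orders"],
    "dw.dim_customer": ["report.churn", "report.revenue"],
    "dw.fact_orders": ["report.revenue"],
}


def impacted_assets(lineage, changed):
    """Breadth-first walk downstream from `changed`.

    Returns a sorted list of every asset that directly or indirectly
    consumes the changed asset.
    """
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


downstream = impacted_assets(LINEAGE, "crm.customers")
```

Reversing the edges and running the same walk gives upstream lineage (where did this report's data come from) instead of downstream impact.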
IBM’s Hybrid Data Integration Platform – Trusted Data
Integrated Data Quality
Single user experience for data integration as well as designing & running data validation,
standardization & matching rules
• Discover: Discovery of business entities across heterogeneous sources
• Assess: Data classification & validation rules linked to business rules for impact analysis
• Cleanse: Business-driven data standardization and matching
• Validate: Rule-based data validation to ensure complete and consistent data
• Monitor & Remediate: Enterprise-wide DQ exception monitoring and collaborative remediation
• Life Cycle Governance: Ownership and management of policies and rules
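Rule-based validation with exception capture can be sketched as follows. This is an illustration only, assuming simple dictionary records with ISO-formatted date strings; the rule names, the `RULES` table, and the `validate` function are hypothetical and are not the product's rule syntax.

```python
def not_empty(field):
    """Build a completeness rule: the field must be present and non-blank."""
    return lambda rec: rec.get(field) not in (None, "")


def country_is_iso2(rec):
    """Format rule: country must be a two-letter upper-case code."""
    country = rec.get("country")
    return isinstance(country, str) and len(country) == 2 and country.isupper()


def ship_not_before_order(rec):
    """Consistency rule: ISO date strings compare correctly as text."""
    ship, order = rec.get("ship_date"), rec.get("order_date")
    return ship is None or order is None or ship >= order


# Hypothetical rule set: rule name -> predicate returning True when the record passes
RULES = {
    "customer_id_present": not_empty("customer_id"),
    "country_is_iso2": country_is_iso2,
    "ship_not_before_order": ship_not_before_order,
}


def validate(records, rules):
    """Split records into those that pass every rule and an exception list."""
    passed, exceptions = [], []
    for rec in records:
        failed = [name for name, rule in rules.items() if not rule(rec)]
        if failed:
            exceptions.append({"record": rec, "failed_rules": failed})
        else:
            passed.append(rec)
    return passed, exceptions


# Hypothetical sample records
records = [
    {"customer_id": "C1", "country": "US", "order_date": "2017-05-01", "ship_date": "2017-05-03"},
    {"customer_id": "", "country": "usa", "order_date": "2017-05-02", "ship_date": "2017-05-01"},
]
passed, exceptions = validate(records, RULES)
```

Keeping the failed rule names with each rejected record is what makes the later monitoring and remediation steps possible: a steward can see which rule failed rather than only that a record was rejected.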
What about tomorrow, next month, next year?
Is more technology needed to get value from data?
Have the Right Data at the Right Time
• Users don't have access to the right data at the right time
  • Creates a strain on IT to meet LOB needs
  • More time integrating and cleansing
  • Less time interpreting and analyzing
  • Prevents timely and accurate decisions
• Self-service access and capabilities
  • Provides data users with the necessary access to manage data on their terms
  • Empowers both IT and LOB to focus on high-priority projects
Anyone can do it
• Zero code
• Zero configuration
• Data-driven or fully guided, flow-driven
• Semi-technical person in a business environment
• Friendly, modern-style graphical design
• Instant impact view
• Closely guided design
• Skilled integration practitioners
• Graphical assist, but full code environment
• Hybrid deployment
• Connect to anything
[Figure: persona spectrum grouped as Automator, Integrator and Developer, with roles such as Business Analyst, Data Analyst, Data Scientist, Shadow Integrator, Data Engineer, and specialist and developer roles (API, Integration, Full Stack, Front End). Design styles shown: data-driven / fully guided self-service; graphical flow design with full control; service & API orchestration]
Business Agility demands Expansion of Integration User Community
Data Preparation & Curation
WHO:
• Business users / Data owners, non-technical users
WHAT:
• Visual data shaping/curation
• WYSIWYG
• Closely guided and controlled (shop for data paradigm)
• Manipulation of 1-2 data sets at a time
Self-Service Integration
WHO:
• Shadow IT, LOB users, Data Scientist, semi-technical
WHAT:
• Combined visual & flow-based design
• Template / pattern approach
• Zero configuration
• Implicit validation
• Collaboration
Enterprise-class Integration
WHO:
• Integration Specialist, Integration Developer, highly technical
WHAT:
• Comprehensive library of integration, transformation & quality operations
• Support for comprehensive integration flows and projects
• Expandability for custom operations
• Full control for configuration & parameterization
• Top-down or bottom-up design approach
• Support for team development process
Bluemix Data Connect
Data Integration through Cloud Services
Data Connect will provide the self-service data preparation, integration and
governance for the Watson Data Platform
IBM’s Hosted Data Integration Solutions
Now available as hosted services, covering Data Integration, Data Quality and Data Governance:
• Information Server on Cloud Enterprise Edition
• DataStage on Cloud
• DataStage on Cloud Designer Client
• Information Server on Cloud Data Quality
• Information Governance Catalog on Cloud
IBM’s cloud-first strategy supports hybrid environments
and supports customers in their migration to the cloud
Cloud-First Statement of Direction and Design Principles
Cloud-native: Competitive cloud-native fully managed services…
• Bluemix Data Connect
• Bluemix Lift
• Bluemix Data Connect (Canvas)
• Butterfly (Beyond MDM)
• ILG.Next (Cosmos)
Hosted Cloud: Convenience without compromising power and control…
• DataStage on Cloud
• DataStage Designer Client on Cloud
• Information Server on Cloud Data Quality
• Information Server on Cloud Enterprise Edition
• Information Governance Catalog on Cloud
Core: Retain our market leadership and support our customers…
• Information Server (DataStage, QualityStage, Information Analyzer, Info Governance Catalog)
• Data Replication
• Master Data Management
• StoredIQ
• StoredIQ for Legal
Information Server / Data Connect Hybrid Journey
Utilizing the Best from both sides!
Bridge &
Combine
Data
Connect
Converge
Cognitive Integration Design
Next Gen DataStage Designer
• What:
  • ZERO install: browser-based design
  • ZERO migration: view existing jobs/projects in the new designer
  • Ability to use new & old Designer side by side
  • New simplified design experience without compromising capabilities
• Who:
  • Phase 1: New integration experience for Integration Specialists / Integration Developers
  • Phase 2/3: Self-service integration and preparation for Business and LOB users
Maximize your IT resource utilization through hybrid
execution
• Optimize your integration workload based on data locality and resource availability
• DataStage already enables you to design your transformation once and run it on the PX Engine, a Hadoop cluster, or a database
• Bluemix Data Connect provides a new web-based self-service designer with a code-generation framework to support similar runtime targeting
[Figure: DataStage/QualityStage Designer and Data Connect UI feeding “Execute Anywhere” targets: Databases, PX ETL Engine, Spark as a Service / Local]
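The "design once, run anywhere" idea can be illustrated with a toy flow description that is either compiled to SQL and pushed into a database or executed by a local engine. This is a sketch under stated assumptions and not the DataStage or Data Connect code-generation framework: the `FLOW` format, `to_sql`, `run_local` and `run_in_database` are hypothetical, and SQLite stands in for the target database.

```python
import operator
import sqlite3
from collections import defaultdict

# Supported filter operators, shared by both execution targets
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# Hypothetical engine-neutral flow: filter rows, then sum a column per group
FLOW = {
    "source": "orders",
    "filter": ("amount", ">", 100),
    "group_by": "region",
    "sum": "amount",
}


def to_sql(flow):
    """Code generation for a database target (pushdown / ELT style)."""
    col, op, val = flow["filter"]
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    sql = (
        f"SELECT {flow['group_by']}, SUM({flow['sum']}) FROM {flow['source']} "
        f"WHERE {col} {op} ? GROUP BY {flow['group_by']}"
    )
    return sql, (val,)


def run_in_database(flow, conn):
    """Execute the generated SQL where the data already lives."""
    sql, params = to_sql(flow)
    return dict(conn.execute(sql, params).fetchall())


def run_local(flow, rows):
    """Execute the same flow in-process, standing in for an ETL engine."""
    col, op, val = flow["filter"]
    totals = defaultdict(int)
    for row in rows:
        if OPS[op](row[col], val):
            totals[row[flow["group_by"]]] += row[flow["sum"]]
    return dict(totals)


# Hypothetical sample data loaded into both targets
rows = [
    {"region": "east", "amount": 150},
    {"region": "east", "amount": 50},
    {"region": "west", "amount": 300},
    {"region": "west", "amount": 120},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(r["region"], r["amount"]) for r in rows]
)

local_result = run_local(FLOW, rows)
db_result = run_in_database(FLOW, conn)
```

Both targets return the same totals from one flow definition, so the choice of runtime can be made on data locality and available resources rather than on how the job was written.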
New, expanded or enhanced Connectivity for both
Structured and Unstructured Data
Unified Governance - A New Era of Governance
Governance 1.0
• Data within the firewall
• Distinct capabilities for structured & unstructured data
• Compliance use cases: e-Discovery, Records, Archiving, GDPR, BCBS 239, Basel II etc.
• IT led
Governance 2.0
• Data, APIs, & Analytics in or outside the firewall (Hybrid platform)
• Common capabilities: Policy Administration, Metadata, Consent Management, & Stewardship
• Compliance & analytics use cases: Information Repositories (e.g. Data Lakes), Self-service analytics, Regulations, & Data Science, GDPR, BCBS 239, Basel II etc.
• IT & Business led
[Figure: Governance 1.0 users: IT. Governance 2.0 users: IT, Analysts, Data Scientists, Developers]
Use Cases Driving a Unified Governance Strategy
GOVERNANCE FOR COMPLIANCE
Discover, classify and manage information in
ways that meet the obligations enforced by
both regulatory and corporate mandates
• Regulations (e.g. GDPR)
• Privacy & Protection
• eDiscovery
• Records & Retention
• Archiving
• Audit Readiness
GOVERNANCE FOR INSIGHTS
Provide safe access to trusted, high quality,
fit-for-purpose data while facilitating effective
collaboration among team members
• Self-Service Access to Data & Analytics
• Governed Enterprise Information Repositories (such as Data Lakes)
Unified Catalog – The Core of the Unified Governance
[Figure: a Mastered, Open, Enterprise Information Catalog fed by structured data, unstructured info and other sources through a Transformation & Delivery Fabric. Labels shown around the catalog: Governance Services, Information Security, Metadata, Test Data, Auto Info Classification, Privacy/Protect, Shop 4 Info, Archiving, Retention/Disposal, Collaborate, Quality Mgmt, Records, eDiscovery, Workflows, Consent, Policy Mgmt, Lineage]
What do we mean by Hybrid Integration?
• Optimizing workloads based on data locality
• Distributing workloads across loosely coupled runtimes
• Choice of runtimes based on your data delivery requirements
• On-demand / elastic expansion
• Combined SaaS and on-premise self-service prep / integration
Thank You