10: E Rossborough Data Analytics

The Content Intelligence Company
Eric Rossborough
Bytes, Basics and Beyond
March 2017
About us
Haystac
• The Content Intelligence Company
• Privately held & self-funded
• Launched in 2014
• Headquarters in Newton, Massachusetts
• ~20 employees
Working with Engineering and
Operations, we identified and classified
a large set of scanned documents to
address regulatory compliance
requirements around key topics.
We developed a County-wide solution to
classify and extract data points for a large
volume of scanned images as well as
electronically stored information (including
emails).
Situation analysis – Some goofy math
For a sense of scale - Some goofy math (just for fun):
In shared files today at a large US Bank:
Estimate: 10 PB = 10,000 TB = 10,000,000 GB = 150,000,000,000 pages of “Dark Data”
• 1 box of documents = 0.833 ft high
• 150,000,000,000 pages = 75,000,000 boxes
• Bank Building in Atlanta = 310 ft = 372 boxes
• 10 PB = 75,000,000 boxes = 201,612 bank towers ≈ 11,800 miles (75,000,000 boxes x 0.833 ft)
• Distance from surface of earth to stratosphere = 30 miles
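For anyone who wants to check the goofy math, here is a minimal Python sketch that recomputes the chain from the slide's own inputs. The variable names are illustrative, and the stack height assumes every box is stacked one on top of the other.

```python
FEET_PER_MILE = 5280

# Inputs taken from the slide
pages = 150_000_000_000   # pages of "Dark Data"
boxes = 75_000_000        # boxes of documents
box_height_ft = 0.833     # height of one box
tower_height_ft = 310     # bank building in Atlanta

# Derived figures
pages_per_box = pages // boxes                          # pages that fit in one box
boxes_per_tower = int(tower_height_ft / box_height_ft)  # whole boxes stacked to tower height
towers = boxes // boxes_per_tower                       # full towers of stacked boxes
stack_height_miles = boxes * box_height_ft / FEET_PER_MILE  # all boxes in one stack

print(f"pages per box:      {pages_per_box:,}")
print(f"boxes per tower:    {boxes_per_tower:,}")
print(f"towers:             {towers:,}")
print(f"stack height (mi):  {stack_height_miles:,.0f}")
```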
Why Content Analytics
• Reduce information security risk
  • Reduce potential hack “strike zone”
  • PII, PCI, PHI, etc.
  • Confidential or restricted content
• Lower storage management costs
  • Reliably identify Relevant vs Redundant, Obsolete, and Trivial (ROT) content
  • Eliminate ROT data
• Accelerate document-based business processes
  • Commercial/retail loan origination
  • Forensic accounting
• Dynamically classify content according to business value and events
  • Mergers and acquisitions
  • Litigations and e-discovery
  • Audits
• Consistently apply best practices for Information Governance
• Minimize end-user impact on content indexing
• Improve accuracy and speed of content searches
• Report on content for advanced analytics
Cross-Industry Use Cases
• Storage management and legacy information cleanup
  • IT cost reduction
• Information governance
  • Corporate and regulatory compliance
• Information Security
  • Sensitive PII/PCI/PHI content identification and remediation
  • Retention/disposition content
• Data Monetization
• Data Migration
• Litigation and E-Discovery acceleration
• Process improvement initiatives
  • Mergers and acquisitions
  • Document analytics
Cross-Industry Use Cases - Examples
• Large US Bank – Cost reduction and Information Security
  • 25 PB of content in file shares - $100 M/year expenditure and growing
  • 6 versions of FileNet, SharePoint – expensive to maintain, poor user value
  • Large stream of digitized paper coming from business (retail banking in particular)
• Large Electrical Utility – Info security and governance
  • 6 PB of content in OpenText Content Server + x PB in file shares
  • Under corporate mandate to universally develop and apply retention and disposition policies
• Integrated Oil and Gas – Acquisition (Data Migration)
  • Over 5 PB of content derived from acquisition
  • Unknown value and risk of content
  • Large volume of PST files
  • Mandate to migrate from ECM (OpenText, SharePoint) and file shares to Corporate ECM (Documentum)
• Large Canadian Bank – Migration and Governance
  • Over 3,000 applications running on Notes
  • Corporate mandate to migrate content to Corporate ECM (Documentum)
Our discussion today – Large US Bank
• Historically, long-term storage of XXXX’s information assets (data and e-documents) has supported an environment where structured and unstructured
information is over-retained, and disposed of infrequently and inconsistently.
• User-created records can be stored anywhere
• Little or no retention or Lifecycle Governance (value vs risk)
• Lack of search findability
• Not always secure – can contain PHI, PII or other sensitive information
• Increased cost for e-discovery, storage, and backups
• Increased RISK!
Problem Statement
• Using Indāgō Content Analytics, crawled and indexed all NAS drives, both personal and shared
• Presented our findings from a high-level review of the primary NAS storage environments:
  • Surfaced current storage size of ~1.9 PB and corresponding managed storage costs of ~$18MM/yr.
  • Initial estimates indicate that operationalizing the disposal of unnecessary data could reduce storage expenses by ~$10MM in year one (with organic growth / ROT reduction assumptions).
  • To validate the size of the opportunity, there is a need to interrogate storage environments and quantify business benefits associated with disposing of “ROT” data (Redundant, Obsolete, Transient) as defined by corporate policies.
Current Unstructured Content
Environment
High Level Findings
Environment ROT Summary – 6/08/2016
Business Case
Financial Impact
Unstructured Cleansing *
1) Prohibited File Types: 7%
- Review File List
- Haystac to ID Data
- Mitigate
2) Non-Accountable Data: 8%
- Abandoned Home Shares
- Orphaned Home Shares
- N/A Data in Common Shares
3) Aging Data: 25%
- Home Share Data 2+ Years (85 TB)
- Records past retention
- Common Share or Shared Drives
4) Duplicate Data: 10%
- Home Shares
- Shared Drives
- Across Enterprise
* Very Conservative percentages with Content Analytics
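To make three of these categories concrete, here is a minimal Python sketch that triages a file share for prohibited file types, aging data and exact duplicates. It is an illustration only, not how Indāgō works: the extension list, the two-year threshold on last-modified time, and the `triage` and `sha256_of` helpers are all hypothetical. Non-accountable data is left out because it depends on ownership information a plain file walk does not have.

```python
import hashlib
import time
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical policy values; real ones would come from corporate policy.
PROHIBITED_EXTENSIONS = {".mp3", ".mp4", ".exe"}
AGING_THRESHOLD_SECONDS = 2 * 365 * 24 * 3600  # "2+ years"


def sha256_of(path: Path) -> str:
    """Hash a file's content in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def triage(root: Path, now: Optional[float] = None) -> Dict[str, List[Path]]:
    """Walk a share and list candidates per category (a file may appear in several)."""
    now = time.time() if now is None else now
    report: Dict[str, List[Path]] = {"prohibited": [], "aging": [], "duplicate": []}
    first_seen: Dict[str, Path] = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.suffix.lower() in PROHIBITED_EXTENSIONS:
            report["prohibited"].append(path)
        if now - path.stat().st_mtime > AGING_THRESHOLD_SECONDS:
            report["aging"].append(path)
        fingerprint = sha256_of(path)
        if fingerprint in first_seen:
            # Same bytes as an earlier file: exact duplicate
            report["duplicate"].append(path)
        else:
            first_seen[fingerprint] = path
    return report
```

Calling `triage(Path("/some/share"))` returns the candidate lists for review. Hashing only catches byte-identical copies; near-duplicates would need a different technique.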
Using the same data that was originally presented:
$10M/y x 50% = $5M/y
Using 40% New Data Growth Rate (NDGR)
$25 M saved over 5 years
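The arithmetic behind these figures can be sketched in a few lines of Python. One assumption to flag: the 50% factor is read here as the sum of the four Unstructured Cleansing percentages, which the slides imply but do not state. How the 40% NDGR enters the calculation is not shown on the slide, so it is not modeled.

```python
# Percentages from the Unstructured Cleansing slide
cleansing_percent = {
    "Prohibited File Types": 7,
    "Non-Accountable Data": 8,
    "Aging Data": 25,
    "Duplicate Data": 10,
}

# Assumption: the 50% factor is the sum of the four categories
total_percent = sum(cleansing_percent.values())

originally_presented_musd = 10  # $10M/y from the original estimate
years = 5

annual_savings_musd = originally_presented_musd * total_percent / 100
five_year_savings_musd = annual_savings_musd * years

print(f"Total cleansing share: {total_percent}%")
print(f"Annual savings:        ${annual_savings_musd:.0f}M/y")
print(f"Savings over {years} years:  ${five_year_savings_musd:.0f}M")
```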
What is Haystac Indāgō
• Comprehensive and scalable Content Analytics
  • Machine learning and Visual Content Intelligence
  • Connectors to FileNet, Documentum, OpenText ContentServer, SharePoint, Google Drive, etc.
  • 600+ file types, including scanned and PDF documents
• Searches, crawls, profiles and clusters unstructured data repositories
  • File-shares, Email, Google Drive, Enterprise Content Management (ECM), SharePoint, Office365, etc.
  • Identifies ROT and Sensitive data
  • Automatically profiles and clusters relevant data
• Manages content and metadata in-place within ECM
  • Applies dynamically known or derived classification model
  • Applies visual classification to scanned and PDF content
  • Applies retention policies to content
• Automatically extracts data points from content
  • Auto-indexes electronic documents
  • Targeted OCR (visual anchor) for scanned and PDF documents
What is Move to Manage – Process
• Identifies ROT, Non-records, Dups and Near-Dups to reduce volume of content to be moved
• Tags sensitive content
• Leverages existing protocols, connectors and accelerators
[Diagram: step 1, File Shares are sorted into ROT, Nonrecords, Dups and Records; step 2, FileNet and SharePoint]
What is Manage in Place – Process
[Diagram: crawl-based inventory of content and metadata (likely daily syndication) with Google Drive and AODocs; published reports of metadata updates]
• Haystac Indāgō integrates with key ECM systems and classifies content, providing decision support
• Disposition or management of content will happen at system of record
• System of Management responsible for CRUD actions (Create, Read, Update, Delete)
Unleash the Power of Content
Understand, Classify, Act
Director of Sales: Eric Rossborough – [email protected]