The Content Intelligence Company
Eric Rossborough
Bytes, Basics and Beyond
March 2017

About us
Haystac
• The Content Intelligence Company
• Privately held & self-funded
• Launched in 2014
• Headquarters in Newton, Massachusetts
• ~20 employees

Working with Engineering and Operations, we identified and classified a large set of scanned documents to address regulatory compliance requirements around key topics.

We developed a county-wide solution to classify and extract data points for a large volume of scanned images as well as electronically stored information (including emails).

Situation analysis – Some goofy math
For a sense of scale, some goofy math (just for fun):
In shared files today at a large US bank:
Estimate: 10 PB = 10,000 TB = 10,000,000 GB = 150,000,000,000 pages of “Dark Data”
• 1 box of documents = 0.833 ft high
• 150,000,000,000 pages = 75,000,000 boxes
• Bank building in Atlanta = 310 ft = 372 boxes
• 10 PB = 75,000,000 boxes = 201,612 bank towers = 31 miles
• Distance from surface of earth to stratosphere = 30 miles

Why Content Analytics
• Reduce information security risk
  • Reduce potential hack “strike zone”
  • PII, PCI, HCI, etc.
  • Confidential or restricted content
• Lower storage management costs
  • Eliminates ROT data
  • Reliably identify Relevant vs Redundant, Obsolete, and Trivial (ROT) content
• Accelerate document-based business processes
  • Commercial/retail loan origination
  • Forensic accounting
• Dynamically classify content according to business value and events
  • Mergers and acquisitions
  • Litigations and e-discovery
  • Audits
• Consistently apply best practices for Information Governance
• Improve accuracy and speed of content searches
  • Minimize end-user impact on content indexing
  • Report on content for advanced analytics

Cross-Industry Use Cases
• Storage management and legacy information cleanup
• IT cost reduction
• Information governance
• Corporate and regulatory compliance
• Information Security
• Sensitive PII/PCI/PHI content identification and remediation
• Retention/disposition content
• Data Monetization
• Data Migration
• Litigation and E-Discovery acceleration
• Process improvement initiatives
• Mergers and acquisitions
• Document analytics

Cross-Industry Use Cases - Examples
• Large US Bank – Cost reduction and Information Security
  • 25 PB of content in file shares - $100 M/year expenditure and growing
  • 6 versions of FileNet, SharePoint – expensive to maintain, poor user value
  • Large stream of digitized paper coming from business (retail banking in particular)
• Large Electrical Utility – Info security and governance
• Integrated Oil and Gas – Acquisition (Data Migration)
  • Mandate to migrate from ECM (OpenText, SharePoint) and file shares to Corporate ECM (Documentum)
• Large Canadian Bank – Migration and Governance
  • 6 PB of content in OpenText Content Server + x PB in file shares
  • Under corporate mandate to universally develop and apply retention and disposition policies
  • Over 3,000 applications running on Notes
  • Corporate mandate to migrate content to Corporate ECM (Documentum)
  • Over 5 PB of content derived from acquisition
  • Unknown value and risk of content
  • Large volume of PST files

Our discussion today – Large US Bank
• Historically, long-term storage of XXXX’s information assets (data and e-documents) has supported an environment where structured and unstructured information is over-retained, and disposed of infrequently and inconsistently.
  • User-created records can be stored anywhere
  • Little or no retention or Lifecycle Governance (value vs risk)
  • Lack of search findability
  • Not always secure – can contain PHI, PII or other sensitive information
  • Increased cost for e-discovery, storage, and backups
  • Increased RISK!

Problem Statement
• Using Indāgō Content Analytics, crawled and indexed all NAS drives – both personal and shared drives
• Presented our findings from a high-level review of the primary NAS storage environments:
  • Surfaced current storage size of ~1.9 PB and corresponding managed storage costs of ~$18 MM / yr.
  • Initial estimates indicate that operationalizing the disposal of unnecessary data could reduce storage expenses by ~$10 MM in year one (with organic growth / ROT reduction assumptions).
• To validate the size of the opportunity, there is a need to interrogate storage environments and quantify the business benefits associated with disposing of “ROT” data (Redundant, Obsolete, Transient) as defined by corporate policies.
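
For a feel of what the Problem Statement figures imply, here is a minimal back-of-the-envelope sketch in Python. It uses only the approximate numbers quoted above (~1.9 PB, ~$18 MM / yr, ~$10 MM). The function names, the decimal conversion of 1 PB = 1,000 TB, and the idea that cost scales linearly with capacity are assumptions made for illustration, not figures or methods from the deck.

```python
# Approximate figures quoted on the Problem Statement slide.
STORAGE_PB = 1.9               # current NAS storage size, in PB
ANNUAL_COST_USD = 18_000_000   # managed storage cost per year
YEAR_ONE_CUT_USD = 10_000_000  # estimated year-one expense reduction

TB_PER_PB = 1_000  # assumption: decimal units (1 PB = 1,000 TB)


def cost_per_tb_year(storage_pb: float, annual_cost: float) -> float:
    """Implied managed-storage cost per TB per year (assumes linear scaling)."""
    return annual_cost / (storage_pb * TB_PER_PB)


def reduction_share(reduction: float, annual_cost: float) -> float:
    """Fraction of the annual storage cost that the estimated reduction represents."""
    return reduction / annual_cost


if __name__ == "__main__":
    unit = cost_per_tb_year(STORAGE_PB, ANNUAL_COST_USD)
    share = reduction_share(YEAR_ONE_CUT_USD, ANNUAL_COST_USD)
    print(f"Implied cost: ${unit:,.0f} per TB per year")
    print(f"Year-one reduction: {share:.0%} of annual storage cost")
```

Under these assumptions the implied unit cost comes to a little under $9,500 per TB per year, and the estimated year-one reduction is a bit more than half of the annual storage cost.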

Current Unstructured Content Environment – High Level Findings
Environment ROT Summary – 6/08/2016

Business Case Financial Impact
Unstructured Cleansing*
1) Prohibited File Types: 7%
   - Review File List
   - Haystac to ID Data
   - Mitigate
2) Non-Accountable Data: 8%
   - Abandoned Home Shares
   - Orphaned Home Shares
   - N/A Data in Common Shares
3) Aging Data: 25%
   - Home Share Data 2+ Years (85 TB)
   - Records past retention
   - Common Share or Shared Drives
4) Duplicate Data: 10%
   - Home Shares
   - Shared Drives
   - Across Enterprise
* Very conservative percentages with Content Analytics

Using the same data that was originally presented:
$10M/y x 50% = $5M/y
Using 40% New Data Growth Rate (NDGR)
$25M saved over 5 years

What is Haystac Indāgō
• Comprehensive and scalable Content Analytics
• Searches, crawls, profiles and clusters unstructured data repositories
  • File shares, email, Google Drive, Enterprise Content Management (ECM), SharePoint, Office365, etc.
  • Identifies ROT and sensitive data
  • Automatically profiles and clusters relevant data
  • Manages content and metadata in place within ECM
• Machine learning and Visual Content Intelligence
  • Connectors to FileNet, Documentum, OpenText Content Server, SharePoint, Google Drive, etc.
  • 600+ file types, including scanned and PDF documents
  • Dynamically applies a known or derived classification model
  • Applies visual classification to scanned and PDF content
  • Applies retention policies to content
• Automatically extracts data points from content
  • Auto-indexes electronic documents
  • Targeted OCR (visual anchor) for scanned and PDF documents

What is Move to Manage – Process
1. Identifies ROT, non-records, dups and near-dups to reduce the volume of content to be moved; tags sensitive content
2. Leverages existing protocols, connectors and accelerators
[Diagram labels: File Shares; ROT, Non-records, Dups; Records; FileNet, SharePoint]

What is Manage in Place – Process
[Diagram labels: Crawl-based inventory of content and metadata; Likely daily syndication; Google Drive; AODocs; Published reports of metadata updates]
• Haystac Indago integrates with key ECM systems and classifies content, providing decision support
• Disposition or management of content will happen at the system of record
• System of Management is responsible for CRUD actions (Create, Read, Update, Delete)

Unleash the Power of Content
Understand, Classify, Act
Director of Sales: Eric Rossborough – [email protected]
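
Backup: the arithmetic on the Business Case Financial Impact slide can be sketched in a few lines of Python. This is one reading of the slide, not a statement of how the figures were actually produced: it assumes the 50% is the sum of the four Unstructured Cleansing percentages and that the $25M is five flat years of $5M. The deck does not say how the 40% New Data Growth Rate enters the calculation, so it is left out. All function and variable names are hypothetical.

```python
# Percentages from the "Unstructured Cleansing" list on the slide.
ROT_SHARE_PCT = {
    "Prohibited File Types": 7,
    "Non-Accountable Data": 8,
    "Aging Data": 25,
    "Duplicate Data": 10,
}

BASE_USD_PER_YEAR = 10_000_000  # the "$10M/y" figure on the slide
YEARS = 5                       # the "over 5 years" horizon on the slide


def total_rot_pct(shares: dict) -> int:
    """Sum the per-category ROT percentages."""
    return sum(shares.values())


def annual_savings(base_usd: float, rot_pct: float) -> float:
    """Apply the combined ROT percentage to the yearly base figure."""
    return base_usd * rot_pct / 100


def cumulative_savings(annual_usd: float, years: int) -> float:
    """Flat multi-year total (assumption: no growth or discounting applied)."""
    return annual_usd * years


if __name__ == "__main__":
    pct = total_rot_pct(ROT_SHARE_PCT)
    yearly = annual_savings(BASE_USD_PER_YEAR, pct)
    total = cumulative_savings(yearly, YEARS)
    print(f"Combined ROT share: {pct}%")
    print(f"Savings per year:   ${yearly:,.0f}")
    print(f"Savings over {YEARS} years: ${total:,.0f}")
```

Read this way, the four categories add up to the 50% used on the slide, and the yearly and five-year totals match the $5M/y and $25M shown there.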