
Information Lifecycle Governance
IBM Information Economics for Big Data
The problem with managing information assets today
• Siloed tools and systems for creating, managing, and sharing documents
• Reactive, one-off approaches to eDiscovery and FOIA requests
• Massive duplication within and across repositories
• Disconnect between the retention schedule and how the Department works
• Everyone keeps everything forever
Massive growth in structured and unstructured content
Chart: Worldwide corporate data growth; 80% of data growth is unstructured. (Source: IDC, The Digital Universe, 2010)
Why do companies keep information?
Breakdown of enterprise information (percentages based on the CGOC Summit 2012 Survey):
• Subject to legal hold: 1% (hold and collect evidence)
• Has business utility: 25% (archive for value, then dispose)
• Regulatory record keeping: 5% (retain records, then dispose)
• Everything else: 69% (dispose of unnecessary data)
Cost and risk reduction enables disposal; cost reduction normalizes the growth curve.
StoredIQ Customer Value
• Data Intelligence (Identify, Analyze, Act): dimensional data maps, risk reduction, storage optimization, file share clean-up
• Intelligent eDiscovery (Litigation Readiness): legal hold notifications, true early case assessment, intelligent collections, review platform integration, identify anywhere
• Information Governance (Policy Management): records retention, defensible deletion, compliance enforcement, holistic data views
• Business Analytics (Business Intelligence): recognize business value, collect unstructured data, vertical BA tool integration
All four solution areas run on the Active Information Platform (Big Data archive platform), which reaches across ECM, forensic images/tapes, file servers, email servers, desktops/mobile, SharePoint & enterprise collaboration, cloud, social networks*, and media*.
StoredIQ's rapid solution-validation deployment and information visualization enable discovery, in-place governance, and disposal
Records Management
Key findings:
• We sampled against only 4 of the 625+ record categories and found almost 20,000 matching files, about 6% of the sample set.
• 20% of all objects found were emails stored on the file system.
Potential actions:
• Expand the rules to cover all 625 record categories.
• Cross-reference all records found against age and retention policies, then move them to the archive or retention/collaboration platform as needed, with all relevant tags and retention times.
Data's Value and Risk
Key findings:
• We quickly created complex information sets, for example finding all documents related to Apples, customer accounts, and renewals (3,700+ files, versus 460 files for Bananas in the same query).
• Using the StoredIQ social security number algorithm, we found over 1,450 documents containing SSNs.
• Checking the age of both data sets, we determined that most of this data was well past its prime, with over 50% five years old or older.
Potential actions:
• Maintain a full index allowing for right-sized, rapid, and accurate exports and results for any discovery case or need involving unstructured data.
• Expand the filters to include credit card data, PII, and ePHI as determined by the Customer.
• Move data containing PII and ePHI to a secure location for proper handling by the security teams or other appropriate parties.
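The StoredIQ SSN algorithm itself is not shown in this deck; the sketch below is only a minimal illustration of the kind of pattern-plus-validation scan such a filter performs. The share path and the validity rules are assumptions, not the product's implementation.

```python
import os
import re

# Candidate SSN pattern: 3-2-4 digits, optional dashes or spaces, not inside a longer number.
SSN_RE = re.compile(r"(?<!\d)(\d{3})[- ]?(\d{2})[- ]?(\d{4})(?!\d)")

def plausible_ssn(area: str, group: str, serial: str) -> bool:
    """Coarse validity check: reject value ranges the SSA never issues."""
    return area not in ("000", "666") and area < "900" and group != "00" and serial != "0000"

def ssn_hits(path: str) -> int:
    """Count plausible SSN-shaped values in one text file."""
    with open(path, "r", errors="ignore") as fh:
        text = fh.read()
    return sum(1 for m in SSN_RE.finditer(text) if plausible_ssn(*m.groups()))

def scan_share(root: str):
    """Walk a file share and yield (path, hit count) for files containing SSN candidates."""
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                hits = ssn_hits(path)
            except OSError:
                continue
            if hits:
                yield path, hits

if __name__ == "__main__":
    for path, hits in scan_share(r"\\fileserver\share"):  # hypothetical share root
        print(f"{hits:4d} SSN candidate(s) in {path}")
```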
Data to Delete or Archive
Key findings:
• Using simple filter sets, we determined that over 30% of the data could easily be considered for archiving and/or disposal.
• Data aged over 5 years was spread across various file types and owners.
• Most of this data set did have owners associated with it.
Potential actions:
• Exclude records, work in progress, and still-relevant data from the aged data sets.
• Move data beyond its life and usefulness to a staging area for timely disposal.
• Move older data that is still in use to archive platforms or cheaper storage.
• Delete prohibited data outright.
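The filter sets themselves are not reproduced in the deck; the sketch below only illustrates one such filter, an age filter over last-accessed timestamps. The five-year threshold comes from the finding above; the share path and the use of filesystem atime are assumptions.

```python
import os
from datetime import datetime, timedelta

FIVE_YEARS = timedelta(days=5 * 365)

def aged_files(root: str, older_than: timedelta = FIVE_YEARS):
    """Yield (path, size_bytes, last_accessed) for files not accessed within the window."""
    cutoff = datetime.now() - older_than
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            accessed = datetime.fromtimestamp(st.st_atime)
            if accessed < cutoff:
                yield path, st.st_size, accessed

if __name__ == "__main__":
    count, total = 0, 0
    for _path, size, _accessed in aged_files(r"\\fileserver\share"):  # hypothetical share root
        count += 1
        total += size
    print(f"{count} archive/disposal candidates, {total / 2**30:.1f} GiB")
```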
Data Age by Data Source
Solution validation overview – Customer-provided sample data set

Data sources:
• CIFS collection: 1,779,557 objects, 461.27 GB; 359,556 email objects (20%). Collection times: metadata 4 hours, full 48 hours.
• SharePoint collection: 5 sites, 1,535 objects, 0.35 GB. Collection time: full 7 minutes.

Records: we tested against 4 Customer records classifications.
• ACC-40040: 17,938 objects, 25.43 GB
• HUM-70080: 1,163 objects, 1.75 GB
• LEG-60200: 475 objects, 0.38 GB
• MAR-10071: 266 objects, 0.62 GB
Destination: retention platform or collaboration platform.

Data clean-up: we looked at data that could be deleted or archived.
• Archival data (last accessed 2010): 826,525 objects, 49.52 GB (destination: archive platform)
• Disposal data (last accessed 2008): 452,159 objects, 39.14 GB (data debris to be deleted)
• Prohibited files: 153,049 objects, 48.93 GB
• Totals: 1,431,733 objects, 137.59 GB (30% of all storage)

Data's value: we built layered information sets.
• "Account" w/3 "Information": 27,928 objects, 17.84 GB
• Narrowing by "Renewal": 7,860 objects, 12.96 GB

Data's risk: we found PII all over the data set.
• SSN algorithm: 1,488 objects, 1.73 GB; over 50% of it was more than 5 years old (destination: secure PII storage)

RESULT:
• Full end-to-end audit trail for all disposed records, files, and data
• Corporate governance policy directives proactively achieved in real time
• Data can easily be moved to the proper retention or collaboration platform based on records policies
• Data automation means data can be put where it needs to live (if it needs to live at all), risk can be mitigated, and data can be available to those who need it, when they need it
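The w/3 proximity query and the "Renewal" narrowing above were run inside StoredIQ; the sketch below only illustrates the idea of layering progressively narrower information sets over an indexed corpus. The proximity logic and the toy documents are assumptions, not the product's query engine.

```python
import re

def within_n_words(text, term_a, term_b, n=3):
    """True if term_a and term_b occur within n words of each other (case-insensitive)."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    pos_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def layered_infosets(documents):
    """Build progressively narrower sets: 'account' w/3 'information', then narrowed by 'renewal'."""
    base = [path for path, text in documents.items()
            if within_n_words(text, "account", "information")]
    narrowed = [path for path in base if "renewal" in documents[path].lower()]
    return {"account_w3_information": base, "narrowed_by_renewal": narrowed}

if __name__ == "__main__":
    docs = {  # toy corpus standing in for the indexed sample set
        "renewal_letter.txt": "Your account renewal information is enclosed.",
        "memo.txt": "General account information for the quarter.",
    }
    for name, members in layered_infosets(docs).items():
        print(f"{name}: {len(members)} object(s)")
```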
Putting Data in the right place
Data sources (the same sample set):
• CIFS collection: 1,779,557 objects, 461.27 GB; 359,556 email objects (20%). Collection times: metadata 4 hours, full 48 hours.
• SharePoint collection: 5 sites, 1,535 objects, 0.35 GB. Collection time: full 7 minutes.

Routing:
• Records: identified through full-text search and moved to the retention platform.
• Data of value: identified through full-text search and moved to either the collaboration or the retention platform.
• PII: identified through full-text search and moved to secure PII storage.
• Data clean-up: identified through metadata and full-text search and either moved to the archive platform or deleted as data debris.
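A minimal sketch of this routing as a lookup from classification to destination: the platform names come from the slide, while the mount points, the copy-based move, and the classification labels are assumptions.

```python
import shutil
from pathlib import Path

# Destination platforms named on the slide; the mount points are hypothetical.
DESTINATIONS = {
    "record":        Path(r"\\retention-platform\ingest"),
    "data_of_value": Path(r"\\collaboration-platform\ingest"),   # or the retention platform
    "pii":           Path(r"\\secure-pii-storage\quarantine"),
    "archive":       Path(r"\\archive-platform\ingest"),
}

def route(path: Path, classification: str, dry_run: bool = True) -> str:
    """Send a classified object to its target platform, or flag it as data debris to delete."""
    if classification == "debris":
        return f"DELETE (data debris): {path}"
    dest = DESTINATIONS[classification] / path.name
    if not dry_run:
        shutil.copy2(path, dest)   # copy first; source removal is left to the clean-up job
    return f"{classification}: {path} -> {dest}"

if __name__ == "__main__":
    # Classifications would come from the full-text and metadata searches described above.
    for path, cls in [(Path("contract_2011.docx"), "record"),
                      (Path("scratch_copy.tmp"), "debris")]:
        print(route(path, cls))
```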
FileShares/SharePoint Archiving and Disposal Flow
Files on file shares and SharePoint flow through StoredIQ analysis, business unit review, and archiving to P8:

Start here: the P8 Content Classifier interface is used to build the Watson-based models for records classification and to refine the rules throughout the process.
1. Candidate files are analyzed by StoredIQ.
2. Data analysis determines ROT (redundant, obsolete, trivial) based on age, frequency of access, record codes, non-conforming file types, and similar criteria, sorting files into three buckets: files to be retained (left in place), files to be archived, and files to be disposed.
3. Atlas policies are applied.
4. Designated business unit experts review the buckets, and files are re-classified after review by the business unit owners; retained files are left in place.
5. Files to be archived are archived using ICC (Content Collector for SharePoint & files).
6. Archived files are moved to a P8 location (they appear deleted to users).
7. Files are moved from P8 after a waiting period.
8. Space is reclaimed and reallocated after ATT via the disk/SharePoint reclaim process: disk space is freed up by archiving and disposal on the primary file/SharePoint location, with stubs left in place to access data from the archive (P8).
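A minimal sketch of the ROT triage in step 2, returning the flow's three buckets; the age threshold, access counts, and non-conforming type list are illustrative assumptions, not the deployed policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ARCHIVE_AGE = timedelta(days=2 * 365)                 # illustrative threshold
NON_CONFORMING = {".mp3", ".iso", ".exe", ".tmp"}     # illustrative prohibited types

@dataclass
class FileFacts:
    path: str
    extension: str
    last_accessed: datetime
    accesses_last_year: int
    record_code: Optional[str]    # e.g. "LEG-60200" when a records rule matched

def triage(f: FileFacts, now: Optional[datetime] = None) -> str:
    """Return one of the flow's buckets: 'retain', 'archive', or 'dispose'."""
    now = now or datetime.now()
    if f.record_code:
        return "retain"           # declared records are left in place under their retention rules
    if f.extension.lower() in NON_CONFORMING:
        return "dispose"          # non-conforming file types are disposal candidates
    if now - f.last_accessed > ARCHIVE_AGE and f.accesses_last_year == 0:
        return "archive"          # old and unused: candidate for ICC archiving to P8
    return "retain"

if __name__ == "__main__":
    f = FileFacts("plan.doc", ".doc", datetime(2009, 3, 1), 0, None)
    print(triage(f, now=datetime(2012, 6, 1)))        # -> archive
```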
A picture tells a thousand words
DATA CLASSIFICATION TOUR
While things are indexing… build classification based on existing policies and the retention schedule
Use the existing email policy and retention schedule to define the classification rules.
Ready to use Enterprise Schedule Classification Rules on all data
Each rule classifies using examples from the existing schedule and training examples from the line of business, and each rule is flexible enough to handle all the varied content types that match its classification.
In addition to Boolean rules, in production we will leverage IBM Content Classification and training sets to build Watson-based filter models.
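A minimal sketch of what one Boolean rule might look like: the retention codes appear elsewhere in this deck, but the terms and rule structure here are placeholders, not the schedule's actual rules and not the Watson-based models.

```python
from dataclasses import dataclass, field

@dataclass
class BooleanRule:
    """A keyword rule that maps matching documents to a retention classification."""
    code: str                                      # retention code, e.g. "LEG-60200"
    all_of: list = field(default_factory=list)     # every term must appear
    any_of: list = field(default_factory=list)     # at least one term must appear (if given)

    def matches(self, text: str) -> bool:
        low = text.lower()
        return (all(t in low for t in self.all_of)
                and (not self.any_of or any(t in low for t in self.any_of)))

# Placeholder rules; real ones would be drawn from the retention schedule and LOB examples.
RULES = [
    BooleanRule("LEG-60200", all_of=["contract"], any_of=["amendment", "agreement"]),
    BooleanRule("HUM-70080", all_of=["employee", "benefits"]),
]

def classify(text: str):
    """Return the retention codes of every rule the document satisfies."""
    return [rule.code for rule in RULES if rule.matches(text)]

if __name__ == "__main__":
    print(classify("Signed contract amendment for the vendor agreement."))   # ['LEG-60200']
```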
Start by viewing the file results at a very high level, by type, location, size, or age
In this example, 20% of the files in the sample set are emails, totaling 359,596 objects.
Quickly understand where critical data resides in the sample set: a view of the information with the LEG-60200 classification applied
Individual data sets can be examined and acted upon. Here we can see that the bulk of the LEG-60200 records are word-processing documents, but 24% (115 objects) are emails.
Once classified, the data can be staged to a CIFS retention location, with metadata, before ingest into the appropriate systems
The action log shows the actions that have been run against the selected infoset and their results.
Auto-classification tags the objects with metadata corresponding to the responsive information sets, with retention codes ready for ingest
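A minimal sketch of carrying those tags alongside staged objects via a JSON sidecar; the staging layout and field names are assumptions, not the product's ingest format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def stage_with_retention_code(path: Path, retention_code: str, staging_dir: Path) -> Path:
    """Copy a classified object to the staging area and write a metadata sidecar
    carrying its retention code, ready for ingest into the downstream system."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / path.name
    staged.write_bytes(path.read_bytes())

    meta = {
        "source_path": str(path),
        "retention_code": retention_code,                       # e.g. "ACC-40040"
        "classified_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(staged.read_bytes()).hexdigest(),
    }
    sidecar = staged.with_suffix(staged.suffix + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar

if __name__ == "__main__":
    src = Path("invoice_2011.pdf")                  # hypothetical classified object
    src.write_bytes(b"%PDF-1.4 example")            # stand-in content so the sketch runs
    print(stage_with_retention_code(src, "ACC-40040", Path("staging/ACC-40040")))
```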
Data Clean-Up – Defensible Disposal
Dynamic Data Topology Maps
A top-three insurance and financial services company identified thousands of unstructured data repositories with relevant claims data across an enterprise-wide SharePoint implementation.
PROBLEM: Identifying the breadth and scope of data in SharePoint sites that relates to individual claims.
SOLUTION: DataIQ found over 50,000 sites (many previously hidden) and correlated content across versions and social wikis/blogs.
ROI: Reduced manual search by 100% and created defensible audit trails for any claim; 5:1 savings in the first year on claims management.
eDiscovery: Comprehensive Response
StoredIQ has successfully helped companies meet their eDiscovery obligations in thousands of cases, with complete accuracy, reliability, and defensibility.
"Companies can more than justify the purchase of in-house eDiscovery software and expect a return on investment in 3-6 months, or after the first matter." - Gartner
PROBLEM: Identifying, collecting, and preserving historical information potentially relevant to the Deepwater Horizon matter in a timely manner.
SOLUTION: StoredIQ enabled rapid indexing of hundreds of terabytes of data, including identification, preservation, and collection across multiple data sources and throughout multinational organizations.
IMPACT: Respond rapidly to legal discovery requests, lower eDiscovery costs, and ensure a complete audit trail and defensibility.
Litigation Readiness
Supported a "bet the company" litigation effort: identified, collected, and analyzed 132 TB of data to produce 200 GB of relevant data.
PROBLEM: For the Deepwater Horizon matter, look across 132 TB, 3 continents, and 8 locations. Collect 1 TB to a preservation location in Houston, build a full-text index, and apply additional terms to reduce the set to the smallest defensible data set, which was then sent out for production review by outside counsel. The final data set was approximately 200 GB.
SOLUTION: Enabled a 100:1 reduction in the collection process in less than 2 weeks.
ROI: Saved $5M+; responded to every DOJ request; lowered outsourced review costs and built a defensible audit trail. Increased case preparation time.
Thank You
© 2012 IBM Corporation