Migrating from FAST to EMC Documentum xPlore: What To Do and Why You'll Love It
Ed Bueché, EMC Distinguished Engineer and xPlore Architect
Agenda
• Introduction to xPlore
• xPlore 1.2 new capabilities
• FAST-to-xPlore Migration best practices
Documentum xPlore at-a-glance
• Documentum xPlore is the next-generation search for Documentum
• Replaces FAST: support for the FAST version of search within Documentum ends at end-of-year 2011
• Technology Foundation: EMC xDB (native XML database) and Lucene
• D6.5 SP2 (and later) compatible indexing system
– D6.5 SP2/3, D6.6, D6.7
– Clients supported by D6.5 SP2 will work without change (with small corner-case exceptions)
– Dual-mode migration: xPlore & FAST both active (index & query) on the same repository
Migrate to xPlore
• Support for search within Documentum based on FAST ends Dec 31, 2011
• Feature is replaced by Documentum xPlore
– xPlore 1.0 since Oct 2010
– xPlore 1.1 since April 2011
– xPlore 1.2 ~ Nov 2011
– No additional license cost for xPlore!
• Hundreds of deployments have already converted and hundreds more are in progress
Other Sessions on xPlore at Momentum 2011
• Optimizing EMC Documentum: Performance and Scalability
– Wednesday, 2nd November 2011, 2:00pm – 2:45pm
– Covers xPlore scalability and tuning best practices
• Optimizing EMC Documentum: Best Practices for Deployment
– Wednesday, 2nd November 2011, 3:00pm – 3:45pm
– Covers additional Documentum scalability improvements as well as HA / DR best practices for xPlore
What is new in xPlore 1.2 (Nov 2011)
• Thesaurus support
o Synonyms, alternate spellings, acronyms
o Based on SKOS structure
• Customizable Natural Language Processing Pipeline
o UIMA support and custom text extraction
• Query-based subscriptions
o Ability to subscribe to a query executed on an interval and be notified of the results
• Indexing and query performance
o Multiple CPS-per-xPlore instance
o Wildcard performance improvements
o Automatic query warmup
• New Languages supported
o Russian, Arabic, Hebrew, and Brazilian Portuguese
• Administration
o Improved deployment and automation (CLI)
o Silent installer (local & remote)
xPlore Features At-a-Glance
xPlore 1.2: Thesaurus
• Improve the “findability” (or recall) of the content
• Allows for customer-defined business thesaurus
• Thesaurus support allows you to query for one name and get hits in documents that have the related names
• Example: in the Pharma industry a drug is known by:
– Scientific name
– Internal code name
– Marketing name
xPlore 1.2 Thesaurus Feature Notes
• Simple Knowledge Organization System (SKOS)
– Standard representation for thesauri, categorizations, etc.
– XML format (RDF)
– Able to represent synonyms, concepts
• Case- and space-insensitive format
• Ability to store multiple thesauri per docbase
• Ability to set a default thesaurus
• Can override thesaurus per query
• Can specify multiple thesauri per query (clause specific)
• Support in DFC Search Service & DQL from D6.5 SP2 and later (in latest patches)
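The thesaurus behaviour described in these notes can be illustrated with a minimal sketch in plain Python. It is not the xPlore or DFC API: the `Thesaurus` class, `normalize`, and `expand` are hypothetical names, and each concept is modelled simply as a list of labels (roughly a SKOS preferred label plus its alternate labels). It shows a case- and space-insensitive lookup that expands one query term into its related names.

```python
def normalize(term):
    """Fold case and drop whitespace so 'PARA cetamol' matches 'paracetamol'."""
    return "".join(term.split()).casefold()


class Thesaurus:
    """Hypothetical in-memory thesaurus: one list of labels per concept."""

    def __init__(self, concepts):
        self._index = {}
        for labels in concepts:
            group = list(labels)
            for label in group:
                # Every label of a concept points at the whole group.
                self._index[normalize(label)] = group

    def expand(self, term):
        """Return all labels of the matching concept, or the term itself."""
        return list(self._index.get(normalize(term), [term]))


# Assumed sample data: three names for the same drug.
pharma = Thesaurus([["acetaminophen", "paracetamol", "acetaminofén"]])

print(pharma.expand("Paracetamol"))
# ['acetaminophen', 'paracetamol', 'acetaminofén']
```

A query processor could then search for any of the expanded terms; a per-query thesaurus override would amount to passing a different `Thesaurus` instance for that query.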
Thesaurus: cross-language term normalization and other use-cases
• SKOS formatted thesaurus allows for cross-language terminology mapping
• Use-case: Ability to search for content in one language and get hits in others
– Not a full translation mechanism but useful for domain-specific cross-language terms
– Only one language is lemmatized, so most useful for names
• Also possible to create other relationships aside from synonyms
Example terms shown: ‘Commission’, Επιτροπή, комисия
Example: searching for acetaminofén (in Spanish)
• With no thesaurus, only Spanish documents are found
• With thesaurus, alternate synonyms in multiple languages are found
xPlore Thesaurus Administration
• Easy import mechanism
• Can define multiple thesauri per docbase
xPlore 1.2: Customizable Natural Language Processing Pipeline
• xPlore 1.2 opens the Natural Language Processing (NLP) pipeline to customization
– Allows customers to go beyond the base linguistic analysis
– Able to inject standard UIMA-compliant customizations
• Use cases include
– Entity extraction
– Classification
– ID normalization
– Custom Text Extraction
Functional view of NLP in xPlore
Incoming doc event → Content Fetch (CF) → Text extraction (TE) → Language Identification (LI) → Linguistic analysis (LA) → Store in index
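The stage order in this functional view can be sketched as a chain of functions. This is an illustration in plain Python, not xPlore code: every function name, the dictionary-based document, and the in-memory `index` are assumptions, and the language identification and linguistic analysis steps are trivial placeholders for the real components.

```python
index = {}  # stand-in for the search index: doc id -> indexed record


def content_fetch(doc):
    doc["raw"] = doc["content"]  # CF: pretend the content was fetched
    return doc


def text_extraction(doc):
    doc["text"] = doc["raw"].decode("utf-8")  # TE: bytes to text
    return doc


def language_identification(doc):
    doc["lang"] = doc.get("lang_hint", "en")  # LI: placeholder, no real detection
    return doc


def linguistic_analysis(doc):
    doc["tokens"] = doc["text"].lower().split()  # LA: placeholder for lemmatization
    return doc


def store_in_index(doc):
    index[doc["id"]] = {"lang": doc["lang"], "tokens": doc["tokens"]}
    return doc


STAGES = [content_fetch, text_extraction, language_identification,
          linguistic_analysis, store_in_index]


def process(doc, stages=STAGES):
    """Run one incoming document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


process({"id": "doc-1", "content": b"Migrating from FAST to xPlore"})
print(index["doc-1"]["tokens"])
# ['migrating', 'from', 'fast', 'to', 'xplore']
```

Because the stages are an ordered list, a custom step can be modelled by inserting another function at the desired position before calling `process`.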
Custom Text Extraction for xPlore
NLP in 1.2: Incoming doc → CF → TE (text extraction customization) → LI → LA → Post-linguistic analysis UIMA extensions → Store in index
• Text extraction customization based on MIME type
• Plugin-customization code can be defined for:
o Pre-Text extraction
o Text Extraction phase
o Post-Text extraction
• Plugins can be Java or C/C++
UIMA customizations supported post-Linguistic Analysis for xPlore
NLP in 1.2: Incoming doc → CF → TE → LI → LA → Post-linguistic analysis UIMA extensions → Store in index
• Unstructured Information Management Architecture (UIMA)
• Apache Standard Architecture for Natural Language Processing customizations
• xPlore 1.2 allows for UIMA components to process and annotate document elements
• Enables annotation of DFTXML without adding relational columns in the RDBMS
Potential UIMA customization use-cases
• ID normalization
– Official Company IDs: ABC-1234-D567-EFG
– Users want to query on D567EFG, because ABC-1234 is not selective
– Pipeline step can create the alternative ID formulation
• Classification
– Examine text of document and automatically tag it with additional metadata based on a taxonomy
– CIS-based classification documented as an xCellerator
• Entity Extraction
– Extract entities with 3rd party entity extractor
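The ID normalization use-case can be sketched as a small function that a pipeline step might call. This is plain Python, not a UIMA annotator: `alternative_ids` and its `keep_last` parameter are hypothetical, and the rule that the selective part is the last two segments is an assumption drawn only from the ABC-1234-D567-EFG example.

```python
import re


def alternative_ids(official_id, keep_last=2):
    """Return the official ID plus alternative formulations worth indexing."""
    # Split on hyphens or whitespace: 'ABC-1234-D567-EFG' -> ['ABC', '1234', 'D567', 'EFG']
    parts = [p for p in re.split(r"[-\s]+", official_id.strip()) if p]
    compact = "".join(parts)            # whole ID without separators
    tail = "".join(parts[-keep_last:])  # assumed selective suffix
    variants = []
    for value in (official_id, compact, tail):
        if value not in variants:       # keep order, drop duplicates
            variants.append(value)
    return variants


print(alternative_ids("ABC-1234-D567-EFG"))
# ['ABC-1234-D567-EFG', 'ABC1234D567EFG', 'D567EFG']
```

Indexing all returned variants alongside the document would let a query on D567EFG match without the user typing the non-selective prefix.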
Advanced Customizations: Remote UIMA & Custom Thesaurus
Indexing path: Incoming doc → CF → TE → LI → LA → Remote UIMA → Store in index
Query path: Query Processor → Custom Thesaurus access → Get additional terms → Query with additional terms
3rd Party external component
Query-based Subscriptions (QBS)
• Users subscribe to the periodic execution of a stored query
• Notified of any new results since the last execution
• Queries execute on defined intervals
– Hour, day, week, or month
• Result notification
– Email, or
– the initiation of an xCP-defined business process
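The subscription behaviour listed here (run a stored query on an interval, report only results that are new since the last execution) can be sketched as follows. This is plain Python, not the QBS API: the `Subscription` class, its fields, and the callback-based notification are all assumptions. The month interval is left out of the sketch because `timedelta` has no month unit.

```python
from datetime import datetime, timedelta

INTERVALS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


class Subscription:
    """Hypothetical subscription of one user to one stored query."""

    def __init__(self, user, query, interval, notify):
        self.user = user
        self.query = query      # callable returning the current result ids
        self.interval = INTERVALS[interval]
        self.notify = notify    # e.g. send email or start a business process
        self.next_run = None
        self.seen = set()

    def run_if_due(self, now):
        """Execute the query if the interval has elapsed; return new results."""
        if self.next_run is not None and now < self.next_run:
            return []
        new = [r for r in self.query() if r not in self.seen]
        self.seen.update(new)
        self.next_run = now + self.interval
        if new:
            self.notify(self.user, new)
        return new


repository = ["doc1", "doc2"]
sent = []
sub = Subscription("userA", lambda: list(repository), "day",
                   lambda user, results: sent.append((user, results)))

start = datetime(2011, 11, 1, 9, 0)
print(sub.run_if_due(start))                       # ['doc1', 'doc2']
repository.append("doc3")
print(sub.run_if_due(start + timedelta(hours=1)))  # [] (not due yet)
print(sub.run_if_due(start + timedelta(days=1)))   # ['doc3']
```

Tracking seen result ids is one simple way to compute "new since the last execution"; a real implementation could equally compare modification dates.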
Query-by-subscription Overview
Query defined and stored (stored queries) → User subscribes to query (subscription) → Query executed automatically → User notified of results, or results fed to an xCP business process
Subscribing to a saved query
Users, Subscriptions, Queries, Results: Some Relationships
• User ‘A’
– Subscription to query #1 from User ‘A’
– Subscription to query #2 from User ‘A’ → Results: user ‘A’ query #2
• User ‘B’
– Subscription to query #2 from User ‘B’ → Results: user ‘B’ query #2
• Queries: Query #1, Query #2
QBS user activity report
Provides information on each user’s subscription activity
QBS activity report by subscription ID
Query-based Subscriptions: Delivery Notes
• Supported only with D6.7 SP1 and later DFC and Content Servers
• TaskSpace components delivered as xCellerator
– To be posted when D6.7 ships in Nov
• API available for custom UIs; this includes
– Stored query definition (dm_smartlist)
– Subscription definition and management
Additional xPlore 1.2 Enhancements
• Multiple CPS processes on a single xPlore instance
– Significantly simplifies content processing scaling
• Improved wildcard query performance
• New Language Certifications
– Russian, Arabic, Hebrew, and Brazilian Portuguese
Agenda
• Introduction to xPlore
• xPlore 1.2 new capabilities
• FAST-to-xPlore Migration best practices
FAST-to-xPlore Migration Best-practices at-a-glance
• Stay current with software
• RTO: Backup / Restore
• Plan and Test scale with larger environments
• Convert Legacy DQL Apps to DFC Search Service
• SANs provide best performance
Stay Current with Software
• xPlore 1.1 shipped with DCTM 6.7 in April 2011
– Why would you start your deployment with xPlore 1.0?
• Patch roll-up releases available each month
– Available Sept 30: xPlore 1.1 P03 and xPlore 1.0 P13
– Available Oct 30: xPlore 1.1 P04
• Some important items covered
– Snapshot-too-old consistency fixes
– Improved diagnostics and repair for index inconsistencies
– Fix for result inconsistency due to updates
RTO: Backup / Restore
• Recovery Time Objective (RTO)
– The target time to restore the system to service after a failure
– Usually a target set by business users
• Example characterization of hardware failure in Google’s data centers:
– In a cluster of 1,800 machines, 1,000 will fail somehow in the first year of service
– Thousands of hard drives will fail
– 50% chance that a rack will overheat
• xPlore migrations typically involve new hardware in new operating environments
– Human & environment failures will be higher than normal
• Time to recovery varies
– Dual xPlore systems provide fastest (but most expensive) RTO
– Sometimes (not always) data failure can be rectified with xPlore repair tools
– Restore from backup is next fastest
– Re-feed from Documentum is the slowest
To be discussed in more detail on Wed, Nov 2 at 3pm in “Optimizing EMC Documentum: Best Practices for Deployment”
Convert Legacy DQL Apps to DFC Search Service
• API Options for Documentum Search Applications
– IDfQuery and DQL
• Legacy compatibility
– DFC Search Service & automatically generated XQuery
• Foundation for Advanced Search since D6.6
– IDfXQuery and custom-defined XQuery
• Used primarily for Zone Search of XML
• For most uses, the DFC Search Service is the best choice
– Best performance: pulls the least amount of data per bounded result
– Native facets supported
• Not part of DQL
• Avoids huge result set ingestion
– More efficient date range query processing
Plan and Test scale with larger environments
• xPlore provides great out-of-box support for Documentum
• However, some aspects might require index tuning
– If index tuning is not done, a re-feed or index rebuild might be required
– Important to find these in larger test environments rather than in production environments
• Items to watch for
– Special character tuning
– Date and integer values for custom DCTM object types
– Metadata wildcard optimization options
– Native facet values
• Leverage free xPlore tools on EDN!
– https://community.emc.com/docs/DOC-8922
SANs provide best performance
• xPlore supports SANs, NAS, and local disk
• Services and functionality vary across disk hardware
• All else equal: SAN volumes offer best performance and are recommended
• However, some capabilities are not available
– Basic xPlore host sparing
– Simple inter-host data movement
• If leveraging NAS, please review latest guidelines on configuration
EMC Symmetrix: Nondisruptive Mobility
Virtual LUN VP Mobility
Virtual Pools: Flash 400 GB (RAID 5), Fibre Channel 600 GB 15K (RAID 1), SATA 2 TB (RAID 6)
• Fast, efficient mobility
• Maintains replication and quality of service during relocations
• Supports up to thousands of concurrent VP LUN migrations
• Recommendation: work with storage technicians to ensure backend storage has sufficient I/O
• Questions?
• Comments?
• See xPlore on EMC Developer Network:
– https://community.emc.com/docs/DOC-8945
THANK YOU
This presentation is also available at
www.momentumeurope.com
password: spree