GPO FDsys Program Startup

FDsys
Data Analysis and Parsing
GPO Fdsys – Search Engine Configuration
July 28, 2017
Data Analysis and Parsing
FDsys
Agenda:
• Data Management Definition
• Parsing
• fdsys.xml
FDsys
Data Management Definition
(DMD)
Data Management Definition
(DMD)
FDsys
• Purpose of the Data Management Definition (DMD)
– Define collection-specific metadata elements
– Specify roles for the granules, if applicable
– Collection-specific schema definition for FDsys.xsd
– Define mappings of metadata elements for Documentum and FAST
– Define mappings to metadata standards
• One DMD for each collection
• PMO & dev team collaborative effort for CDM documentation development
• Is both a document and a process
The DMD Defines how Data Flows Through FDsys
[Figure: Business Process Overview, annotated with the questions the DMD answers.]
Stages shown: Submission, Processing, Access, Preservation
Components shown:
• Congressional Submission Workflow (interactive)
• Congressional Submission Workflow (folder)
• Submission Process
• Bulk Submission Process
• Migration Application
• Ingest Process
• Package Updating Workflow
• Access Processing Workflow
• Archival Updating Workflow
• Archival Processing Workflow
• Publishing Process
• ILS Integration Application
• Public User Access & Delivery Application
• Authorized User Access & Delivery Application
Questions the DMD answers:
• What renditions are available?
• How will metadata be extracted and merged?
• How will parser data and input files be validated?
• What manual edits may be required?
• How will the HTML rendition be created?
• How are PDF files processed?
• How will the MODS be created?
• How will the content and metadata be indexed?
• What's on the search form?
• What are the navigators?
• How are search results formatted?
• What do content URLs look like?
DMD – Table of Contents
1. General Description
2. fdsys.xml Schema Elements
3. Renditions, Plant Processing and Interactions
4. Parser Definition – Extraction patterns and algorithms
5. Content Management
6. Content Publishing and Index
7. Search and Browse
• Search results, navigators, and collection browsing
8. Content Delivery
• URLs, content-detail, Front page, actions
9. mods.xml mappings
Metadata Flow Diagram
[Figure: metadata flow through FDsys. Labels shown, in flow order:]
• Submission: original content file(s) are parsed into fdsys.xml
• Documentum: validate, cleanup, normalize, and extend metadata and renditions
• Publish: Index Push and FAST Notification; fdsys.xml and content file(s) go to the ACP Cache
• XSLT transforms (.xslt) produce index.xml, mods.xml, FASTXML (for the FAST indexes), Content Detail, MODS XML, PREMIS, and the Package TOC
• Search: normalize data and map to FQL; map navigators and fields per collection
• Outputs: Search Form, Search Results, Collection Browsing
Metadata Flow Diagram
[Figure: the metadata flow from Submission through Documentum, publish, the ACP Cache, and the FAST indexes to Search, annotated with the definitions that govern each step:]
• fdsys.xml metadata structure
• parsing rules
• CMS metadata mapping
• search index field mapping
• mods mapping
• content-detail mapping
• search-form mapping
• search results mapping
• browse algorithm
Federal Register Granules
FDsys
• Each article is a granule
• Each Part is a single granule
• There are no higher-level granules
• Sections are not preserved as independent granules
Federal Register Example Metadata
FDsys
agencies
title
action
summary
dates
contact
FR Doc Number
Billing Code
Content Files
FDsys
[Figure: input files and the renditions derived from them.]
Input Files: locator, SGML, CDTP, pdfsubmitted
Renditions: locator, SGML, text, pdf (public)
Processing steps shown:
• extract granules (text)
• OCR embedded images, extract granules (PDF)
• Create “FrontMatter”, “ReaderAids”, and “Issue” PDF files
Content Files –
Creating the HTML Rendition
[Figure: creating the HTML rendition.]
Inputs: text, pdfsubmitted
Steps shown for text: add HTML headers and header metadata; add URL and e-mail links; embed image tags
Steps shown for pdfsubmitted: extract images as JPEG; OCR images
Outputs: html (public), html images, longdesc text
Extracting Metadata
(TOC headings)
[Figure: the SGML TOC, the SGML content, and the CDTP are each parsed; the parsed results are combined (overwrite, add) into the Merged Metadata.]
• Metadata is merged based on the FR Doc Number
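A minimal sketch of a merge keyed on the FR Doc Number, assuming each parse yields one record per granule as a dictionary. The function name, the `frDocNumber` key, and the rule that later sources overwrite earlier ones are illustrative assumptions; the slide does not state which source takes priority.

```python
def merge_by_fr_doc_number(*sources):
    """Merge metadata records from several parses, keyed by FR Doc Number.

    Assumption: sources are listed lowest priority first, so a later source
    overwrites a field set earlier and adds any field not yet present.
    """
    merged = {}
    for records in sources:
        for record in records:
            key = record["frDocNumber"]  # hypothetical field name
            target = merged.setdefault(key, {})
            for field, value in record.items():
                if value not in (None, ""):  # empty values never overwrite
                    target[field] = value
    return merged


# Illustrative records only; the values are made up for the example.
toc = [{"frDocNumber": "E6-1423", "title": "Example title", "agencies": "Example Agency"}]
content = [{"frDocNumber": "E6-1423", "action": "Notice", "title": ""}]
result = merge_by_fr_doc_number(toc, content)
```

Keying on one identifier lets each source contribute the fields it extracts best without the sources needing to agree on granule order.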
Search Results
Example search result:
73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB]
Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008.
...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...
Fields labeled in the example: collection, volume, firstpage, title, section, action (first 20 chars), rin, publishdate, teaser, link to content-detail
FR Navigators
FDsys
• Section
• Agency
• CFRs
– Hierarchical
+ 15 CFR
- Part 12
- Part 13
- Part 14
+ 16 CFR
- Part 412
- Part 413
Collection Browsing
FDsys
yearnav
monthnav
daynav
agencynav
Advanced Search Form
Package-Level URLs
FDsys
• Package Content Detail
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html
• Package Metadata Standards
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml
• Package Table of Contents
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html
• Today’s Table of Contents
– http://www.gpo.gov/fdsys/html/FR/todays_toc.html
Granule-Level URLs
FDsys
• HTML and PDF Files
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html
– http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf
• Granule Content Detail
– http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html
• Granule Metadata Standards
– http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
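The package-level and granule-level URL patterns can be sketched as small builder functions. This is an illustration inferred from the example URLs only; the function names are hypothetical and not part of FDsys.

```python
BASE = "http://www.gpo.gov/fdsys"  # base taken from the example URLs


def package_url(package_id, resource="content-detail.html"):
    """Package-level resource, e.g. content-detail.html, mods.xml, toc.html."""
    return f"{BASE}/pkg/{package_id}/{resource}"


def granule_file_url(package_id, granule_id, fmt):
    """Granule content file; fmt is 'html' or 'pdf' in the examples."""
    return f"{BASE}/pkg/{package_id}/{fmt}/{granule_id}.{fmt}"


def granule_url(package_id, granule_id, resource="content-detail.html"):
    """Granule-level resource such as content-detail.html or mods.xml."""
    return f"{BASE}/granule/{package_id}/{granule_id}/{resource}"
```

Content files for a granule live under the package path, while its content-detail and metadata pages live under the granule path.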
Content Detail
FDsys
Parsing
Parsing Overview
FDsys
• Runs regular expressions to extract metadata
Regular Expression:
(Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)
Example: Pub. L. 109-130
Produces: <law congress="109" number="130"/>
• Written in Java
• Called from Documentum when a package needs to
be parsed
• Produces an instance of fdsys.xml
– Parsing has an internal XML format (called the “raw”
XML) which is transformed to produce the fdsys.xml
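The regular-expression step can be illustrated with a short sketch. The parser itself is written in Java; this Python version is only an illustration, and the function name is hypothetical. The dots are escaped here, whereas the pattern on the slide leaves them unescaped, so there `.` matches any character.

```python
import re

# Same alternatives as the slide's pattern, with literal dots escaped.
PUBLIC_LAW = re.compile(r"(Public Law|Pub\. L\.|PL|P\. L\.) (1[0-9][0-9])-([0-9]+)")


def extract_public_laws(text):
    """Return one <law/> element string per public-law citation found."""
    return [
        f'<law congress="{m.group(2)}" number="{m.group(3)}"/>'
        for m in PUBLIC_LAW.finditer(text)
    ]


laws = extract_public_laws("As amended by Pub. L. 109-130 and Public Law 110-5.")
```

Because the congress number is matched by `1[0-9][0-9]`, the pattern only recognizes citations from the 100th through 199th Congress.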
Parser Foundation Classes
FDsys
[Figure: class hierarchy.]
Foundation classes: PContainer, PParser, PPackage, PRendition, PFile, PGranule
Derived classes shown: USCODE Parser, USCODE Package, USCODE Rendition, USCODE File, USCODE Granule, FR Rendition, FR File
• Foundation classes handle 95% of parsing needs
• Derived classes handle all special cases
PContainer
FDsys
• Takes patterns and produces elements
• Holds XML at each level of the parsing process
[Figure: a PPattern is used by a PContainer and produces an XML fragment, which is stored in the PContainer's XML DOM.]
PPattern:
"(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)"
XML Fragment:
<publicLaw>
<congressNum>109</congressNum>
<lawNum>123</lawNum>
</publicLaw>
Parser Foundation Classes
FDsys
[Figure: containment of the parser classes. PParser holds a PPackage, which holds PRendition objects, which hold PFile objects, which hold PGranule objects. Each is a PContainer with its own XML DOM.]
Operations labeled in the figure: append, append, priority merge, XSLT
Output: xml
Parsing XML Documents
FDsys
[Figure: parsing XML documents. PParser holds a PPackage, which holds PRendition objects, which hold PFile objects. Each is a PContainer with its own XML DOM.]
Operations labeled in the figure: XSLT, append, priority merge, XSLT
Input: bills.xml
Output: fdsys.xml
Other Parsing Considerations
FDsys
• Heuristics testing is integrated into the parsing
– PEHelper: Checks for heuristics and adds “quality=” attributes
• Output can be automatically Schema-Validated
– Schema-Validation is run on all fdsys.xml formats
produced by the parser
• Parser Validation Tool
– Used by GPO to validate that parsers meet the 90%
Service Level Agreement for accuracy
– Randomly selects 100 documents or granules
– Displays metadata & original text for manual review
– Produces Validation Report
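A rough sketch of the sampling and threshold check behind such a validation tool. The function names are hypothetical, and measuring accuracy as the share of sampled items a reviewer marks correct is an assumption; the slide does not say how accuracy is computed.

```python
import random


def sample_for_review(item_ids, sample_size=100, seed=None):
    """Randomly pick documents or granules for manual review."""
    items = list(item_ids)
    rng = random.Random(seed)
    return rng.sample(items, min(sample_size, len(items)))


def meets_sla(review_results, threshold=0.90):
    """review_results holds one bool per reviewed item (True = metadata correct)."""
    if not review_results:
        return False
    return sum(review_results) / len(review_results) >= threshold


picked = sample_for_review(range(1000), seed=1)
```

Sampling a fixed 100 items keeps the manual review effort constant regardless of collection size.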
FDsys
fdsys.xml
FDsys.xml Purpose
FDsys
• Internal container of metadata related to package
• Is a detailed representation/model of the data
structure across all of FDsys
• Reduces duplication of data across metadata
formats
• Reduces number of required transformations
• Can be transformed into standard schemas
including:
– METS
– MODS
– PREMIS
GPO FDsys – Data Model
June 13, 2008
FDsys.xml General Structure
FDsys
Header
Content
Metadata
FDsys
FDsys Publish and Search
Publish and Search
FDsys
Agenda:
• FDsys Publish
• Search Engine Configuration
• Search Engine Application Services
FDsys
FDsys Publish
High-Level SW Components
FDsys
Ingest Component
Submission Component
- Submission Workflows
- WebTop Submission User interfaces
- Content Parsers
- Migration Tool
Content Processing
- Processing Workflows
- WebTop User Interfaces
- Package Management
- ILS Integration
Access Component
- Full-Fledged Search Application
- Full Text Search Engine
- Public Content Access and Delivery
Archive Preservation
- Archival Workflows
- WebTop User Interfaces
- Preservation Process
Infrastructure Component
- COTS-based LDAP Integration
Content Publishing - Overview
FDsys
• Communicates from Documentum to Access
• From: Documentum
– Extract fdsys.xml & premis.xml
– Extract renditions and content files
– Uses Documentum native DFC calls
• To: ACP Cache
– Stores metadata and content files
• To: FAST ESP Search Engine
– Converts fdsys.xml to FAST.xml -> to indexer
• Includes the mods.xml (indexed into ESP)
• ESP pulls in content files automatically
– Uses FAST ESP content_api & search_api calls
Component Interfaces
[Figure: component interfaces. On the CMS side, the Package Updating Workflow and Access Processing Workflow connect to Content Publishing. On the Access side are FAST, the ACP Cache, and the Web Application.]
Component Interfaces
FDsys
[Figure: Content Publishing is driven by HTTP commands, pulls from the Content Management System, and pushes to FAST and the ACP Cache.]
Major Architectural Decisions
FDsys
• Pull from Documentum, not Push
– Maintenance of Access Subsystem databases
becomes the responsibility of the Access Subsystem
– Data is pulled from Documentum only as needed
• Avoids overflow/queuing problems
– Allows multiple access systems to be fielded
• Search for Deletes in FAST
– Packages can contain many granules
– When updating the FAST indexes, use search to find
the list of all nested granules in the indexes
– Guaranteed to avoid any “orphan” granule problems
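The "search for deletes" idea can be sketched with an in-memory stand-in for the index. `InMemoryIndex` and `reindex_package` are hypothetical names for illustration; nothing here is the FAST API.

```python
class InMemoryIndex:
    """Toy stand-in for a search index, mapping granule ID to package ID."""

    def __init__(self):
        self.docs = {}

    def add(self, granule_id, package_id):
        self.docs[granule_id] = package_id

    def delete(self, granule_id):
        self.docs.pop(granule_id, None)

    def search_by_package(self, package_id):
        return sorted(g for g, p in self.docs.items() if p == package_id)


def reindex_package(index, package_id, new_granule_ids):
    """Search for every indexed granule of the package, delete them, then re-add."""
    stale = index.search_by_package(package_id)
    for granule_id in stale:
        index.delete(granule_id)
    for granule_id in new_granule_ids:
        index.add(granule_id, package_id)
    return stale


idx = InMemoryIndex()
for g in ("A", "B", "C"):
    idx.add(g, "FR-2006-01-01")
idx.add("Z", "FR-2006-01-02")
removed = reindex_package(idx, "FR-2006-01-01", ["A", "B"])
```

Because the list of granules to delete comes from the index itself rather than from the incoming package, a granule dropped from the new version (here "C") cannot be left behind as an orphan.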
ACP Cache Directory Structure
FDsys
Proposed ACP Cache Directory: (limits entries per directory to 256)
/ACP/hh/hh/hh/pkgXXXXXXXXXX/<package-contents>
hh/hh/hh: Hexadecimal representation of the lower 24 bits of the MD5 hash of the package ID
pkgXXXXXXXXXX: Package ID
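A minimal sketch of the proposed directory scheme. Two points are assumptions: that "lower 24 bits" means the last three bytes of the MD5 digest, and that the package ID string (in the `pkgXXXXXXXXXX` form) is used unchanged as the final directory name. The function name is hypothetical.

```python
import hashlib


def acp_cache_path(package_id, root="/ACP"):
    """Build /ACP/hh/hh/hh/<package_id> from the MD5 hash of the package ID."""
    digest = hashlib.md5(package_id.encode("utf-8")).digest()
    low24 = digest[-3:]  # assumption: lower 24 bits = last three digest bytes
    levels = [f"{byte:02x}" for byte in low24]
    return "/".join([root, *levels, package_id])


path = acp_cache_path("pkg0000000001")
```

Each `hh` level is one byte written as two hexadecimal digits, so no level can hold more than 256 subdirectories, which is the limit the slide cites.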
Metadata Flow Diagram
[Figure: metadata flow for granule files. Labels shown, in flow order:]
• Original content is parsed into fdsys.xml and granule file(s)
• Documentum: validate values & normalize
• Publish: Index Push and FAST Notification; fdsys.xml and granule file(s) go to the ACP Cache
• XSLT transforms (.xslt) produce index.xml, search.xml, FASTXML (for the FAST indexes), Content Detail, MODS XML, PREMIS, and the Package TOC
• Search: normalize data and map to FQL; map navigators and fields per collection
• Outputs: Search Form, Search Results, Collection Browsing
Implementation Detail
FDsys
[Figure: FDsys Publish implementation. Labels shown:]
• FDsys Publish: servlet wrapper, processing requests via URL
• Documentum, accessed through the Documentum APIs
• ACP Cache: fdsys.xml and content files for each Package
• FAST Search for Deletes
• FAST Content Processing: fdsys.xml and individual granule files
• FAST Search Engine
FDsys
Search Engine Configuration
Design
Component Interfaces
[Figure: component interfaces. On the CMS side, the Package Updating Workflow and Access Processing Workflow connect to Content Publishing. On the Access side are FAST, the ACP Cache, and the Web Application.]
FAST System – Hardware & Network
FDsys
[Figure: FAST hardware and network layout. A publish & admin node, document processors, five index & search nodes, five search nodes, and the Web Application.]
FAST System – Indexing Flow
[Figure: indexing flow across the publish & admin node, the document processors, the five index & search nodes, and the five search nodes.]
FAST System – Search
[Figure: search flow. The Web Application connects to five QR servers, which sit in front of the five index & search nodes and five search nodes.]
Search Engine Sizing: Columns
FDsys
• Total Number of Documents
– Estimated 10 million records
• Each granule = 1 Search Engine document
– Allow 2x expansion for estimation errors and growth
– Estimated 20 million records
• Sizing Recommendation:
– FAST recommends: 5 million records per column
• For public facing web sites
– 5 columns: to account for the large number of
navigators
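The column estimate can be reproduced as a short calculation. The function name is hypothetical; the inputs are the slide's figures.

```python
import math


def columns_needed(estimated_records, expansion_factor, records_per_column):
    """Minimum number of columns for the padded record count."""
    return math.ceil(estimated_records * expansion_factor / records_per_column)


minimum_columns = columns_needed(10_000_000, 2, 5_000_000)
```

The arithmetic gives a minimum of four columns; the slide recommends five, attributing the extra column to the large number of navigators.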
Search Engine Sizing: Disk
FDsys
• Year 2006 FR – Index Sizing Test
Documents   Text    Fixml   Index   Total FAST
31,500      500mb   230mb   604mb   834mb
• Scale to 20 million documents
– Fixml: ~150gb
– Index: ~420gb
• Total index space required:
– 150gb + (420gb)*2 = 990gb, or ~1tb
– Add 50% for estimation error, total = 1.5tb
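A worked version of the disk estimate, as a sketch. Straight linear scaling of the 2006 FR test gives roughly 146 GB of fixml and 383 GB of index; the slide plans with ~150gb and ~420gb and does not say how the index figure was rounded or padded, so the total below uses the slide's own figures. Function names are hypothetical.

```python
def scale_mb_to_gb(sample_mb, sample_docs, target_docs):
    """Linearly extrapolate a measured size to a larger document count."""
    return sample_mb * target_docs / sample_docs / 1000


fixml_gb = scale_mb_to_gb(230, 31_500, 20_000_000)
index_gb = scale_mb_to_gb(604, 31_500, 20_000_000)


def total_index_space_gb(fixml, index, margin=0.5):
    """Slide formula: fixml + 2 * index, then add a margin for estimation error."""
    base = fixml + 2 * index
    return base, base * (1 + margin)


base_gb, padded_gb = total_index_space_gb(150, 420)
```

With the slide's figures the base is 990 GB and the padded total is 1,485 GB, which the slide rounds to 1tb and 1.5tb.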
Search Engine Sizing: QPS
FDsys
• Queries per second – Estimated from GPO Access
– 0.8 QPS (across the whole day)
– Estimated peak: 2.4 qps (1/2 of queries in 4 hours)
• Estimated Peak QPS for FDsys:
– Factor for improved search interface: 3x
– Factor for growth: 2x
– Estimated: 2.4 x 2 x 3 = 14.4, or ~15 QPS
– Correlates with other websites known to ST
• Each row: 20-30qps
– Therefore: 1 row for query performance
• Recommend: 2 rows
– 2nd row for redundancy, failover, and maintenance
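The QPS reasoning can be checked with a few lines of arithmetic. The function name is hypothetical, and sizing rows against the low end of the 20-30 qps range is a conservative assumption.

```python
import math


def peak_qps(average_qps, share_of_queries, window_hours):
    """Peak rate if a given share of the day's queries falls in one window."""
    daily_queries = average_qps * 24 * 3600
    return daily_queries * share_of_queries / (window_hours * 3600)


gpo_access_peak = peak_qps(0.8, 0.5, 4)     # half of the queries in 4 hours
fdsys_peak = gpo_access_peak * 2 * 3        # growth 2x, better interface 3x
rows_for_load = math.ceil(fdsys_peak / 20)  # low end of 20-30 qps per row
```

One row covers the estimated load; the second recommended row is for redundancy, failover, and maintenance, not capacity.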
Metadata Flow Diagram
[Figure: the metadata flow from Submission through Documentum, publish, the ACP Cache, and the FAST indexes to Search, with the "search index field mapping" step highlighted between publish and the FAST indexes.]
Mapping to the Index Profile
FDsys
[Figure: mapping to the index profile.]
Inputs: fdsys.xml (through index.xslt and mods.xslt); content files from the ACP Cache (through the FAST Extractors); the results are merged
Index profile fields:
• resultsbundle: metadata for search results
• xml scope field: metadata for simple search
• grank1-6
• publishdate
• title
• body
• standard and collection-specific navigators: metadata for navigation
Collections and Codes
FDsys
• Three types of collections:
– Processing Collection = “collectionCode” or
“processingCode”
• Parsing, submission, processing, workflow
• Chooses which index.xslt and search.xslt to apply
– FAST Index Collection
• One for each processing collection
• Allows easily deleting all documents in a collection
– “Access Collection” = “accode”
• Re-group documents into collections for public users
• 98% the same as the “Processing Collection”
– Reports in the Congressional Record, FR Unified Agenda
• Mapping is done in index.xslt
Simple Maintenance #1
FDsys
• New Collections
– Just add the collection (admin GUI)
– Start feeding data
• Add new fields
– Add the field to the index profile
– Reload profile with a hot-update
• Backups
– Turn off feeding
– Wait for documents in process to finish up
– Make index backups
Simple Maintenance #2
FDsys
• Archiving Log Files
– Simple file copy, can happen any time
• Correct Field Mapping Errors
– Remove all documents in the FAST ESP collection
– Re-index collection from ACP Cache
• Get list of packages in the collection from Documentum
• Does not require re-exporting (or re-publishing) the packages from CMS
• Reorganize Access Collections
– Remove all documents from affected collections
– Re-index affected collections from ACP Cache
Extensive Maintenance
FDsys
• Examples:
– New FAST Version
– New FDsys Version
– Complex index-profile changes (remove fields, major
restructuring)
– Re-organizing collections or field mapping while
maintaining searches on the old snap-shot
• Process:
– Switch servers to “stand-alone” mode
– Make changes
– Restore normal server operations
Monitoring
FDsys
• FAST standard monitoring tool (“Clarity”)
• Monitors query and indexing performance
• Built-in alerting mechanism
Backups
FDsys
• data_fixml
– Holds processed copy of the indexes
– Can be used to reconstruct the indexes in about 4
hours (will need to be benchmarked)
• data_index
– The complete indexes actually used for searching
• Configuration backup
• When restoring a backup:
– Will need to re-push all content updates which
occurred since the last backup
Disaster Recovery Scenarios
FDsys
• Servers crash
– FAST restarts them automatically
• Hanging server processes
– Shut it down manually and restart it
• Incremental indexing overloads the system
– Should not happen in FDsys
– Can “slow down” incremental indexing until situation
is corrected
• Severe incremental indexing problems
– Revert to periodic batch index updates
FDsys
Search Services API
Component Interfaces
FDsys
[Figure: component interfaces for access. Components shown: FAST and the FAST indexes, the Access Search API, the Access Search Web App (search results, collection browsing), the ACP Cache, and the Content Delivery Web App (browsing PDF, content-detail).]
Responsibilities: Search Services vs Web Application
FDsys
Search Services API
• All communication to FAST
– All FAST API calls
– All FAST parameters
• All FAST FQL
– User query strings and parameters to FAST FQL
• Raw data values
– Allowed values, navigator values, search results field values, etc.
• Choosing Navigators
Web Application
• Choosing which fields when
– Advanced Search Form
– Search Results
• User-interface oriented field data
– Display names, help text, display widgets
• Display value translation
– translating from raw data values to/from user-friendly values
Responsibilities: Search Services vs Web Application – Browse Trees
FDsys
Search Services API
• Browse tree computation
– Selecting nodes to return
– Returning an ordered list of nodes
– Caching search results
• Embargo Dates
Web Application
• Browse tree presentation
– The definition of the levels
– How many levels to display when
– Presentation of tree
– Translating raw data values to user friendly values
• Content Detail Pages
Component Interfaces
FDsys
[Figure: component interfaces for search. Components shown: the FAST Search Engine, the FAST Search API, the Search Services API (parsing, processing & caching) with its XML configuration files, the Master Search Web Application, and collection-specific components (FR, CR, CFR, CongBills). Interfaces labeled: HTTP and Java method calls.]
Configuration File Contents
FDsys
• Fields
– for the Advanced Search Form
– for field: searches
– Allowed values
• Fixed enumerated list
• Enumerated list built from navigator
• Numeric or Date Range
• Navigators per collection
• FAST ESP Search Engine Connection Info
• Templates to reformat data for display
Query Parsing: Syntax
FDsys
• atom
– Atoms are space separated lists of characters,
double-quoted strings, or parenthetical expressions
• atom atom
– defaults to AND
• atom and atom
• atom or atom
• atom before/# atom
• atom near/# atom
• atom adj atom
• not atom
• +atom
• -atom
• field:atom
• range(#,#)
• range(<date>,<date>)
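A small tokenizer sketch for the syntax above, assuming straight double quotes in the query string. It only splits a query into labeled tokens; it does not build FQL. The token labels and function name are illustrative, not the Search Services API.

```python
import re

# Order matters: quoted phrases, then range(...) expressions, then parentheses, then words.
TOKEN = re.compile(r'"[^"]*"|(?:\w+:)?range\([^)]*\)|\(|\)|[^\s()"]+')
OPERATORS = {"and", "or", "not", "adj"}


def tokenize(query):
    """Split a query into (kind, text) pairs."""
    tokens = []
    for raw in TOKEN.findall(query):
        lowered = raw.lower()
        if raw in ("(", ")"):
            tokens.append(("paren", raw))
        elif raw.startswith('"'):
            tokens.append(("phrase", raw.strip('"')))
        elif re.fullmatch(r"(?:\w+:)?range\(.*\)", lowered):
            tokens.append(("range", raw))
        elif lowered in OPERATORS or re.fullmatch(r"(?:near|before)/\d+", lowered):
            tokens.append(("op", lowered))
        else:
            tokens.append(("term", raw))
    return tokens


example = tokenize('congress and (report or "ways and means")')
```

Quoting a phrase keeps words such as "and" from being read as operators. A later stage would insert the implicit AND between adjacent atoms.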
Query Parsing: Examples
FDsys
• hearing
• “congressional hearing”
• congressional adj hearing
• congressional hearing
• ways and means
• “ways and means”
• ways “and” means
• congressional or congress
• congress and (report or meeting or notice)
• congnum:range(103,110)
• congressional not report
• congressional -report
• +cardin congressional committee
• congressional not (committee report)
• congressional not (committee or meeting)
• representative near/10 cardin
• representative before/10 cardin
Derived Hierarchy: Example
FDsys
• 110th Congress
– House Bills
• H.R. 1-200
• H.R. 201-400
– H.R. 201
– H.R. 202
– H.R. 203
• Engrossed in House
• Introduced in House
. Condemning the persecution of labor rights advocates in Iran [PDF] [Text]
• Referred in Senate
– H.R. 204
• H.R. 401-600
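The range level of the derived hierarchy can be sketched as a bucketing function. The bucket size of 200 is taken from the example; how FDsys actually derives the ranges is not stated, and the function names are hypothetical.

```python
def bill_range_label(prefix, number, bucket_size=200):
    """Label of the range node a bill number falls under, e.g. 'H.R. 201-400'."""
    start = ((number - 1) // bucket_size) * bucket_size + 1
    end = start + bucket_size - 1
    return f"{prefix} {start}-{end}"


def group_bills(prefix, numbers, bucket_size=200):
    """Group bill numbers under their range labels, preserving sorted order."""
    tree = {}
    for number in sorted(numbers):
        label = bill_range_label(prefix, number, bucket_size)
        tree.setdefault(label, []).append(f"{prefix} {number}")
    return tree


tree = group_bills("H.R.", [203, 1, 201, 400, 401])
```

Deriving the nodes from the data means range nodes appear only for bill numbers that exist in the index.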
Specified Hierarchy: Example
Parser Foundation Classes
FDsys
[Figure: containment of the parser classes without renditions. PParser holds a PPackage, which holds PFile objects, which hold PGranule objects. Each is a PContainer with its own XML DOM.]
Operations labeled in the figure: append, append, XSLT
Output: xml