XPipe - An XML Processing Methodology

XML SIG, NY USA
Feb 12, 2002
Sean McGrath
CTO
Propylon
XML SIG NY, Sean McGrath http://www.propylon.com
What is XPipe?
• It is an architecture / methodology / framework
for developing robust, scalable, manageable
XML processing systems.
• Based on proven mechanical manufacturing
techniques. Specifically:
– The Assembly Line Principle
– Component assembly and component re-use
What is XPipe?
• An open source project hosted on Sourceforge
– http://xpipe.sourceforge.net
• A contribution to the blossoming meme of using
pipeline based processing to tame the burgeoning
complexity of XML transformations
– (If you do not find XML transformation complicated,
you are not sufficiently well informed.)
– (And no, XSLT does not solve all your problems)
What is XPipe?
• A way of thinking about systems that focuses on
structured dataflows rather than Object APIs
• It is also:
– A Scandinavian sewage treatment technology
– An exhaust pipe system for high performance engines
– A VT100 based strategy game for DEC's VAX/VMS
Operating System
Contents of this talk
• The XPipe philosophy
• Major functional elements
• Some examples
• The XGrid and Commoditized XML Processing
• Some anticipated objections (and answers)
• Relationship to other technologies
Contents of this talk
• Current status
• Current problems
• Future plans
• Some (contentious) musings
• Something cold to drink
XPipe Philosophy
• XML is all about (potentially) complex,
hierarchical data structures
XPipe Philosophy
Cars are complex, hierarchical structures
Henry Ford’s Model T Ford Assembly Line – 1914
XPipe Philosophy
Lunch is a complex, hierarchical structure
Lunch Assembly Line. NY, 2002
XPipe Philosophy
We are complex, hierarchical structures
XPipe philosophy
• What have these scenes got in common?
– Complex construction of cars, tuna melts and
tendons made possible and efficient through
• assembly line manufacturing
• re-usable component processes and component
materials
• Why not apply this approach to XML
“manufacturing”?
XPipe philosophy
• Why does the assembly line approach work?
– Transformation task decomposition
– Re-usable transformation components
• Transformation decomposition is the key to
complexity management. Just ask:
– Henry Ford
– Herbert Simon (The Two Watchmakers – “The Architecture of
Complexity”)
– George Miller (7+/-2)
– Adam Smith (An Inquiry into the Nature And Causes of the
Wealth of Nations,1776)
– Any electrical or chemical engineer.
XPipe philosophy
• Component re-use is the key to productivity
– Ask any form of engineer (electrical, chemical
etc.) apart from software engineers…
– Component re-use remains a holy grail in
software engineering
– XPipe is yet another attempt…
XPipe philosophy
• A lot of data processing for the foreseeable future
will consist of XML to XML transformation
• A lot of non-XML data processing can consist of
XML to XML transformations with the addition of
top and tail transformations
• Mantra
– Get data into XML as quickly as possible
– Keep it in XML until the last possible minute
– Bring all your XML tools to bear on solving the data
processing problem
XPipe philosophy
[Diagram: Non-XML Input → Top Transformation → XML Input → XML Output → Tail Transformation → Non-XML Output]
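The top and tail idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not XPipe code: the function names `top_csv_to_xml` and `tail_xml_to_text`, the `<rows>`/`<row>` element names, and the choice of CSV as the non-XML format are all hypothetical, and the CSV column names are assumed to be valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def top_csv_to_xml(csv_text):
    """Top transformation: get non-XML input (CSV) into XML as early as possible."""
    root = ET.Element("rows")  # hypothetical envelope element
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for field, value in record.items():
            # Assumes each CSV column name is a legal XML element name.
            ET.SubElement(row, field).text = value
    return ET.tostring(root, encoding="unicode")


def tail_xml_to_text(xml_text):
    """Tail transformation: leave XML only at the last possible minute."""
    root = ET.fromstring(xml_text)
    lines = []
    for row in root:
        lines.append(", ".join(f"{cell.tag}={cell.text}" for cell in row))
    return "\n".join(lines)


xml_doc = top_csv_to_xml("name,qty\nwidget,3\n")
print(xml_doc)                    # <rows><row><name>widget</name><qty>3</qty></row></rows>
print(tail_xml_to_text(xml_doc))  # name=widget, qty=3
```

Everything between the two calls can then be ordinary XML to XML work.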
XPipe philosophy
• The philosophy hinges on the fact that every
complex XML transformation can be broken down
into a series of smaller ones that can be chained
together
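A minimal sketch of chaining, assuming each stage is a black box that takes an XML string and returns an XML string. The stage names (`uppercase_stage`, `number_children_stage`) and the `run_pipe` helper are made up for illustration; real XPipe stages are standalone programs, not Python functions.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def uppercase_stage(xml_text):
    """Small transformation 1: upper-case the leading text of every element."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        if el.text:
            el.text = el.text.upper()
    return ET.tostring(root, encoding="unicode")


def number_children_stage(xml_text):
    """Small transformation 2: number the root's children with an 'n' attribute."""
    root = ET.fromstring(xml_text)
    for position, child in enumerate(root, start=1):
        child.set("n", str(position))
    return ET.tostring(root, encoding="unicode")


def run_pipe(stages, xml_text):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda doc, stage: stage(doc), stages, xml_text)


pipe = [uppercase_stage, number_children_stage]
print(run_pipe(pipe, "<doc><foo>baz</foo><foo>qux</foo></doc>"))
# <doc><foo n="1">BAZ</foo><foo n="2">QUX</foo></doc>
```

Because every stage only sees XML in and XML out, either stage could be swapped for one written with a different paradigm without touching the other.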
XPipe philosophy
• Only so many ways to re-arrange an XML tree structure
• A finite number of fundamental transformations, from
which all higher order transformations can be derived
XPipe philosophy
– Transformation Decomposition leads to
• a series of small, manageable, “stand alone”
problems with an XML input “spec” and an XML
output “spec”.
• Can build, test, use and then re-use these
transformation components
• Very team development friendly
• High cohesion, loose coupling – just like the
professor advised
XPipe philosophy
• Pipeline approach means you can mix ‘n’ match
black-box components that internally use
whatever paradigm best suits the problem
• Lexical
• SAX
• DOM
• XSLT
• XDuce, Pyxie, Haskell, AF-NG…
Sample XPipe
[Diagram: a sample XPipe fed from a DB/CMS. Stages: Character Set Mods; Add Doctype + validate + strip doctype; Re-arrange Elements; Validation; Stats + FTP; SQL Replace; XHTML Generate. Technologies shown against the stages: Lexical, DOM, Schematron/RelaxNG, Rhino, Jython, Java, XSLT]
XPipe philosophy
• Assertion : developers would use a component
based approach to XML processing if they did not
have to write the plumbing (orchestration,
exception handling) themselves
– “Gee, this problem is complex. Maybe I’ll do it in
multiple stages! Gee, now I have to orchestrate the
stages somehow. Batch files/shell scripts/driver
program – all ugly and error prone. Maybe I’ll just
write a single program after all…”
XPipe philosophy
• “Professional developers spend 50
percent of their time writing plumbing” –
Adam Bosworth
• XPipe aims to look after the plumbing
letting developers concentrate on the
interesting stuff
Philosophy Summary
• Preambles
– Make things as complex as necessary but not more
complex than necessary
– Solve all the world's problems – but only one at a time
– Don’t even think about performance until it is too late –
then it will look after itself
– Only increase complexity linearly w.r.t. functionality
and only in “elevator pitch sized” functionality quanta
Philosophy Summary – 1#2
• Data processing == data transformation
w.r.t. time.
• XML is the current runaway winner in the
self-descriptive data stakes and a very good
QDDL (Quiescent Data Description
Language)
Philosophy summary – 2#2
• Inside every complex XML transformation is a
sequence of simpler XML transformations trying
to get out – a Pipe
• Decomposed transformation = new
transformations + already componentized
transformations -> Component Reuse
• Inside every graph transformation (read
“workflow” or “business process model”) is a
combination of simple Pipes trying to get out
XPipe Philosophy
Leveled architecture – levels build on one another
but any level is usable independently of higher
levels
[Diagram: three stacked levels, each with its own In and Out: Level 2 - XRigs, Level 1 - XPipes, Level 0 - XComponents]
Major Functional Elements –
XComponents
[Diagram: In → XComponent → Out]
• Developed in any language that runs on the
Java Virtual Machine (Jython, Java, XSLT,
Rhino (JavaScript) etc.)
• All XComponents are standalone programs
of the form
– [Name] [InputXML] [OutputXML]
[ErrorXML] [Optional Args]
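A rough sketch of what a program of that shape might look like, written in Python for brevity (Jython is the nearest JVM equivalent). The component itself (`rename`), the function names `run_component` and `main`, and the layout of the error document (`<error>` with the message as text) are assumptions for illustration; the XPipe schemas define the real conventions.

```python
import xml.etree.ElementTree as ET


def run_component(input_xml, output_xml, error_xml, old, new):
    """Rename every <old> element to <new>; return 0 on success, 1 on failure."""
    try:
        tree = ET.parse(input_xml)
        for el in list(tree.getroot().iter(old)):
            el.tag = new
        tree.write(output_xml, encoding="utf-8", xml_declaration=True)
        return 0
    except (ET.ParseError, OSError) as exc:
        # Hypothetical error format: a single <error> element carrying the message.
        err = ET.Element("error")
        err.text = str(exc)
        ET.ElementTree(err).write(error_xml, encoding="utf-8", xml_declaration=True)
        return 1


def main(argv):
    """argv follows the slide: [Name] [InputXML] [OutputXML] [ErrorXML] [Optional Args]."""
    if len(argv) != 6:
        print("usage: rename INPUT.xml OUTPUT.xml ERROR.xml OLD NEW")
        return 2
    _name, input_xml, output_xml, error_xml, old, new = argv
    return run_component(input_xml, output_xml, error_xml, old, new)
```

Run as a script, the entry point would be `sys.exit(main(sys.argv))`. Keeping input, output and error as three separate XML streams is what lets an executive chain components without knowing what is inside them.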
Major Functional Elements –
XComponents
• XComponents described in XML form. An
XComponent consists of:
– Metadata (keywords etc.)
– Documentation
– Pre and Post Conditions
– Unit Tests (input, output XML stream pairs +
Pre/Post Conditions)
– Code (Java / Jython / XSLT / Exec)
Major Functional Elements –
XPipes
[Diagram: In → XPipe (a chain of XComponents) → Out]
• A linear assembly of XComponents that together
achieve some useful transformation function
• Described in XML
– Documentation
– Metadata (keywords etc.)
– Pre/Post conditions
– Unit Tests (input, output XML stream pairs + Pre/Post
Conditions)
– References to XComponents (URIs) which are resolved
when the XPipe is installed/executed
Major Functional Elements –
XRigs
[Diagram: an XRig with more than one In and Out flow]
• An assembly of XPipes that together achieve some
useful transformation function
• Described in XML
– Documentation
– Metadata (keywords etc.)
– Pre/Post conditions
– Unit Tests (input,output XML stream pairs + Pre/Post
Conditions)
– References to XPipes (URIs) which are resolved when
the XRig is installed/executed
Major Functional Elements
• Unit Testers
– XComponent, XPipe and XRig level Test Harnesses
• Executives
– XComponent, XPipe and XRig level Execution
Environments (on-the-fly, disk install, compiled, web
service…)
– (Executing an XComponent is identical to executing an
XPipe of arity 1, is identical to executing an XRig of
arity 1…)
Major Functional Elements
• Executives
– Uniprocessor Execution
• Executed on 1 CPU, possibly with separate threads
for each instantiated X*
– Multiprocessor Execution (Vapor)
• XML based protocol to implement “Job Shop” work
distribution over a P2P network (XJCL)
Major Functional Elements –
XPipe Monitor (Vapor)
Major Functionality Elements –
Miscellany (Vapor)
• Whizzy GUI Component and Pipe Editors
• XComponent Creators
– “Wrap” Java, XSLT etc. into XComponent compliant
XML, Ant build target
• XComponent Proxies – “pretend” to be a simple
XComponent but invoke some external
functionality – from Windows DLL to SOAP endpoint
• XPipe masquerading as XComponent – this could
be a very powerful paradigm
Major Functionality Elements –
Miscellany (Vapor)
• Compilers / Packers
– Pack XPipes/XRigs into standalone XPipes/XRigs for
distribution (with or without an executive)
– Compile pure XSLT XPipe into a self contained translet
(self contained or as an XComponent)
• “Compile away”/optimize intermediate files via a
variety of tricks (Jackson Inversion, Java IO hook,
shadow marshalling etc.)
Simple XComponent examples
• Fundamental Operation – Rename Element
– Rename
• Input : <foo>baz</foo>
• Output: <bar>baz</bar>
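A minimal sketch of Rename using Python's standard `xml.etree.ElementTree`. The function name `rename_element` and its parameters are illustrative only and are not part of XPipe.

```python
import xml.etree.ElementTree as ET


def rename_element(xml_text, old, new):
    """Rename every element called `old` (including the root) to `new`."""
    root = ET.fromstring(xml_text)
    for el in list(root.iter(old)):
        el.tag = new
    return ET.tostring(root, encoding="unicode")


print(rename_element("<foo>baz</foo>", "foo", "bar"))  # <bar>baz</bar>
```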
Simple XComponent examples
• Fundamental Operation - Peel
• Input : <foo><bar>baz</bar></foo>
• Output: <foo>baz</foo>
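Peel can be sketched as "remove the wrapper, keep what was inside it". The version below is one possible reading, assuming the wrapper's text, child elements and trailing text should all be spliced into the parent in order; the names `peel`, `_peel_in_place` and `_add_text` are hypothetical, and the root element itself is never peeled.

```python
import xml.etree.ElementTree as ET


def _add_text(parent, prev, text):
    """Append text after the previous sibling, or as the parent's leading text."""
    if not text:
        return
    if prev is None:
        parent.text = (parent.text or "") + text
    else:
        prev.tail = (prev.tail or "") + text


def _peel_in_place(parent, tag):
    children = list(parent)
    for child in children:
        _peel_in_place(child, tag)  # handle nested wrappers first
        parent.remove(child)        # detach; re-attached or spliced below
    prev = None
    for child in children:
        if child.tag != tag:
            parent.append(child)    # kept as is; its tail travels with it
            prev = child
            continue
        _add_text(parent, prev, child.text)
        for grandchild in list(child):
            parent.append(grandchild)
            prev = grandchild
        _add_text(parent, prev, child.tail)


def peel(xml_text, tag):
    """Remove every non-root <tag> wrapper, keeping its content in place."""
    root = ET.fromstring(xml_text)
    _peel_in_place(root, tag)
    return ET.tostring(root, encoding="unicode")


print(peel("<foo><bar>baz</bar></foo>", "bar"))  # <foo>baz</foo>
```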
Simple XComponent examples
• Compound Operation - Matryoshka
• Input:
– <foo><bar>baz</bar></foo>
• Output:
– <foo></foo><bar></bar>baz
Simple XComponent examples
• KlingonCloak
– Input:
• <foo><bar>baz</bar></foo>
– Output:
• <tag name="foo"><tag name="bar">baz</tag></tag>
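A minimal sketch of KlingonCloak, assuming the original element name moves into a `name` attribute and every element becomes `<tag>`. The function name `klingon_cloak` is hypothetical, and what should happen to an existing `name` attribute is not specified in the talk; this sketch simply overwrites it.

```python
import xml.etree.ElementTree as ET


def klingon_cloak(xml_text):
    """Hide every element name inside a generic <tag name="..."> element."""
    root = ET.fromstring(xml_text)
    for el in list(root.iter()):
        el.set("name", el.tag)  # assumption: overwrites any existing 'name' attribute
        el.tag = "tag"
    return ET.tostring(root, encoding="unicode")


print(klingon_cloak("<foo><bar>baz</bar></foo>"))
# <tag name="foo"><tag name="bar">baz</tag></tag>
```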
Sample XComponents
• Once you start thinking in terms of Pipes –
components appear everywhere:
– Regular fragmentations
– Doctype changer
– Namespace normalizer
– Character set transcoder
– Hash generator
– Architectural Forms
– RelaxNG/Schematron etc
• A validator can be thought of as a component in an XPipe that
mirrors its input on its output
Sample XComponents
• Reading a file is an XML to XML
transformation
– <file>lewisscarrol.xml</file>
– <poem><line>'Twas brillig, and the slithy
toves did gyre and gimble in the
wabe</line>…</poem>
Sample XComponents
• Arithmetic is an XML to XML
transformation
– <expr>1 + 2</expr>
– <res>3</res>
Sample XComponents
• Unix pipe utilities e.g. tr
– hello world
– HELLO WORLD
Sample XComponents
• Conditionals are XML to XML
transformation “tee junctions” triggered by
XPaths
[Diagram: In → "if XPath" junction → "if XPath TRUE" branch or "if XPath FALSE" branch]
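A small sketch of such a tee junction, assuming branches are XML-string-in, XML-string-out callables. The name `tee` and its parameters are made up for illustration. Note that Python's `xml.etree.ElementTree` only supports a limited subset of XPath, so a full XPath engine would be needed for arbitrary expressions.

```python
import xml.etree.ElementTree as ET


def tee(xml_text, xpath, true_branch, false_branch):
    """Route the document to one of two branches depending on an XPath match."""
    root = ET.fromstring(xml_text)
    matched = root.find(xpath) is not None
    branch = true_branch if matched else false_branch
    return branch(xml_text)


def fast_lane(doc):
    return "<fast>" + doc + "</fast>"


def slow_lane(doc):
    return "<slow>" + doc + "</slow>"


print(tee("<order><urgent /></order>", ".//urgent", fast_lane, slow_lane))
# <fast><order><urgent /></order></fast>
```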
Validation as an XComponent
[Diagram: XML A (Input) → Validation XComponent (RelaxNG, Schematron, Jython/Java/JACL) → XML A’ (Output), with a Validation Log and an Error output]
Some related open technologies
• | - Unix Pipes
• SAX Filters
• TRAX
• XBeans
• Cocoon
• AxKit
• Ant
• JXTA
• Translets
• TupleSpaces
The XGrid
• Grid Technologies – computational power
“on tap” (http://www.gridforum.org)
• The XGrid – computational power “on tap”
to execute XPipes/XRigs
The XGrid
[Diagram: the XGrid, with In and Out flows and a DMZ]
Some objections (with some
answers)
• It will be slow
– No it won’t. Premature optimization
is the root of all evil!
– Speed is a three headed
monster. I’m old
enough to have left the
X axis and currently
heading for Y through
Z
[Diagram: The 3 Axes to Speed. Axes: Speed of Execution, Speed of Development, Speed of Modification. Points marked: Me at age 26, Me at age 36, Me at age 46 (Projected)]
Some objections (with some
answers)
• It will be slow (cont.)
– Massive Parallelism will kill all von Neumann
throughput arguments
• Documents per second, not seconds per document –
throughput is the true measure of XML processing speed
• Document fulcra – Locality of reference (Denning) applies to
XML processing (more on this later)
– A myriad of “compile time” optimizations on XPipes
possible
– Keep the architecture simple – and speed will sort itself
out
Some objections (with some
answers)
• Component based software? Harumph! We
have heard that one before…
– XPipe is data flow based not API based (COM,
VBX, CORBA). The payload is what is
important – not the plumbing
– Information integration (needed on the server
side)– not application integration (needed on
the client side)
Document fulcra and the
scatter/gather pattern
• For any given task t to be performed on
documents conforming to schema s, there is
a fragment expression that can be used to
chop any document into n pieces on which t
can be performed independently
• These points are called fulcra and are a
function of (t,s)
Document fulcra and
scatter/gather pattern
• Having identified the fulcra:
– Chop the input document into fragments – scatter phase
– Perform t
– Join all the processed fragments together to constitute
the output document – gather phase
• Three stage XPipe – scatter & gather are (or more
accurately soon will be) standard XPipe
components
Document Fulcra
[Diagram: Input Doc → Scatter → n fragments → Invoke t (one t per fragment, along a TIME axis) → n fragments → Gather → Output Doc]
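The scatter/gather pattern can be sketched as below. This is an illustration under simplifying assumptions, not the XPipe scatter and gather components: the fulcra are taken to be the direct children of the root with a given tag (the data-oriented "record" case), attributes on the root are dropped, and threads stand in for whatever workers a real executive would use. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def scatter(xml_text, fulcrum_tag):
    """Chop the document into one standalone fragment per fulcrum element."""
    root = ET.fromstring(xml_text)
    fragments = []
    for record in root.findall(fulcrum_tag):
        record.tail = None  # tail text would make the fragment ill-formed on its own
        fragments.append(ET.tostring(record, encoding="unicode"))
    return root.tag, fragments


def gather(root_tag, fragments):
    """Join the processed fragments back together under a fresh root."""
    root = ET.Element(root_tag)
    for fragment in fragments:
        root.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")


def scatter_gather(xml_text, fulcrum_tag, task, workers=4):
    """Three stage pipe: scatter, perform the task on each fragment, gather."""
    root_tag, fragments = scatter(xml_text, fulcrum_tag)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(task, fragments))  # map keeps input order
    return gather(root_tag, processed)


def shout(fragment):
    """Example task t: upper-case the text of one fragment."""
    el = ET.fromstring(fragment)
    el.text = (el.text or "").upper()
    return ET.tostring(el, encoding="unicode")


print(scatter_gather("<rows><row>a</row><row>b</row></rows>", "row", shout))
# <rows><row>A</row><row>B</row></rows>
```

Each call to the task only ever holds one fragment in memory, which is where the bounded resource usage comes from.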
Document Fulcra
• For data-oriented XML, the fulcra often coincide
with the “record” iteration in the XML schema and
may be independent of t.
• For document-oriented XML, the fulcra are much
more dependent on t.
• <Colloquial>A good fulcra based scatter/gather
will make performance head north faster, cheaper
and with a higher upper limit than any amount of
hand-crafted, genius level XML coding of your
transformations.</Colloquial>
The XSLT/DOM -> SAX non sequitur
• XSLT and DOM are memory bound – trade
off between ease of use and resource usage
– ease of use favoured
• SAX is not memory bound – trade off
between ease of use and resource usage –
low resource usage favoured
• On xml-dev users often advised to rewrite
their apps using SAX! Ugh!
XSLT/DOM -> XPipe
• XPipe and scatter/gather allow you to keep
the ease of use of XSLT/DOM with the
finite resource utilization of SAX
• As long as you can identify a good fulcrum
function
– They exist more often than not
– If they exist, they are very easily found
Current status
• The philosophy is known to work
– Seven years agrowing in consulting company
(IDM 1995, Digitome)
– Uniprocessor XPipe used to develop
• 80-C pipe from Hub notation for a complex
document type to a legacy mainframe display
notation. 120 page spec.
• 20-C pipe for semantic validation of legislation
documents
Current Status
• Version 0.6
• Schemas for XPipes and XComponents on
xpipe.sourceforge.net. – feedback required
• Sample components (Java/XSLT/Jython)
and some documentation
• Simple, illustrative XComponent and XPipe
uniprocessor executive
Current Status
• Object model for XComponents in Jython
+ Java (David Starr)
• Object model for XPipes in Jython
• Execution, testing utilities in Jython
• Start of a NetBeans based XComponent
editor
Current Status
• Uniprocessor XPipe used to develop
– 80-C pipe from Hub notation for a complex
document type to a legacy mainframe display
notation. 120 page spec.
– 20-C pipe for semantic validation of legislation
documents
– XPipe and XComponent validators
Current Status
• Some aspects of the XComponent model need
testing
– Parameters
– Exec XComponents
– Pre/Post condition checking
• This will be a point release in late Feb. Then focus
on developing the XComponent repository in
parallel with core dev.
• Scatter/Gather raises some interesting scheduling
issues currently being grappled with
• Balance between developer-hit and ease of
execution currently in favour of low developer-hit
Current Problems
• No GUI stuff and not enough documentation
• Everybody agrees that an XML document is a tree
but:
– The content and structure of the tree depends on the
parser
– The content and structure of re-generated XML (The
round-tripping problem)
– Roll on XML-SW!
Current Problems
• Naming things
– Taxonomy of XTLs (XML Transformation
Languages)
– Taxonomy of re-usable XComponents and
XPipes
Current Problems
• Flexible transformation scheduling is hard
• Optimal transformation scheduling is very
hard
• Calling all process engineers – help!
Future Plans
• Evangelize the idea that DTD validated XML 1.0
is just Well Formed XML that has been through a
pipe consisting of:
– A transclusion component (entity expansion)
– A macro pre-processor (conditional marked sections)
– An attribute decorator (implied/fixed attributes)
– A grammar checker
– …
Valid XML
[Diagram: Well Formed XML → Parameter Entity Expansion → Conditional Sections → General Entity Expansion → Attribute Decoration → Grammar Validation → Valid XML]
Future plans
• When DOCTYPE goes away (which it
will), provide all DTD functionality as a set
of XComponents
Future Plans
• Getting to the point where we can grow the
XComponent repository is priority #1
• XRigs, XPipes, and XComponents as web services
(SOAP/XML-RPC, WSDL, UDDI etc.)
• Getting the P2P and Grid Technology
communities input into XGrid/XJCL
• See if a P2P execution environment for
XRigs/XPipes can be shortcircuited e.g. JXTA
• Getting help to develop the XPipe reference
implementation on Sourceforge
Future Plans
• Development of commercial
implementations of XPipe integrated with
leading EAI systems (Ongoing)
• Use of SCADA tools to develop XPipe
process control and monitoring systems
• Use of UML tools to create XPipes and
XRigs using state transition diagrams
Future Plans
• Use of Animation Engineering techniques
for CAXTE tools (Computer Aided XML
Transformation Engineering)
• Digging around swarm intelligence,
hierarchy theory, complexity theory, self-assembly, bio-informatics and
nanofabrication for concepts and tools
applicable to XML transformations
In conclusion
• XPipe is simple
• Simplicity works!
• Plenty of evidence outside of XML
engineering that this approach will work
• Plenty of lore and tools from other fields of
science can be brought to bear to build
systems using the XPipe approach
Musings #1 - Debugging
• XPipe is very debugging friendly
– log2(N) time required for fault diagnosis
– “Probes” in the form of loggers, RelaxNG validators,
easily plug-inable to a pipe to watch what is going on.
– Pre/Post condition on/off switch is a useful “design by
contract” debugger
– Unit testing at Rig, Pipe and Component level allows
layer at a time re-assembly after a fault has been fixed.
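The log2(N) claim can be illustrated with a binary search over the intermediate outputs of a pipe. The sketch assumes that once a stage has corrupted the document, every later output still fails the check, and that the intermediate outputs are already available (here they are recomputed for simplicity). `first_bad_stage` and `is_ok` are hypothetical names; `is_ok` stands in for a probe such as a validator.

```python
def first_bad_stage(stages, xml_text, is_ok):
    """Return the index of the first stage whose output fails is_ok, or None."""
    # Intermediate outputs; in a real pipe these would be the files between stages.
    outputs = []
    current = xml_text
    for stage in stages:
        current = stage(current)
        outputs.append(current)
    if not outputs or is_ok(outputs[-1]):
        return None  # nothing to diagnose
    lo, hi = 0, len(outputs) - 1  # invariant: outputs[hi] is known to be bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_ok(outputs[mid]):
            lo = mid + 1  # fault was introduced after mid
        else:
            hi = mid      # fault was introduced at or before mid
    return lo


stages = [
    lambda doc: doc.replace("a", "b"),
    lambda doc: doc.replace("b", "c"),
    lambda doc: doc + "<oops>",  # the faulty stage
    lambda doc: doc.upper(),
]
print(first_bad_stage(stages, "<a/>", lambda doc: "oops" not in doc.lower()))  # 2
```

Only about log2(N) intermediate documents are inspected, which matters when each inspection is a person reading XML.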
Musings #2 – Inbetweening and
XComponent development
• Transformation analysts spec the transformation
• Only need to code new components
• Spec == XComponent or XPipe with doc, pre/post
etc. but no code
• Built in JIT-style acceptance test
• Outsource friendly and third-party market friendly
Musing #3 - Web Services
• First generation will be a total blind alley –
RPC
• Document Oriented Messaging – not Object
Oriented Messaging – the next stage in
encapsulation and loose coupling –
something like XPipe will be a prerequisite.
Musing #4 – Parametric Typing
of XComponents
• Numerous XComponents that do the same
thing, not necessarily duplication
– Space
– Time
– Infoset considerations
Musing #5 – Pre-validation
Transformation
• Killing ourselves seeking one-shot expressivity in
schema validation languages
• Many complex validations become a lot simpler if
you do some transformation(s) first
– Co-occurrence constraints
– Contextual constraints
• Clear analog with formatting (pre-flow
transformation(s) + flow)
Musing #6 – location, location,
location
• Abstraction 1: keep code and data on the
same high-speed bus – monolithic systems
• Abstraction 2: allow code to be downloaded
from the Web – sandbox required owing to
security issues
• Abstraction 3: leave the code ‘out there’ and
move the data – bandwidth issues and data
>> code
Musing #6 – location, location,
location
• Monolithic – bad (have to “install” stuff
which is very 20th century)
• Sandbox – bad (the better the sandbox the
less useful the code running in it.)
• XGrid – Design as if data pulled by the
code (easy model) but DMZ the code + data
– the only thing that flows over the firewall
is the transformed data…
Thank you
– http://xpipe.sourceforge.net