XML: The Universal Solvent, version 2.0

George K. Thiruvathukal, Ph.D.
Loyola University C.S. Department
and Nimkathana Corporation
What is XML?
• eXtensible Markup Language
• a meta-language (in some ways)
– a language used to define languages
• successor to SGML
– SGML addresses the same needs and is aimed
primarily at document management
• successor to HTML 4.0
– XHTML is HTML + XML
Issues
• File formats abound
– word processors, spreadsheets, presentations
– conflicting software versions - 6.0, 7.0
– need for backward and forward compatibility
– Office 11 purports to be XML-friendly
• HTML documents
– numerous specifications
– numerous implementations
– extensibility and compatibility problematic
– most browsers cannot display “proper” HTML
To extend or not to extend…
• HTML 4 put web “standards” to a severe test
– widespread agreement that further extensions may
cause HTML to come apart at the seams.
– HTML leaves too many stones unturned. Well-formedness
criteria are not sufficient for real world
information modeling.
– The framework has already become too dependent on
desktop browser implementations.
• HTML does not address the notion of separating
concerns terribly well. It purports to be
– An information model
– A display model
• To do either of these well in one framework is
difficult. To do both is almost unthinkable.
So what else is it good for?
• In distributed systems, protocols proliferate much
like rabbits and fruit flies.
– “wire” protocols (e.g. FTP, HTTP, IIOP, COM,
DCOM) each define their own ad hoc text or binary
encoding for messages
– XML can be used as an all purpose encoder
– XML can also be used to perform validation
• Ad hoc data definition
– When relational and object modeling just isn’t enough.
Context: XML in Business
• B2B systems demand interoperability and
seamless data exchange
– XML supersedes EDI (Electronic Data Interchange)
without the complexity.
– XML also supersedes the arcane X.500 standard with
its ASN.1 encoding framework.
– XML may supersede database and RPC technologies.
XML-RPC and SOAP are gaining strength.
• The major client of interest is no longer the
desktop; palm, cellular, and embedded are the
words for the new millennium
– XML allows the information to be modeled
ubiquitously but rendered ad hoc using the XSLT
transformation framework.
The Road Ahead
• XML Well-Formedness Criteria
• XML Schema Definition Languages
– Document Type Definition (DTD)
– XML/Schema
• DOM – Document Object Model
• SAX – Simple API for XML
• XPath – XML Path Language
• XSLT – XSL Transformations (XSL is the Extensible
Stylesheet Language)
There is much to learn with XML technology. This
presentation focuses on the DTD concept and a
preview of the other component frameworks.
Document Structure
Here are two ways to encode the notion of a
complex number, which has a real part and an
imaginary part, both of which are floating-point
numbers: as child elements or as attributes. The
attribute form can also be written as a single
empty-element tag.
<complex>
<real>3.0</real>
<imaginary>4.0</imaginary>
</complex>
<complex real="3.0" imaginary="4.0">
</complex>
<complex real="3.0" imaginary="4.0"/>
You could come up with many valid encodings.
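As a rough illustration of how an application might cope with both encodings, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name complex_from_xml and the two sample strings are assumptions made for this example, not part of the original presentation.

```python
import xml.etree.ElementTree as ET

def complex_from_xml(text):
    """Read either encoding of <complex> and return a Python complex number."""
    node = ET.fromstring(text)
    if node.find("real") is not None:
        # Element-based encoding: the values are the text of child elements.
        real = node.findtext("real")
        imaginary = node.findtext("imaginary")
    else:
        # Attribute-based encoding: the values are attributes of <complex>.
        real = node.get("real")
        imaginary = node.get("imaginary")
    return complex(float(real), float(imaginary))

as_elements = "<complex><real>3.0</real><imaginary>4.0</imaginary></complex>"
as_attributes = '<complex real="3.0" imaginary="4.0"/>'

# Both encodings carry the same information.
print(complex_from_xml(as_elements) == complex_from_xml(as_attributes))
```

The application, not XML, decides that the two forms mean the same thing. This is one reason to agree on a single encoding and write it down in a schema.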
Well-formedness
• Tags must be balanced.
– <tag> … </tag>
• Tags must be properly nested.
– <tag1><tag2> … </tag2></tag1> is ok
– <tag1><tag2> … </tag1></tag2> is not ok
• Tags must always be matched with a closing tag.
– <hr> from HTML would not be ok on its own!
– <hr></hr> is ok
– <hr/> allows a tag to be opened and closed in one
construction.
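The rules above can be tried out with any non-validating XML parser. The sketch below uses Python's standard xml.etree.ElementTree, which raises ParseError on input that is not well-formed. The helper name is_well_formed is an assumption for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = [
    "<tag1><tag2>text</tag2></tag1>",  # properly nested
    "<tag1><tag2>text</tag1></tag2>",  # improperly nested
    "<hr>",                            # opened but never closed
    "<hr></hr>",                       # explicitly closed
    "<hr/>",                           # opened and closed in one construction
]

for sample in samples:
    print(sample, is_well_formed(sample))
```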
So XML is…
• In part XML is about clarifying the use of tags
and what constitutes well-formedness.
• An XML document can be freely defined
using any tags you see fit.
• The minimum requirement is to follow the
well-formedness rules.
• But to tap into XML fully, it is strongly
recommended to use a schema.
– XML’s innate schema definition language is called the
DTD (Document Type Definition).
Typical XML Document
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE Friends SYSTEM "phonebook.dtd">
<Friends>
<Friend nick_name="george" number="773-555-1234">
George K. Thiruvathukal
</Friend>
<Friend nick_name="nina" number="773-555-8899">
Nina Wilfred
</Friend>
<Friend nick_name="w" number="701-555-1111">
George W. Bush
</Friend>
<Friend nick_name="tc" number="888-555-9999">
Thomas W. Christopher
</Friend>
<Friend nick_name="chandra" number="888-555-9999">
Alok N. Choudhary
</Friend>
</Friends>
• Prologue: the <?xml … ?> line (ISO-8859-1, i.e. Latin-1, encoding)
• DTD: named by the <!DOCTYPE … > line (contained in phonebook.dtd)
• Root: the Friends element
Typical DTD (phonebook.dtd)
<?xml version="1.0" encoding="iso-8859-1"?>
<!ELEMENT Friends (Friend+)>
<!ELEMENT Friend (#PCDATA)>
<!ATTLIST Friend nick_name CDATA #REQUIRED>
<!ATTLIST Friend number CDATA #REQUIRED>
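Python's standard library does not ship a validating parser, so the sketch below hand-codes the four rules of phonebook.dtd as ordinary checks. It is an illustration of what the DTD asks for, not a replacement for a validating parser (it does not, for example, reject undeclared attributes). The function name check_phonebook and the sample documents are assumptions for this example.

```python
import xml.etree.ElementTree as ET

def check_phonebook(root):
    """Return a list of problems; an empty list means the rules hold."""
    problems = []
    if root.tag != "Friends":
        problems.append("root element must be Friends")
    friends = list(root)
    if not friends:
        # <!ELEMENT Friends (Friend+)> requires at least one Friend.
        problems.append("Friends needs at least one Friend")
    for child in friends:
        if child.tag != "Friend":
            problems.append("unexpected element: " + child.tag)
            continue
        if len(child) > 0:
            # <!ELEMENT Friend (#PCDATA)> forbids nested elements.
            problems.append("Friend may hold only character data")
        for name in ("nick_name", "number"):
            # Both attributes are declared #REQUIRED.
            if name not in child.attrib:
                problems.append("Friend is missing " + name)
    return problems

good = ET.fromstring(
    '<Friends><Friend nick_name="nina" number="773-555-8899">'
    "Nina Wilfred</Friend></Friends>"
)
bad = ET.fromstring(
    '<Friends><Friend nick_name="w">George W. Bush</Friend></Friends>'
)
print(check_phonebook(good))
print(check_phonebook(bad))
```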
Structure
• Document Type Definition (DTD)
– can be used to define structure
• <!ELEMENT name (content-model)>
– name is the name of the tag you want to define
– content-model basically describes what can
appear in this tag.
• We’ll discuss the various possibilities
#PCDATA
• <!ELEMENT day (#PCDATA)>
– This defines a tag, day, that can only contain
parsed-character data
– Basically--no embedding of tags is possible in
this case
– <day>29</day> is ok.
– <day><day>29</day></day> is not ok.
Nesting
• <!ELEMENT date (month, day, year)>
– This defines a permissible ordering.
– A date is a month, followed by a day, followed
by a year.
– You’d have to define “month” and “year”
in the same way as “day” in the previous slide
<date> <month> April </month>
<day>29</day><year>1967
</year></date>
Regular Expressions
• If you look at the DTD definition closely,
you will find that the rules basically
resemble regular expressions
– a, b: concatenation
– a | b: selection
– a*: zero or more occurrences of a
– a+: one or more occurrences == a, a*
– a?: zero or one occurrences. This amounts to
the tag being “optional”
– (): grouping
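To make the resemblance concrete, the following Python sketch writes a content model as an ordinary regular expression and matches it against the sequence of child tag names. The encoding (each tag name followed by a space), the helper children_match, and the two pattern constants are assumptions chosen for this illustration. Real validating parsers do not necessarily work this way.

```python
import re
import xml.etree.ElementTree as ET

def children_match(element, pattern):
    """Return True if the sequence of child tag names matches the regex."""
    # Encode the children as "tag1 tag2 ... ", each name followed by a space.
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(pattern, sequence) is not None

# Hand translations of DTD content models into regular expressions.
DATE_MODEL = r"month day year "   # (month, day, year)
APPTBOOK_MODEL = r"(appt )*"      # (appt*)

date = ET.fromstring(
    "<date><month>April</month><day>29</day><year>1967</year></date>"
)
swapped = ET.fromstring(
    "<date><day>29</day><month>April</month><year>1967</year></date>"
)
print(children_match(date, DATE_MODEL))
print(children_match(swapped, DATE_MODEL))
```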
More about #PCDATA
• When you want to make it possible to
intersperse any character data with the tags,
you can use #PCDATA as part of the DTD
rule.
• To make an appointment tag:
• <!ELEMENT appt (#PCDATA | date)*>
– In XML, a mixed content model must list #PCDATA
first and end with *.
Example
<appt>
<date><month>August</month>
<day>15</day><year>1999</year></date>
Meet Joe Black
</appt>
Without the use of #PCDATA in the DTD rule,
“Meet Joe Black” cannot appear here.
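For a sense of how mixed content looks to a program, here is a small Python sketch with xml.etree.ElementTree. In that API, character data that follows a child element is exposed as the child's tail. The appointment document is written with the month, day, year order from the earlier date rule, and the variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

appt = ET.fromstring(
    "<appt>"
    "<date><month>August</month><day>15</day><year>1999</year></date>"
    "Meet Joe Black"
    "</appt>"
)

date = appt.find("date")
# Text that follows </date> inside <appt> is stored as the tail of <date>.
note = (date.tail or "").strip()
print(date.findtext("month"), date.findtext("day"), date.findtext("year"))
print(note)
```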
Expanding Westward
<!ELEMENT apptbook (appt*)>
This allows you to start thinking about having
multiple appointments:
<apptbook>
<appt> <date> date here </date>
Appointment #1 </appt>
<appt> <date> date here </date>
Appointment #2 </appt>
</apptbook>
Attributes
• Annotating a tag without always having to
embed tags. You have seen them before:
– <a href="http://www.jhpc.org">Hypertext</a>
• For the appt element, it would be nice to have a
priority attribute (low, medium, or high). You
can do this with an enumerated attribute type.
– <!ATTLIST appt priority
(low | medium | high) #IMPLIED>
New Appointments
<appt priority="low">
<date> usual date stuff </date> Mozart
</appt>
<appt priority="high">
<date> usual date stuff </date> Helmut Epp
</appt>
<appt> <date> … </date> person
…</appt>
It is not required to use all (or any) of the
attributes that are defined in a DTD rule, unless
they are declared #REQUIRED.
#REQUIRED vs. #IMPLIED
• #REQUIRED can be added to an attribute
definition to require its use.
– <!ATTLIST appt priority (low | medium | high)
#REQUIRED>
• #IMPLIED means optional. No default value is
supplied; the application decides how to treat a
missing attribute.
– <!ATTLIST appt priority (low | medium | high)
#IMPLIED>
• Every attribute definition must end with a default
declaration: #REQUIRED, #IMPLIED, #FIXED with a
value, or a quoted default value.
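The difference between a required and an optional enumerated attribute can be sketched in a few lines of Python. This is a hand-written check, not what a validating parser does internally. The names ALLOWED_PRIORITIES and priority_ok are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# The values listed in the enumeration (low | medium | high).
ALLOWED_PRIORITIES = {"low", "medium", "high"}

def priority_ok(appt, required=False):
    """Check the priority attribute of an <appt> element.

    required=True mimics #REQUIRED; required=False mimics #IMPLIED.
    """
    value = appt.get("priority")
    if value is None:
        # A missing attribute is acceptable only when it is optional.
        return not required
    return value in ALLOWED_PRIORITIES

with_priority = ET.fromstring('<appt priority="high">Helmut Epp</appt>')
without_priority = ET.fromstring("<appt>person</appt>")
bad_priority = ET.fromstring('<appt priority="urgent">Mozart</appt>')

print(priority_ok(with_priority, required=True))
print(priority_ok(without_priority, required=True))
print(priority_ok(without_priority, required=False))
print(priority_ok(bad_priority))
```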
CDATA
• This allows you to specify an attribute that can
hold any character data.
– <!ATTLIST appt topic CDATA #IMPLIED>
– <!ATTLIST a href CDATA #IMPLIED>
• Character data is often needed, as in the case of
the href attribute for HTML anchors.
– <a href="http://xyz.com">Hypertext</a>
• Attributes are essential for giving “options” to
tags. Otherwise, tag structure would be overly
complicated.
Tip of the Iceberg
• Element expressions can get quite complex. We
have covered some commonly useful cases.
• Enumerations and CDATA are the attribute types
most commonly used in practice. Others exist:
– ID - to make a unique value document wide (anchors
in HTML really need this!)
– IDREF, IDREFS - used to point to an ID
– ENTITY, ENTITIES - references to external unparsed
(typically binary) entities
ID and IDREF
• ID is an identifier that is unique within a
document.
<!ATTLIST person id ID #REQUIRED>
This says that the person tag has an attribute id that is required and
must be unique within the document.
• IDREF is a way of referring to an ID (in the same
document) with the assurance that the reference is
valid.
<!ATTLIST person manager IDREF #IMPLIED>
This says that the person tag has an optional attribute manager that must
refer to an ID that exists within the same document.
ID and IDREF example
<person id="gkt" manager="bill">
George K. Thiruvathukal
</person>
<person id="bill">
Bill Clinton
</person>
This example shows how to build a linked document
structure. “gkt” and “bill” are unique identifiers in this
document. A validating parser will check these.
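The two checks a validating parser performs here (every id is unique, every manager points at an existing id) can be approximated by hand. In the Python sketch below, the enclosing staff element and the function name check_links are assumptions added so that the two person elements form a single document.

```python
import xml.etree.ElementTree as ET

def check_links(root):
    """Return a list of ID/IDREF problems; an empty list means all is well."""
    problems = []
    seen = set()
    # First pass: collect ids and report duplicates (the ID rule).
    for person in root.iter("person"):
        pid = person.get("id")
        if pid is None:
            problems.append("person without id")
        elif pid in seen:
            problems.append("duplicate id: " + pid)
        else:
            seen.add(pid)
    # Second pass: every manager must name a collected id (the IDREF rule).
    for person in root.iter("person"):
        manager = person.get("manager")
        if manager is not None and manager not in seen:
            problems.append("dangling manager reference: " + manager)
    return problems

staff = ET.fromstring(
    "<staff>"
    '<person id="gkt" manager="bill">George K. Thiruvathukal</person>'
    '<person id="bill">Bill Clinton</person>'
    "</staff>"
)
print(check_links(staff))
```

Two passes are used so that a reference may point forward to a person defined later in the document, as "gkt" does here.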
What’s a DTD good for, anyway?
• DTD is only useful in the presence of a
validating parser.
– XML parser - checks well-formedness of XML
only
– Validating XML parser - checks the above +
the DTD
• DTD can only be used to define structure
(syntax) and not semantics.
Semantics Problem
• Semantics is a language-theory word for meaning
(behind the syntax)
– Biggest language challenge: Don’t accept programs
(syntactically) that don’t have semantics to support the
syntax:
• example: *1 = *1 + *2 is valid C code that isn’t necessarily
meaningful.
• The XML DTD does not address semantics. It is
mostly an infrastructure for defining syntax and
document structure.
A Semantics Example
<date>
<month>April</month>
<day>31</day>
<year>1999</year>
</date>
You really have little control of the semantics
when you use #PCDATA. On the other hand, it is
much easier to write the rules.
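April 31 passes any structural check, so the semantic check has to live in application code. The Python sketch below is one way to do that with the standard datetime module. The MONTHS list (English names only) and the function name is_real_date are assumptions for this example.

```python
import datetime
import xml.etree.ElementTree as ET

# English month names only; other spellings would need their own table.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def is_real_date(date_element):
    """Return True if <date> names a day that exists on the calendar."""
    try:
        month = MONTHS.index(date_element.findtext("month").strip()) + 1
        day = int(date_element.findtext("day"))
        year = int(date_element.findtext("year"))
        # datetime.date raises ValueError for impossible dates.
        datetime.date(year, month, day)
        return True
    except (ValueError, TypeError, AttributeError):
        # Unknown month name, non-numeric text, or a missing element.
        return False

april_31 = ET.fromstring(
    "<date><month>April</month><day>31</day><year>1999</year></date>"
)
april_29 = ET.fromstring(
    "<date><month>April</month><day>29</day><year>1967</year></date>"
)
print(is_real_date(april_31))
print(is_real_date(april_29))
```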
Would attributes work?
• Maybe but in a limited way
• <!ATTLIST date month
(1 | 2 | 3 | … | 12 | January | … | December) #IMPLIED>
<!ATTLIST date day
(1 | 2 | 3 | … | 31) #IMPLIED>
<!ATTLIST date year (1 | 2 | … | 65536) #IMPLIED>
• Problem (?)
An Infinitary Situation
• How many different ways to encode months?
– US: 1, 2, 3, …, 12; January…December
– Mexico: Enero…Diciembre
• Days?
– The problem is: Some months have 28, 29, 30, or 31
days. You would have to define a LOT of rules.
• Years
– Big problem. In the future, you’d have a Y64K
problem.
Setting Realistic Expectations
• XML does a good job:
– lets you define your own structures
– lets you validate those structures
• Elaborate semantics require processing to be done
outside of the XML framework.
• You can hack an XML parser to support ad hoc
validation rules.
– Structure (in fact) is most of the story. It makes it
possible for you to more easily write the ad hoc rules,
knowing that what you are considering will minimally
be well-formed.
Demonstration
• Using the Python language for XML
processing
– Validating Phonebook Parser and Application
– Outlining Tool
– Transforming an XML document to HTML
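The demonstration code itself is not reproduced in these slides. As a stand-in, here is a minimal Python sketch of the third item, turning a phonebook document into an HTML list with the standard library only. The function name phonebook_to_html and the output layout are assumptions for this example and are not the author's demo. XSLT would be the usual tool for this job.

```python
import html
import xml.etree.ElementTree as ET

def phonebook_to_html(xml_text):
    """Render each Friend as an HTML list item of the form 'name: number'."""
    root = ET.fromstring(xml_text)
    items = []
    for friend in root.findall("Friend"):
        # Escape the text so that characters such as & and < stay harmless.
        name = html.escape((friend.text or "").strip())
        number = html.escape(friend.get("number", ""))
        items.append("<li>%s: %s</li>" % (name, number))
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

page = phonebook_to_html(
    '<Friends><Friend nick_name="nina" number="773-555-8899">'
    "Nina Wilfred</Friend></Friends>"
)
print(page)
```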
