XML: The Universal Solvent, version 2.0
George K. Thiruvathukal, Ph.D.
Loyola University C.S. Department and Nimkathana Corporation

What is XML?
• eXtensible Markup Language
• a meta-language (in some ways)
  – a language used to define languages
• successor to SGML
  – a language aimed at addressing the same needs, primarily document management
• successor to HTML 4.0
  – XHTML is HTML + XML

Issues
• File formats abound
  – word processors, spreadsheets, presentations
  – conflicting software versions (6.0, 7.0)
  – need for backward and forward compatibility
  – Office 11 purports to be XML-friendly
• HTML documents
  – numerous specifications
  – numerous implementations
  – extensibility and compatibility problematic
  – most browsers cannot display “proper” HTML

To extend or not to extend…
• HTML 4 put web “standards” to a severe test
  – There is widespread agreement that further extensions may cause HTML to come apart at the seams.
  – HTML leaves too many stones unturned. Well-formedness criteria are not sufficient for real-world information modeling.
  – The framework has already become too dependent on desktop browser implementations.
• HTML does not separate concerns terribly well. It purports to be
  – an information model
  – a display model
• To do either of these well in one framework is difficult. To do both is almost unthinkable.

So what else is it good for?
• In distributed systems, protocols proliferate much like rabbits and fruit flies.
  – “Wire” protocols (e.g. FTP, HTTP, IIOP, COM, DCOM) each use their own ad hoc format, textual or binary, to encode messages
  – XML can be used as an all-purpose encoder
  – XML can also be used to perform validation
• Ad hoc data definition
  – When relational and object modeling just isn’t enough.

Context: XML in Business
• B2B systems demand interoperability and seamless data exchange
  – XML supersedes EDI (Electronic Data Interchange) without the complexity.
  – XML also supersedes the arcane X.500 standard with its ASN.1 encoding framework.
  – XML may supersede database and RPC technologies. XML-RPC and SOAP are gaining strength.
• The major client of interest is no longer the desktop; palm, cellular, and embedded are the words for the new millennium
  – XML allows the information to be modeled once for every client but rendered ad hoc using the XSLT transformation framework.

The Road Ahead
• XML Well-Formedness Criteria
• XML Schema Definition Languages
  – Document Type Definition (DTD)
  – XML Schema
• DOM – Document Object Model
• SAX – Simple API for XML
• XPath – XML Path Language
• XSLT – XSL Transformations (the XML styling and transformation language)
There is much to learn with XML technology. This presentation focuses on the DTD concept and a preview of the other component frameworks.

Document Structure
Here are two ways to encode the notion of a complex number, which has a real part and an imaginary part, both of which are floating-point numbers. The attribute form is shown twice: with a closing tag and in self-closing shorthand.

<complex>
  <real>3.0</real>
  <imaginary>4.0</imaginary>
</complex>

<complex real="3.0" imaginary="4.0">
</complex>

<complex real="3.0" imaginary="4.0"/>

You could come up with many valid encodings.

Well-formedness
• Tags must be balanced.
  – <tag> … </tag>
• Tags must be properly nested.
  – <tag1><tag2> … </tag2></tag1> is ok
  – <tag1><tag2> … </tag1></tag2> is not ok
• Tags must always be matched with a closing tag.
  – <hr> from HTML would not be ok on its own!
  – <hr></hr> is ok
  – <hr/> allows a tag to be opened and closed in one construction.

So XML is…
• In part, XML is about clarifying the use of tags and what constitutes well-formedness.
• An XML document can be freely defined using any tags you see fit.
• The minimum requirement is to follow the well-formedness rules.
• But to tap into XML fully, it is strongly recommended to use a schema.
  – XML’s innate schema definition language is called the DTD (Document Type Definition).
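
The well-formedness rules can be checked mechanically. The following is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree module, which is a non-validating parser: it checks well-formedness only and knows nothing about a DTD. The helper name is_well_formed and the sample strings are illustrative, not part of the original slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for unbalanced, badly nested, or unclosed tags.
        return False


# Illustrative samples mirroring the rules on the Well-formedness slide.
samples = {
    "balanced and nested": "<tag1><tag2>x</tag2></tag1>",
    "improperly nested": "<tag1><tag2>x</tag1></tag2>",
    "unclosed hr": "<p><hr></p>",
    "explicit close": "<p><hr></hr></p>",
    "self-closing": "<p><hr/></p>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the improperly nested sample and the unclosed <hr> are rejected; everything else passes, whether or not any schema exists for it.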
Typical XML Document
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE Friends SYSTEM "phonebook.dtd">
<Friends>
  <Friend nick_name="george" number="773-555-1234">
    George K. Thiruvathukal
  </Friend>
  <Friend nick_name="nina" number="773-555-8899">
    Nina Wilfred
  </Friend>
  <Friend nick_name="w" number="701-555-1111">
    George W. Bush
  </Friend>
  <Friend nick_name="tc" number="888-555-9999">
    Thomas W. Christopher
  </Friend>
  <Friend nick_name="chandra" number="888-555-9999">
    Alok N. Choudhary
  </Friend>
</Friends>
• Prologue: the <?xml … ?> line (ISO-8859-1, i.e. Latin-1, encoding)
• DTD: the <!DOCTYPE … > line (the DTD is contained in phonebook.dtd)
• Root: the <Friends> element

Typical DTD (phonebook.dtd)
<?xml version="1.0" encoding="iso-8859-1"?>
<!ELEMENT Friends (Friend+)>
<!ELEMENT Friend (#PCDATA)>
<!ATTLIST Friend nick_name CDATA #REQUIRED>
<!ATTLIST Friend number CDATA #REQUIRED>

Structure
• Document Type Definition (DTD) – can be used to define structure
• <!ELEMENT name (content-model)>
  – name is the name of the tag you want to define
  – content-model describes what can appear in this tag.
• We’ll discuss the various possibilities

#PCDATA
• <!ELEMENT day (#PCDATA)>
  – This defines a tag, day, that can only contain parsed character data
  – Basically, no embedding of tags is possible in this case
  – <day>29</day> is ok.
  – <day><day>29</day></day> is not ok.

Nesting
• <!ELEMENT date (month, day, year)>
  – This defines a permissible ordering.
  – A date is a month, followed by a day, followed by a year.
  – You’d have to define “month” and “year” similarly to “day” on the previous slide
<date> <month> April </month> <day>29</day><year>1967 </year></date>

Regular Expressions
• If you look at the DTD definition closely, you will find that the rules basically resemble regular expressions
  – a, b: concatenation
  – a | b: selection
  – a*: zero or more occurrences of a
  – a+: one or more occurrences of a (equivalent to a, a*)
  – a?: zero or one occurrences.
    This amounts to the tag being “optional”
  – (): grouping

More about #PCDATA
• When you want to make it possible to intersperse character data with the tags, you can use #PCDATA as part of the DTD rule.
• To make an appointment tag:
• <!ELEMENT appt (#PCDATA | date)*>
  – In such a mixed-content rule, #PCDATA must come first and the group must end with *.

Example
<appt>
  <date><month>August</month> <day>15</day><year>1999</year></date>
  Meet Joe Black
</appt>
Without the use of #PCDATA in the DTD rule, “Meet Joe Black” cannot appear here.

Expanding Westward
<!ELEMENT apptbook (appt*)>
This allows you to start thinking about having multiple appointments:
<apptbook>
  <appt> <date> date here </date> Appointment #1 </appt>
  <appt> <date> date here </date> Appointment #2 </appt>
</apptbook>

Attributes
• Attributes annotate a tag without always having to embed tags. You have seen them before:
  – <a href="http://www.jhpc.org">Hypertext</a>
• In the appt element, it would be nice to have a priority attribute (low, medium, or high). You can do this with an enumeration.
  – <!ATTLIST appt priority (low | medium | high) #IMPLIED>

New Appointments
<appt priority="low"> <date> usual date stuff </date> Mozart </appt>
<appt priority="high"> <date> usual date stuff </date> Helmut Epp </appt>
<appt> <date> … </date> person … </appt>
It is not required to use all (or any) of the attributes that are defined in a DTD rule, unless they are declared #REQUIRED.

#REQUIRED vs. #IMPLIED
• #REQUIRED can be added to an attribute definition to require its use.
  – <!ATTLIST appt priority (low | medium | high) #REQUIRED>
• #IMPLIED means optional: the DTD supplies no default, and the application decides what to do when the attribute is absent.
  – <!ATTLIST appt priority (low | medium | high) #IMPLIED>
• Every attribute definition must end with a default declaration: #REQUIRED, #IMPLIED, #FIXED with a value, or a literal default value.

CDATA
• This allows you to specify an attribute that can hold any character data.
  – <!ATTLIST appt topic CDATA #IMPLIED>
  – <!ATTLIST a href CDATA #IMPLIED>
• Character data is often needed, as in the case of the href attribute for HTML anchors.
  – <a href="http://xyz.com">Hypertext</a>
• Attributes are essential for giving “options” to tags.
  Otherwise, tag structure would be overly complicated.

Tip of the Iceberg
• Element expressions can get quite complex. We have covered some commonly useful cases.
• Enumerations and CDATA are the attribute types most commonly used in practice. Others exist:
  – ID: a value that is unique document-wide (anchors in HTML really need this!)
  – IDREF, IDREFS: used to point to an ID
  – ENTITY, ENTITIES: references to unparsed external entities (e.g. binary data)

ID and IDREF
• ID is an identifier that is unique within a document.
  <!ATTLIST person id ID #REQUIRED>
  This says that the person tag has an attribute id that is required and must be unique within the document.
• IDREF is a way of referring to an ID (in the same document) with the assurance that the reference is valid.
  <!ATTLIST person manager IDREF #IMPLIED>
  This says that the person tag has an attribute manager that must refer to an ID within the same document.

ID and IDREF example
<person id="gkt" manager="bill"> George K. Thiruvathukal </person>
<person id="bill"> Bill Clinton </person>
This example shows how to build a linked document structure. “gkt” and “bill” are unique identifiers in this document. A validating parser will check these.

What’s a DTD good for, anyway?
• A DTD is only useful in the presence of a validating parser.
  – XML parser: checks well-formedness of XML only
  – Validating XML parser: checks the above plus the DTD
• A DTD can only be used to define structure (syntax) and not semantics.

Semantics Problem
• Semantics is a language-theory word for meaning (behind the syntax)
  – Biggest language challenge: don’t accept programs that are syntactically fine but have no semantics to support the syntax
    • example: *1 = *1 + *2 is syntactically valid C that isn’t necessarily meaningful.
• The XML DTD does not address semantics. It is mostly an infrastructure for defining syntax and document structure.

A Semantics Example
<date> <month>April</month> <day>31</day> <year>1999</year> </date>
April has only 30 days, yet a DTD would accept this date. You really have little control of the semantics when you use #PCDATA.
On the other hand, it is much easier to write the rules.

Would attributes work?
• Maybe, but in a limited way
• <!ATTLIST date month (1 | 2 | 3 | … | 12 | January | … | December) #REQUIRED>
  <!ATTLIST date day (1 | 2 | 3 | … | 31) #REQUIRED>
  <!ATTLIST date year (1 | 2 | … | 65536) #REQUIRED>
• Problem (?)

An Infinitary Situation
• How many different ways to encode months?
  – US: 1, 2, 3, …, 12; January … December
  – Mexico: Enero … Diciembre
• Days?
  – The problem is: months have 28, 29, 30, or 31 days. You would have to define a LOT of rules.
• Years
  – Big problem. In the future, you’d have a Y64K problem.

Setting Realistic Expectations
• XML does a good job:
  – lets you define your own structures
  – lets you validate those structures
• Elaborate semantics require processing to be done outside of the XML framework.
• You can hack an XML parser to support ad hoc validation rules.
  – Structure (in fact) is most of the story. It makes it easier to write the ad hoc rules, knowing that what you are considering will minimally be well-formed.

Demonstration
• Using Python Language for XML Processing
  – Validating Phonebook Parser and Application
  – Outlining Tool
  – Transforming an XML document to HTML
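
The idea of layering ad hoc semantic rules on top of a structurally sound document can be sketched in a few lines of Python. This is not the demonstration code from the talk. It is a minimal illustration that assumes the date structure from the Nesting slide (month, day, year as child elements), English month names, and the standard-library xml.etree.ElementTree and datetime modules. The names MONTHS and check_date are hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

# Assumption: months are spelled out in English, as in the slide examples.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def check_date(date_elem):
    """Return None if the <date> element names a real calendar day,
    otherwise a short error message."""
    month_text = date_elem.findtext("month", "").strip()
    if month_text not in MONTHS:
        return "unknown month: %r" % month_text
    try:
        day = int(date_elem.findtext("day", "").strip())
        year = int(date_elem.findtext("year", "").strip())
    except ValueError:
        return "day and year must be integers"
    try:
        # datetime.date knows month lengths and leap years.
        datetime.date(year, MONTHS.index(month_text) + 1, day)
    except ValueError as exc:
        return str(exc)
    return None


good = ET.fromstring(
    "<date><month>April</month><day>29</day><year>1967</year></date>")
bad = ET.fromstring(
    "<date><month>April</month><day>31</day><year>1999</year></date>")

print(check_date(good))
print(check_date(bad))
```

The parser guarantees the document is well-formed, so check_date only has to look at three known child elements. The calendar rules (28, 29, 30, or 31 days) live in ordinary code instead of in a long list of DTD rules.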