CS203 Programming Languages
Survey Project
Vesselin Diev
[email protected]
12-01-04
Topic: Static Validation of Dynamically Generated XML
Basic idea/problem: XML documents are often generated dynamically by programs, for
example XHTML¹ documents generated by interactive Web services in response to
requests from clients. In most cases there are no static guarantees that the generated
documents are valid according to a DTD schema, and many large commercial
services in fact generate invalid documents. In this survey I look into the research
that has been done in the area of providing static (compile-time) guarantees of the
validity of dynamically generated XML documents, so that run-time errors are avoided.
My survey is based on a series of papers by the BRICS (Basic Research in Computer
Science) research group at the University of Aarhus, Denmark, published in the period of
2001 to 2003. I did not find any more recent publications on this topic, which leads me to
believe that the ideas and techniques discussed here are still the latest in the area.
1. Introduction
The syntax of an XML-based language is specified using a vocabulary of elements and
attributes together with rules for constraining their use. There exists a variety of schema
languages, such as DTD, XML Schema, or DSD2, allowing the syntax to be formalized.
An XML document is valid relative to a given schema if all the syntactic requirements
specified by the schema are satisfied in the document. The language L(S) of a schema S
is the set of XML documents that are valid relative to S.
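To illustrate membership in L(S), here is a minimal sketch in Python. The toy schema format (a regular expression over child element names) and the names TOY_SCHEMA and is_valid are my own assumptions for illustration. The sketch is not a real DTD validator and it ignores attributes and character data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical toy schema: for each element name, a regular expression
# over the sequence of child element names (each name followed by a space).
TOY_SCHEMA = {
    "ul": r"(li )*",  # a ul may contain any number of li children
    "li": r"",        # an li has no element children
}


def is_valid(element, schema):
    """Return True if element and all its descendants satisfy the schema."""
    if element.tag not in schema:
        return False
    child_names = "".join(child.tag + " " for child in element)
    if re.fullmatch(schema[element.tag], child_names) is None:
        return False
    return all(is_valid(child, schema) for child in element)


print(is_valid(ET.fromstring("<ul><li>a</li><li>b</li></ul>"), TOY_SCHEMA))
print(is_valid(ET.fromstring("<ul><ul/></ul>"), TOY_SCHEMA))
```

The first document is in the language of the toy schema and the second is not, because a ul element appears where only li children are allowed.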
In recent years, there has been a growing interest in static validation of dynamically
constructed XML documents. This stems from the rapid uptake of XML as a means of
data exchange on the Web, and from the expectation that XML will serve as a basis for
future database systems. The safety of such Web and database applications is often
critical, and thus it is important to provide compile-time guarantees that they produce
XML documents that conform to given XML schemas. XHTML [4], for
example, is widely used in interactive Web services where clients are
people using browsers to interact with Web servers. Another application of dynamic
XML is in application-to-application Web services where clients are not human beings
but general programs. It is clear that XML plays an important role in information
representation and transformation of data. Yet existing general-purpose and
domain-specific languages fail to provide special support for XML transformations while
at the same time performing static validation of the output.
¹ XHTML is an XML-based language. It consists of all elements in HTML combined with the syntax of
XML. XHTML is widely used in interactive Web services.
The goal of this research is to integrate XML into general-purpose programming
languages (such as Java or C) in order to support high-level definitions of XML
transformations and thereby make Web services development easier and safer.
The report is organized as follows: In Section 2, I briefly describe the typical approaches
for dealing with dynamic XML and point out their shortcomings. Section 3 gives the
evolution of the research done by BRICS in tackling the problem. I present the sequence
of techniques developed for XML transformations and the static validation of their
output, namely <bigwig> [1], JWIG [2], and Xact [5] (the first based on the C language,
while the others are incorporated in Java). I outline the main ideas common to all
techniques and then concentrate in more detail on the latest one, Xact, in Section 4.
Special attention is given to the notion of performing a dataflow analysis of the programs
generating XML using summary graphs, the core of the static analysis technique, in
Section 5. Performance evaluation and experimental results are discussed in Section 6.
Future work suggestions and critique are included in Section 7.
2. Typical Approaches and Their Drawbacks
There are many approaches for defining XML transformations and XML dynamic
document construction. They are normally divided into techniques for general-purpose
programming languages and for domain-specific languages.
2.1 Techniques for general-purpose languages
The approaches of representing XML data as strings or DOM trees, for example in
Servlets, fit into the category of techniques for general-purpose languages. Building
XML documents by concatenating string fragments is commonly used in the presentation
layer of interactive Web services. This primitive approach does not assist the programmer
in avoiding mismatching tags or improper escaping of special characters, and it does not
support deconstruction of documents. There is no tool support for analyzing the program
at compile-time to verify that the transformation output will not fail at run-time. XML is
regarded as one homogeneous type without considering schemas. The processing is
completely independent from the schema information, so, for example, a schema may
contain the information that A elements cannot occur as children of B elements, but failed
attempts to select an A child element of a B element in a program will not be detected
until run-time.
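To make this drawback concrete, here is a minimal sketch. It is written in Python rather than as a Java Servlet, purely for brevity, and the helper names render_list and is_well_formed are hypothetical. The markup is built by string concatenation, and neither the mismatched closing tag nor the unescaped ampersand is noticed until the output is parsed at run-time.

```python
import xml.etree.ElementTree as ET


def render_list(items):
    """Build a list by string concatenation (hypothetical helper).

    Deliberately buggy: items are not escaped and the closing tag does
    not match the opening one. Nothing flags this before the code runs.
    """
    out = "<ul>"
    for item in items:
        out += "<li>" + item + "</li>"
    out += "</ol>"  # bug: should be </ul>
    return out


def is_well_formed(text):
    """Return True if text parses as XML; the check happens only at run-time."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


doc = render_list(["bread", "salt & pepper"])
print(is_well_formed(doc))
```

The check reports that the generated document is not even well-formed, let alone valid, and it can only do so after the program has produced it.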
2.2 Techniques for domain-specific languages
Domain-specific languages are tailor-made for specialized classes of tasks, such as XML
transformations. The predominant domain-specific language for XML is XSLT.
Although the formal expressive power of these languages does not exceed that of
general-purpose languages, their advantages are typically considered to be the high levels
of abstraction with language constructs and customized syntax that closely matches the
concepts in the problem domain, and the specialized analyses for reasoning about the
behavior of programs. Yet, schemas for the input and output languages are generally
ignored by domain-specific languages and essentially no type-checking is performed.
3. BRICS Contribution
The first project for developing a language specifically designed for interactive Web
services was called <bigwig> (WIG stands for Web Interactive Generator). <bigwig> is a
high-level domain-specific language based on C. HTML is introduced in <bigwig> as a
built-in data type and HTML templates as first-class values that may be computed and
stored in variables. An HTML template may contain named gaps that are placeholders for
other HTML templates or for text strings. Such gaps may at runtime be plugged with
concrete values. Since those values may themselves contain further gaps, this is a highly
dynamic mechanism for building documents. A flow-sensitive type checker ensures that
documents are used in a consistent manner: for instance, it is checked at compile-time
that in every document being shown to the client, the input form fields in the document
match the server code that receives the values. Furthermore, a unique program analysis
ensures at compile time that only valid HTML 4.01 documents are ever constructed and
sent to the clients at runtime. <bigwig> did not consider XML values yet, and did not
support deconstruction of dynamically generated documents.
The JWIG programming language, the descendant of <bigwig>, is a Java-based high-level language for development of interactive Web services. It contains an advanced
session model, a flexible mechanism for dynamic construction of XML documents, in
particular XHTML, and a powerful API for simplifying use of the HTTP protocol and
many other aspects of Web service programming. To support program development,
JWIG provides a unique suite of highly specialized program analyses that at compile time
verify for a given program that no runtime errors can occur while building documents or
receiving form input, and that all documents being shown are valid according to the
document type definition for XHTML 1.0.
Xact is an extension of JWIG that provides a full integration of XML values and highly
flexible operations for XML transformation into an existing high-level language, Java,
together with static guarantees of type safety of the transformations. Unlike JWIG, it also
allows for deconstruction of XML documents, using XPath² [3] for selecting fragments of
XML values.
The approach for providing static guarantees in all these techniques is based on dataflow
analysis using Summary Graphs, rather than type systems. Dataflow analysis works on
control-flow graphs, which allows flow sensitivity, whereas type systems typically work
on abstract syntax trees.
² XPath is a domain-specific language for selecting certain fragments of an XML tree.
4. More on Xact Specifics
Xact is based on the notion of XML templates, which are just sequences of XML trees
containing named gaps.
Formally, XML templates are derived by xml in the following grammar:
xml  : str                        (character data)
     | <name atts> xml </name>    (element)
     | <[g]>                      (template gap)
     | xml xml
atts : name="str"                 (attribute constant)
     | name=[g]                   (attribute gap)
     | atts atts
     | ε
A special plug function is used to construct new templates by inserting existing templates
into gaps in other existing templates. Two XML template values are combined into a
larger XML template using the plug(exp1; g; exp2) operation. The result of the expression
is a copy of exp1 with all occurrences of gaps named g replaced by copies of exp2. Gaps
can appear as either template gaps or attribute gaps. Both templates and strings can be
plugged into template gaps. Only strings can be plugged into attribute gaps.
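The behaviour of plug can be sketched with a small model. This is my own illustration in Python, not the Xact API: the classes Gap and Element, the tuple representation of templates, and the to_xml printer are all assumptions, and the printer does no escaping.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Gap:
    name: str


@dataclass(frozen=True)
class Element:
    name: str
    attrs: Tuple[Tuple[str, Union[str, Gap]], ...]
    children: Tuple["Node", ...]


# A template is modelled as a tuple of nodes (strings, gaps, elements).
Node = Union[str, Gap, Element]


def _plug_attr(key, val, gap, value):
    """Fill an attribute gap; only strings are allowed there."""
    if isinstance(val, Gap) and val.name == gap:
        if not isinstance(value, str):
            raise TypeError("only strings can be plugged into attribute gaps")
        return (key, value)
    return (key, val)


def plug(template, gap, value):
    """Return a copy of template with every gap named `gap` replaced by value."""
    filler = (value,) if isinstance(value, str) else tuple(value)
    result = []
    for node in template:
        if isinstance(node, Gap) and node.name == gap:
            result.extend(filler)
        elif isinstance(node, Element):
            attrs = tuple(_plug_attr(k, v, gap, value) for k, v in node.attrs)
            result.append(Element(node.name, attrs, plug(node.children, gap, value)))
        else:
            result.append(node)
    return tuple(result)


def to_xml(template):
    """Print a template using the <[g]> gap notation (no escaping)."""
    parts = []
    for node in template:
        if isinstance(node, str):
            parts.append(node)
        elif isinstance(node, Gap):
            parts.append("<[%s]>" % node.name)
        else:
            attrs = "".join(
                " %s=[%s]" % (k, v.name) if isinstance(v, Gap) else ' %s="%s"' % (k, v)
                for k, v in node.attrs
            )
            parts.append("<%s%s>%s</%s>" % (node.name, attrs, to_xml(node.children), node.name))
    return "".join(parts)


wrapper = (Element("ul", (("class", Gap("kind")),), (Gap("items"),)),)
item = (Element("li", (), ("milk",)), Gap("items"))
step1 = plug(wrapper, "items", item)      # template plugged into a template gap
step2 = plug(step1, "kind", "shopping")   # string plugged into an attribute gap
print(to_xml(step2))
```

Because the plugged item template itself ends with an items gap, the result still contains an open gap and can be extended by further plug operations. This is the dynamic mechanism that the analysis has to approximate.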
Smaller XML templates can be obtained by deconstructing an existing XML template
using the select and gapify operations, both based upon XPath. The select(exp; xpath)
operation extracts parts of an XML template as a new XML template. The
xpath argument is an XPath location path that, when evaluated on exp, results in a set of
nodes S addressed by the location path (i.e. the elements, attributes, and character
data sequences that the path points to). The result of the select expression is an
XML template whose roots are copies of all the subtrees of exp that are rooted at
nodes in S. The gapify(exp; xpath; g) operation is used for inserting extra gaps into an
XML template, meaning that it cuts away certain fragments of the template and inserts
gaps instead. Again, the xpath argument is an XPath location path, which when evaluated
on exp gives a set S of nodes. The result of the expression is a copy of exp where all
subtrees rooted at nodes in S are replaced by gaps named g.
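The two deconstruction operations can be approximated with Python's standard xml.etree.ElementTree module. This is only a sketch under several assumptions: ElementTree supports a limited subset of XPath and selects element nodes only, a gap is represented by a hypothetical <gap name="..."/> element because the <[g]> notation is not XML, and the function names mirror the Xact operations without being the Xact API.

```python
import copy
import xml.etree.ElementTree as ET


def select(root, path):
    """Return copies of all subtrees of root addressed by path."""
    copies = []
    for node in root.findall(path):
        clone = copy.deepcopy(node)
        clone.tail = None  # trailing text belongs to the parent, not the subtree
        copies.append(clone)
    return copies


def gapify(root, path, gap):
    """Return a copy of root where each subtree addressed by path
    is replaced by a gap element named `gap` (hypothetical encoding)."""
    result = copy.deepcopy(root)
    selected = result.findall(path)
    # Snapshot the nodes first so the tree is not modified while iterating.
    for parent in list(result.iter()):
        for index, child in enumerate(list(parent)):
            if any(child is node for node in selected):
                placeholder = ET.Element("gap", {"name": gap})
                placeholder.tail = child.tail  # keep text that followed the subtree
                parent[index] = placeholder
    return result


doc = ET.fromstring("<ul><li>milk</li><li>eggs</li></ul>")
items = select(doc, "./li")
print([ET.tostring(item, encoding="unicode") for item in items])
holes = gapify(doc, "./li", "item")
print(ET.tostring(holes, encoding="unicode"))
```

Both functions work on copies, so the original document is left unchanged. This matches the description of select and gapify as producing new template values.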
The operation that performs the validation is called analyze. It takes a DTD³
schema as an argument and verifies that the template is valid with respect to that schema.
There are a few other XML template operations in the interface whose
descriptions are omitted here. The major ones needed to convey the main ideas were
described above.
³ DTD is a schema language for XML documents defining for each element the required and permitted
child elements and attributes.
5. Static Validation Using Summary Graphs
The focus of the analysis is to find and report errors at compile-time that might occur in
the program at runtime. The analysis must therefore be conservative, meaning that it
approximates the values of a program in such a way that if no errors are found, then
no errors can occur in the program at runtime. Because the solution is approximate,
the analysis may sometimes report false errors, that is, errors that cannot
actually occur. This is not a major problem in practice, however, because false
errors turn out to be rare. The analysis ensures certain
properties, such as that all plug operations are consistent and that certain
template values are valid with respect to a given schema.
The figure below shows how for a given program an abstract flow graph, representing the
flow of template values, is constructed.
Then, a string analysis [7] is performed and the result is a set of regular languages, which
conservatively approximate the string values that occur in the program. The flow graph
and the regular languages are used for the summary graph analysis phase, which produces
summary graphs that conservatively approximate the template values that occur in the
program. Finally, the summary graphs are used for checking that the program is
well-behaved. The analysis result is either an indication that no errors can occur in the
program at run-time, or a list of errors that will or might occur in the program at run-time.
Given a program and all DTD schemas it refers to, a fixed number of sets and functions
are used by all summary graphs that occur during the analysis: The sets E, A, and G
contain the element names, attribute names, and gap names, respectively, that occur in
the program and in the schemas. Let $N_E$, $N_A$, $N_T$, and $N_C$ be finite disjoint sets of element,
attribute, template, and chardata nodes, respectively. The set of all nodes is
$N = N_E \cup N_A \cup N_T \cup N_C$.
A summary graph SG is then a structure:
$SG = (R, T, S, P)$
where:
$R \subseteq N_E \cup N_T$ is a set of root nodes,
$T \subseteq N_T \times G \times (N_E \cup N_C \cup N_T)$ is a set of template edges,
$S : N_C \cup N_A \to REG$ is a string edge map,
$P : G \to 2^{N_A \cup N_T} \times 2^{N_A \cup N_T} \times \Gamma \times \Gamma$ is a gap presence map.
$\Gamma = 2^{\{OPEN, CLOSED\}}$ is the gap presence lattice, whose ordering is set inclusion. The set REG
is a finite family of regular languages over the Unicode alphabet and is obtained by a
separate analysis of string operations.
The language L(SG) of a summary graph SG is the set of templates that can be obtained
by unfolding the graph starting at a root node and plugging elements, templates, and
strings into gaps according to the edges. A template edge (n1; g; n2) ∈ T informally
means that n2 can be plugged into the g gaps in n1, and a string edge S(n) = L means that
every string in L may be plugged into the gap in n.
We need the gap presence map to determine where edges should be added when
modeling plug operations, to model the removal of gaps with the close operation, to
detect when plug operations may fail because the specified gaps have already been
closed, and to model and check XPath evaluations. Given that P(g) = (p1; p2; p3; p4), let
open(P(g)) = p1, removed(P(g)) = p2, tgaps(P(g)) = p3, agaps(P(g)) = p4. Informally, the
open and removed components specify which nodes may contain open or removed g
gaps, and tgaps and agaps describe the presence of template gaps and attribute gaps,
respectively. The value OPEN means that the gaps may be present, and CLOSED means
that they may be absent.
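The gap presence lattice is small enough to write out in full. The following Python sketch is my own illustration: the names GAMMA, leq, and join are assumptions, and the use of join to merge information from two branches is the standard dataflow convention rather than something spelled out in this report.

```python
from itertools import combinations

OPEN, CLOSED = "OPEN", "CLOSED"

# All four elements of the powerset of {OPEN, CLOSED}, smallest first.
GAMMA = [frozenset(c) for r in range(3) for c in combinations((OPEN, CLOSED), r)]


def leq(a, b):
    """Lattice ordering: set inclusion."""
    return a <= b


def join(a, b):
    """Least upper bound under set inclusion, i.e. set union."""
    return a | b


# On one branch the gap may be present, on the other it may be absent.
then_branch = frozenset({OPEN})
else_branch = frozenset({CLOSED})
merged = join(then_branch, else_branch)
print(sorted(merged))
```

After the merge the analysis only knows that the gap may be present or absent, which is the top element of the lattice.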
The unfolding of a summary graph is defined by the following mapping:
$$\mathit{unfold}(SG) = \{\, d \in \mathit{XML} \mid \exists n \in R : SG \vdash n \Rightarrow_{\mathit{unfold}} d \,\}, \quad SG = (R, T, S, P),$$
where the unfolding relation $\Rightarrow_{\mathit{unfold}}$ is defined recursively in the structure of the
summary graph by an inference system. For the sake of conciseness I do not give the
inference system in this report.
Unfolding a summary graph according to the inference rules gives a set of XML
templates. This set is called the language of the summary graph, and it is denoted L(SG).
More precisely, the language of a summary graph is defined as follows:
L(SG) = unfold(SG)
An example of a summary graph is illustrated below:
The language of this summary graph is the set of ul lists with zero or more li items that
each contain a string from some language L. The boxes represent element nodes,
rounded boxes are template nodes, the circle is a chardata node, and the dots represent
potentially open template gaps.
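A rough feel for unfolding can be had from the following Python sketch of a graph with the same language as the ul example. It is a deliberately simplified model and not the inference system from the Xact papers. Element nodes are folded into the template fragments, the language L is replaced by a small finite set of strings, and a hypothetical "empty" node stands for a gap that is left open and later removed. The depth bound exists only to make the enumeration terminate, and all node names are my own.

```python
import re
from itertools import product

# Template nodes with their fragments; gaps are written <[g]>.
FRAGMENTS = {
    "list": "<ul><[items]></ul>",
    "item": "<li><[text]></li><[items]>",
    "empty": "",  # hypothetical node modelling a gap that stays unfilled
}
ROOTS = {"list"}
# Template edges (n1, g, n2): n2 may be plugged into the g gaps of n1.
TEMPLATE_EDGES = {
    ("list", "items", "item"),
    ("list", "items", "empty"),
    ("item", "items", "item"),
    ("item", "items", "empty"),
    ("item", "text", "chars"),
}
# String edge map for the chardata node; a finite stand-in for the language L.
STRINGS = {"chars": {"a", "b"}}

GAP = re.compile(r"<\[(\w+)\]>")


def unfold(node, depth):
    """Enumerate the strings obtainable from node with at most `depth` nested plugs."""
    if node in STRINGS:
        return set(STRINGS[node])
    pieces = GAP.split(FRAGMENTS[node])  # literal, gap, literal, ..., literal
    literals, gaps = pieces[0::2], pieces[1::2]
    choices = []
    for gap in gaps:
        options = set()
        if depth > 0:
            for source, name, target in TEMPLATE_EDGES:
                if source == node and name == gap:
                    options |= unfold(target, depth - 1)
        choices.append(sorted(options))
    results = set()
    for filling in product(*choices):
        text = literals[0]
        for plugged, literal in zip(filling, literals[1:]):
            text += plugged + literal
        results.add(text)
    return results


def unfold_graph(depth):
    """Union of the bounded unfoldings of all root nodes."""
    return set().union(*(unfold(root, depth) for root in ROOTS))


for template in sorted(unfold_graph(2)):
    print(template)
```

With a depth bound of 2 the enumeration yields the empty list and the two one-item lists. Raising the bound adds longer lists, which mirrors the "zero or more li items" reading of the graph.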
The dataflow analysis associates a summary graph with every XML variable and
expression at every point of execution of the program. Again, since the analysis is
conservative, unfold(SG) contains all XML templates that may occur at a given program
point at run-time.
6. Performance and Results
Once the summary graphs are constructed, the analyze operation is invoked to do the
validation of the summary graphs for the XML expression relative to the DTD. If the
analyzer does not find an error it is guaranteed that all XML templates at that point are
valid at run-time. If the analyzer finds an error, then a corresponding and meaningful
message is reported. (Note: for a good example on Xact code and its analyzer
performance, see the end of the ppt presentation file.)
The theoretical worst-case complexity of the entire construction and analysis is shown to
be $O(n^8)$, where n is the size of the program and the associated DTD schemas.
Nevertheless, the analysis seems to be efficient for practical purposes as the following
benchmarks show:
Here, Lines is the total number of lines of the program, Input is the number of lines of the
DTD schemas associated with the inputs, Output is the number of lines of the DTD
schemas associated with the output XML, Time is reported in seconds, and False Errors
means errors reported when in fact there were none (the drawback of the conservative
approximation approach). The important thing to note is that the analyzer found all
actual errors, which were subsequently corrected.
7. Critique and Suggestions for Future Work
In order to make Xact useful for larger applications it is necessary to provide a better
runtime implementation of the operations. One goal would be to develop a special
data structure for template values that is optimal with respect to both time and space
cost. Another would be to further refine the approximation computed with summary graphs.
Bringing down the worst-case complexity of the analysis will likely require more
research.
Another future improvement could be to allow casting of XML templates to the language
of a more complex and expressive schema language than DTD. This would allow a more
precise validation of externally computed XML templates at runtime, and would,
therefore, potentially increase precision of the analysis.
It would also be worth exploring if the Xact system can be made into a stand-alone
package for XML transformations in Java. This may include some changes to the
command line interface, to the error reporting, and to the implementation of the analyzer
in particular.
8. References
[1] Claus Brabrand, Anders Møller, and Michael I. Schwartzbach. The <bigwig> project.
ACM Transactions on Internet Technology, 2(2), 2002.
[2] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Static
analysis for dynamic XML. Technical Report RS-02-24, BRICS, May 2002. Presented at
Programming Language Technologies for XML, PLAN-X, October 2002.
[3] James Clark and Steve DeRose. XML path language, November 1999. W3C
Recommendation. http://www.w3.org/TR/xpath.
[4] Steven Pemberton et al. XHTML 1.0: The extensible hypertext markup language,
January 2000. W3C Recommendation. http://www.w3.org/TR/xhtml1.
[5] Christian Kirkegaard, Anders Møller, and Michael I. Schwartzbach. Static analysis of
XML transformations in Java, May 2003. BRICS report series RS-03-19.
[6] Christian Kirkegaard. Dynamic XML processing with static validation, May 2003.
Master’s thesis, University of Aarhus, Denmark.
[7] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Precise
analysis of string expressions. In Proc. International Static Analysis Symposium,
SAS ’03, Jun