CS203 Programming Languages Survey Project
Vesselin Diev
[email protected]
12-01-04

Topic: Static Validation of Dynamically Generated XML

Basic idea/problem: XML documents are often generated dynamically by programs, for example XHTML¹ documents generated by interactive Web services in response to requests from clients. In most cases there are no static guarantees that the generated documents are valid according to a DTD schema, and many large commercial services in fact generate invalid documents. In this survey I look into the research on providing static (compile-time) guarantees of the validity of dynamically generated XML documents, so that run-time errors are avoided. My survey is based on a series of papers by the BRICS (Basic Research in Computer Science) research group at the University of Aarhus, Denmark, published between 2001 and 2003. I did not find any more recent publications on this topic, which leads me to believe that the ideas and techniques discussed here are still the latest in the area.

1. Introduction

The syntax of an XML-based language is specified using a vocabulary of elements and attributes together with rules constraining their use. A variety of schema languages, such as DTD, XML Schema, or DSD2, allow the syntax to be formalized. An XML document is valid relative to a given schema if it satisfies all the syntactic requirements specified by the schema. The language L(S) of a schema S is the set of XML documents that are valid relative to S.

In recent years, there has been a growing interest in static validation of dynamically constructed XML documents. This stems from the rapid uptake of XML as a means of data exchange on the Web, and from the expectation that XML will serve as a basis for future database systems.
The safety of such Web and database applications is often critical, and thus it is important to provide compile-time guarantees that they produce XML documents conforming to given XML schemas. XHTML [4], for example, is widely used in interactive Web services where clients are people using browsers to interact with Web servers. Another application of dynamic XML is in application-to-application Web services, where the clients are not human beings but general programs. It is clear that XML plays an important role in information representation and transformation of data. Yet existing general-purpose and domain-specific languages fail to both provide special support for XML transformations and perform static validation of the output. The goal of this research is to integrate XML into general-purpose programming languages (such as Java or C) in order to support high-level definitions of XML transformations and thereby make Web services development easier and safer.

¹ XHTML is an XML-based language. It consists of all elements in HTML combined with the syntax of XML. XHTML is widely used in interactive Web services.

The report is organized as follows. In Section 2, I briefly describe the typical approaches for dealing with dynamic XML and point out their shortcomings. Section 3 traces the evolution of the research done by BRICS in tackling the problem. I present the sequence of techniques developed for XML transformations and the static validation of their output, namely <bigwig> [1], JWIG [2], and Xact [5] (the first is based on the C language, while the others are incorporated in Java). I outline the main ideas common to all techniques and then concentrate in more detail on the latest one, Xact, in Section 4. Section 5 gives special attention to the dataflow analysis of XML-generating programs using summary graphs, which is the core of the static analysis technique.
Performance evaluation and experimental results are discussed in Section 6. Critique and suggestions for future work are included in Section 7.

2. Typical Approaches and Their Drawbacks

There are many approaches to defining XML transformations and constructing XML documents dynamically. They are normally divided into techniques for general-purpose programming languages and techniques for domain-specific languages.

2.1 Techniques for general-purpose languages

Representing XML data as strings or DOM trees, for example in Servlets, falls into the category of techniques for general-purpose languages. Building XML documents by concatenating string fragments is common in the presentation layer of interactive Web services. This primitive approach does not help the programmer avoid mismatched tags or improper escaping of special characters, and it does not support deconstruction of documents. There is no tool support for analyzing the program at compile time to verify that the transformation output will not fail at run-time. XML is regarded as one homogeneous type without considering schemas. The processing is completely independent of the schema information. For example, a schema may state that A elements cannot occur as children of B elements, but a failed attempt to select an A child element of a B element in a program will not be detected until run-time.

2.2 Techniques for domain-specific languages

Domain-specific languages are tailor-made for specialized classes of tasks, such as XML transformations. The predominant domain-specific language for XML is XSLT. Although the formal expressive power of these languages does not exceed that of general-purpose languages, their advantages are typically considered to be the high level of abstraction, with language constructs and customized syntax that closely match the concepts in the problem domain, and the specialized analyses for reasoning about the behavior of programs.
Yet schemas for the input and output languages are generally ignored by domain-specific languages, and essentially no type-checking is performed.

3. BRICS Contribution

The first project for developing a language specifically designed for interactive Web services was called <bigwig> (WIG stands for Web Interactive Generator). <bigwig> is a high-level domain-specific language based on C. HTML is introduced in <bigwig> as a built-in data type, and HTML templates are first-class values that may be computed and stored in variables. An HTML template may contain named gaps that are placeholders for other HTML templates or for text strings. At runtime such gaps may be plugged with concrete values. Since those values may themselves contain further gaps, this is a highly dynamic mechanism for building documents. A flow-sensitive type checker ensures that documents are used in a consistent manner: for instance, it checks at compile time that in every document shown to the client, the input form fields match the server code that receives the values. Furthermore, a unique program analysis ensures at compile time that only valid HTML 4.01 documents are ever constructed and sent to clients at runtime. <bigwig> did not yet consider XML values, and it did not support deconstruction of dynamically generated documents.

The JWIG programming language, the descendant of <bigwig>, is a Java-based high-level language for the development of interactive Web services. It contains an advanced session model, a flexible mechanism for dynamic construction of XML documents, in particular XHTML, and a powerful API that simplifies the use of the HTTP protocol and many other aspects of Web service programming.
To support program development, JWIG provides a unique suite of highly specialized program analyses that verify at compile time, for a given program, that no runtime errors can occur while building documents or receiving form input, and that all documents shown are valid according to the document type definition for XHTML 1.0.

Xact is an extension of JWIG that fully integrates XML values and highly flexible operations for XML transformation into an existing high-level language, Java, with static guarantees of type safety for the transformations. Beyond what JWIG offers, it also allows deconstruction of XML documents, using XPath² [3] to select fragments of XML values.

The approach for providing static guarantees in all these techniques is based on dataflow analysis using summary graphs, rather than on type systems. Dataflow analysis works on control-flow graphs, which allows flow sensitivity, whereas type systems typically work on abstract syntax trees.

² XPath is a domain-specific language for selecting certain fragments of an XML tree.

4. More on Xact Specifics

Xact is based on the notion of XML templates, which are sequences of XML trees containing named gaps. Formally, XML templates are derived from xml in the following grammar:

    xml  : str                       (character data)
         | <name atts> xml </name>   (element)
         | <[g]>                     (template gap)
         | xml xml
    atts : name="str"                (attribute constant)
         | name=[g]                  (attribute gap)
         | atts atts
         | ε

A special plug function constructs new templates by inserting existing templates into gaps in other existing templates. Two XML template values are combined into a larger XML template using the plug(exp1; g; exp2) operation. The result of the expression is a copy of exp1 with all occurrences of gaps named g replaced by copies of exp2. Gaps can appear as either template gaps or attribute gaps. Both templates and strings can be plugged into template gaps. Only strings can be plugged into attribute gaps.
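The plug operation described above can be illustrated with a small Java sketch. This is not the Xact API: the class `TemplateSketch`, its node types, and the helpers `text`, `gap`, `element`, and `render` are hypothetical names introduced only for illustration. Attributes, attribute gaps, and character escaping are left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of XML templates (not the Xact API).
// Attributes and attribute gaps are omitted to keep the sketch short.
class TemplateSketch {
    interface Node { }

    static final class Text implements Node {
        final String value;
        Text(String value) { this.value = value; }
    }

    static final class Gap implements Node {
        final String name;
        Gap(String name) { this.name = name; }
    }

    static final class Element implements Node {
        final String name;
        final List<Node> children;
        Element(String name, List<Node> children) {
            this.name = name;
            this.children = children;
        }
    }

    static Node text(String s) { return new Text(s); }
    static Node gap(String g) { return new Gap(g); }
    static Node element(String name, Node... children) {
        return new Element(name, Arrays.asList(children));
    }

    // plug(template, g, value): a copy of template in which every gap named g
    // is replaced by value. Nodes are immutable, so sharing stands in for copying.
    // Gaps inside value itself are left untouched.
    static List<Node> plug(List<Node> template, String g, List<Node> value) {
        List<Node> out = new ArrayList<>();
        for (Node n : template) {
            if (n instanceof Gap && ((Gap) n).name.equals(g)) {
                out.addAll(value);
            } else if (n instanceof Element) {
                Element e = (Element) n;
                out.add(new Element(e.name, plug(e.children, g, value)));
            } else {
                out.add(n);
            }
        }
        return out;
    }

    // Serialize a template, writing gaps in the <[g]> notation of the grammar.
    static String render(List<Node> template) {
        StringBuilder sb = new StringBuilder();
        for (Node n : template) {
            if (n instanceof Text) {
                sb.append(((Text) n).value);
            } else if (n instanceof Gap) {
                sb.append("<[").append(((Gap) n).name).append("]>");
            } else {
                Element e = (Element) n;
                sb.append('<').append(e.name).append('>')
                  .append(render(e.children))
                  .append("</").append(e.name).append('>');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Node> list = Arrays.asList(element("ul", gap("items")));
        List<Node> item = Arrays.asList(element("li", gap("text")), gap("items"));
        List<Node> one = plug(plug(list, "items", item), "text", Arrays.asList(text("one")));
        System.out.println(render(one)); // prints <ul><li>one</li><[items]></ul>
    }
}
```

Because the plugged value may itself contain a gap with the same name (here `items`), repeated plugging can grow a list one item at a time, which is the dynamic behavior that the static analysis has to approximate.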
Smaller XML templates can be obtained by deconstructing an existing XML template using the special select and gapify operations, both based on XPath.

The select(exp; xpath) operation splits an XML template into a decomposed XML template. The xpath argument is an XPath location path that, when evaluated on exp, results in a set S of nodes addressed by the location path (i.e. the elements, attributes, and character data sequences that the path points to). The result of the select expression is an XML template whose roots are copies of all the subtrees of exp that are rooted at nodes in S.

The gapify(exp; xpath; g) operation inserts extra gaps into an XML template, meaning that it cuts away certain fragments of the template and inserts gaps in their place. Again, the xpath argument is an XPath location path, which when evaluated on exp gives a set S of nodes. The result of the expression is a copy of exp in which all subtrees rooted at nodes in S are replaced by gaps named g.

The operation that performs the validation is called analyze. It takes a DTD³ schema as an argument, and the analyzer verifies that the template is valid with respect to that schema. The interface implements a few other XML template operations whose description is omitted here. The major ones needed to convey the main ideas were described above.

³ DTD is a schema language for XML documents that defines, for each element, the required and permitted child elements and attributes.

5. Static Validation Using Summary Graphs

The focus of the analysis is to find and report, at compile time, errors that might occur in the program at runtime. The analysis must therefore be conservative, meaning that it approximates the values of a program in such a way that if no errors are found, then no errors will occur in the program at runtime.
Because the analysis gives an approximate solution, it may sometimes report false errors, meaning that an error is reported when in fact there is none. This is not a major problem, however, because it turns out not to happen often in practice. In addition, certain properties must be ensured, for example that all plug operations are consistent and that certain template values are valid with respect to a given schema.

The figure below shows how, for a given program, an abstract flow graph representing the flow of template values is constructed. Then a string analysis [7] is performed. Its result is a set of regular languages that conservatively approximate the string values that occur in the program. The flow graph and the regular languages are used in the summary graph analysis phase, which produces summary graphs that conservatively approximate the template values that occur in the program. Finally, the summary graphs are used for checking that the program is well-behaved. The analysis result is either an indication that no errors can occur in the program at run-time, or a list of errors that will or might occur in the program at run-time.

Given a program and all DTD schemas it refers to, a fixed number of sets and functions are used by all summary graphs that occur during the analysis. The sets $E$, $A$, and $G$ contain the element names, attribute names, and gap names, respectively, that occur in the program and in the schemas. Let $N_{\mathcal{E}}$, $N_{\mathcal{A}}$, $N_{\mathcal{T}}$, and $N_{\mathcal{C}}$ be finite disjoint sets of element, attribute, template, and chardata nodes, respectively. The set of all nodes is $N = N_{\mathcal{E}} \cup N_{\mathcal{A}} \cup N_{\mathcal{T}} \cup N_{\mathcal{C}}$. A summary graph $SG$ is then a structure $SG = (R; T; S; P)$ where:

    $R \subseteq N_{\mathcal{E}} \cup N_{\mathcal{T}}$ is a set of root nodes,
    $T \subseteq N_{\mathcal{T}} \times G \times (N_{\mathcal{E}} \cup N_{\mathcal{C}} \cup N_{\mathcal{T}})$ is a set of template edges,
    $S : N_{\mathcal{C}} \cup N_{\mathcal{A}} \to REG$ is a string edge map,
    $P : G \to 2^{N_{\mathcal{A}} \cup N_{\mathcal{T}}} \times 2^{N_{\mathcal{A}} \cup N_{\mathcal{T}}} \times \Gamma \times \Gamma$ is a gap presence map.

$\Gamma = 2^{\{\mathrm{OPEN}, \mathrm{CLOSED}\}}$ is the gap presence lattice, whose ordering is set inclusion.
The set $REG$ is a finite family of regular languages over the Unicode alphabet and is obtained by a separate analysis of string operations. The language $L(SG)$ of a summary graph $SG$ is the set of templates that can be obtained by unfolding the graph, starting at a root node and plugging elements, templates, and strings into gaps according to the edges. A template edge $(n_1; g; n_2) \in T$ informally means that $n_2$ can be plugged into the $g$ gaps in $n_1$, and a string edge $S(n) = L$ means that every string in $L$ may be plugged into the gap in $n$.

We need the gap presence map to determine where edges should be added when modeling plug operations, to model the removal of gaps with the close operation, to detect when plug operations may fail because the specified gaps have already been closed, and to model and check XPath evaluations. Given that $P(g) = (p_1; p_2; p_3; p_4)$, let $\mathit{open}(P(g)) = p_1$, $\mathit{removed}(P(g)) = p_2$, $\mathit{tgaps}(P(g)) = p_3$, and $\mathit{agaps}(P(g)) = p_4$. Informally, the open and removed components specify which nodes may contain open or removed $g$ gaps, and tgaps and agaps describe the presence of template gaps and attribute gaps, respectively. The value OPEN means that the gaps may be present, and CLOSED means that they may be absent.

The unfolding of a summary graph is defined by the following mapping:

    $\mathit{unfold}(SG) = \{\, d \in \mathrm{XML} \mid \exists n \in R : SG \vdash n \Rightarrow_{\mathit{unfold}} d \,\}$, where $SG = (R; T; S; P)$,

and the unfolding relation $\Rightarrow_{\mathit{unfold}}$ is defined recursively over the structure of the summary graph by an inference system. For the sake of conciseness I do not give the inference system in this report. Unfolding a summary graph according to the inference rules gives a set of XML templates. This set is called the language of the summary graph, and it is denoted $L(SG)$.
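To make the idea of unfolding more concrete, the following Java sketch enumerates part of the language of a very simplified summary graph. It is an illustration under several assumptions, not the inference system from the papers: element nodes are folded into the constant text of template nodes, finite string sets stand in for the regular languages in REG, the gap presence map P is reduced to a set of gap names that may be left unplugged (and are then dropped), and a depth bound keeps the enumeration finite. The class `SummaryGraphSketch` and all of its members are hypothetical names.

```java
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, much-simplified stand-in for a summary graph SG = (R; T; S; P).
class SummaryGraphSketch {
    static final Pattern GAP = Pattern.compile("<\\[(\\w+)\\]>");

    final Set<String> roots = new LinkedHashSet<>();          // R: root nodes
    final Map<String, String> templates = new HashMap<>();    // template nodes with their constant text
    final Map<String, List<String>> edges = new HashMap<>();  // T: template edges, keyed by "node/gap"
    final Map<String, Set<String>> strings = new HashMap<>(); // S: finite sets standing in for REG
    final Set<String> optionalGaps = new HashSet<>();         // crude stand-in for P

    void edge(String from, String gap, String to) {
        edges.computeIfAbsent(from + "/" + gap, k -> new ArrayList<>()).add(to);
    }

    // Templates reachable from some root using at most `depth` plug steps per path.
    Set<String> unfold(int depth) {
        Set<String> out = new TreeSet<>();
        for (String r : roots) out.addAll(unfoldNode(r, depth));
        return out;
    }

    private Set<String> unfoldNode(String node, int depth) {
        if (strings.containsKey(node)) return strings.get(node); // chardata node
        return expand(templates.get(node), node, depth);         // template node
    }

    // Fill the first gap in `text` in every allowed way, then recurse on the rest.
    private Set<String> expand(String text, String node, int depth) {
        Matcher m = GAP.matcher(text);
        if (!m.find()) return Collections.singleton(text);
        String gap = m.group(1);
        Set<String> fills = new LinkedHashSet<>();
        if (optionalGaps.contains(gap)) fills.add("");           // gap left unplugged
        if (depth > 0) {
            for (String target : edges.getOrDefault(node + "/" + gap, Collections.emptyList())) {
                fills.addAll(unfoldNode(target, depth - 1));
            }
        }
        Set<String> out = new LinkedHashSet<>();
        for (String rest : expand(text.substring(m.end()), node, depth)) {
            for (String fill : fills) {
                out.add(text.substring(0, m.start()) + fill + rest);
            }
        }
        return out;
    }

    // A list-shaped graph: a ul element whose items gap may receive li items.
    static SummaryGraphSketch listExample() {
        SummaryGraphSketch sg = new SummaryGraphSketch();
        sg.roots.add("list");
        sg.templates.put("list", "<ul><[items]></ul>");
        sg.templates.put("item", "<li><[text]></li><[items]>");
        sg.strings.put("L", new TreeSet<>(Arrays.asList("a", "b")));
        sg.edge("list", "items", "item");
        sg.edge("item", "items", "item");
        sg.edge("item", "text", "L");
        sg.optionalGaps.add("items");
        return sg;
    }

    public static void main(String[] args) {
        // prints [<ul></ul>, <ul><li>a</li></ul>, <ul><li>b</li></ul>]
        System.out.println(listExample().unfold(2));
    }
}
```

The self-edge on the `item` node is what lets a finite graph describe lists of any length. The real analysis never enumerates the language like this; it reasons about the graph directly, and the depth bound exists only so that the sketch terminates.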
More precisely, the language of a summary graph is defined as $L(SG) = \mathit{unfold}(SG)$.

An example of a summary graph is illustrated below. The language of this summary graph is the set of ul lists with zero or more li items that each contain a string from some language L. The boxes represent element nodes, the rounded boxes are template nodes, the circle is a chardata node, and the dots represent potentially open template gaps.

The dataflow analysis associates a summary graph with every XML variable and expression at every point of execution of the program. Again, since the analysis is conservative, unfold(SG) contains all XML templates that may occur at a program point at run-time.

6. Performance and Results

Once the summary graphs are constructed, the analyze operation is invoked to validate the summary graph for the XML expression relative to the DTD. If the analyzer does not find an error, it is guaranteed that all XML templates at that point are valid at run-time. If the analyzer finds an error, a corresponding and meaningful message is reported. (Note: for a good example of Xact code and its analyzer performance, see the end of the PPT presentation file.)

The theoretical worst-case complexity of the entire construction and analysis is shown to be $O(n^8)$, where n is the size of the program and the associated DTD schemas. Nevertheless, the analysis seems to be efficient for practical purposes, as the following benchmarks show. Here, Lines is the total number of lines of the program, Input is the number of lines of the DTD schemas associated with the inputs, Output is the number of lines of the DTD schemas associated with the output XML, Time is reported in seconds, and False Errors counts errors reported when in fact there were none (the drawback of the conservative approximation approach). The important thing to note is that the analyzer found all actual errors, which were subsequently corrected.

7. Critique and Suggestions for Future Work

To make Xact useful for larger applications, a better runtime implementation of the operations is needed. One goal would be to develop a special data structure for template values that is efficient with respect to both time and space. Another would be to further refine the approximation computed with summary graphs. Bringing down the worst-case complexity of the analysis will likely take time and further research.

Another future improvement could be to allow casting of XML templates to the language of a schema language more expressive than DTD. This would allow a more precise validation of externally computed XML templates at runtime and would, therefore, potentially increase the precision of the analysis.

It would also be worth exploring whether the Xact system can be made into a stand-alone package for XML transformations in Java. This may require some changes to the command-line interface, to the error reporting, and to the implementation of the analyzer in particular.

8. References

[1] Claus Brabrand, Anders Møller, and Michael I. Schwartzbach. The <bigwig> project. ACM Transactions on Internet Technology, 2(2), 2002.
[2] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Static analysis for dynamic XML. Technical Report RS-02-24, BRICS, May 2002. Presented at Programming Language Technologies for XML, PLAN-X, October 2002.
[3] James Clark and Steve DeRose. XML path language, November 1999. W3C Recommendation. http://www.w3.org/TR/xpath.
[4] Steven Pemberton et al. XHTML 1.0: The extensible hypertext markup language, January 2000. W3C Recommendation. http://www.w3.org/TR/xhtml1.
[5] Christian Kirkegaard, Anders Møller, and Michael I. Schwartzbach. Static analysis of XML transformations in Java, May 2003. BRICS report series RS-03-19.
[6] Christian Kirkegaard. Dynamic XML processing with static validation, May 2003. Master's thesis, University of Aarhus, Denmark.
[7] Aske Simon Christensen, Anders Møller, and Michael I. Schwartzbach. Precise analysis of string expressions. In Proc. International Static Analysis Symposium, SAS '03, Jun