XML Schema Integration

Resources: Louise Lane and Kalpdrum Passi,
Sanjay Madria and Mukesh Mohania, “A
Model for XML Schema Integration”, and my
research in Fall 2001 with Dr. Madria
Contents
What is XML
Data Integration
Why business applications use XML
What is XML Schema
Different ways to integrate XML data
XML Schema Integration
XML Namespaces
Phases in Schema Integration
XML Schema Data Model
Graphical representation of the model
Contents contd..
Conflicts resolution
Integration phase
Construction of Global schema
Advantages
Disadvantages
Conclusion
What is XML
XML is a markup language for documents containing
structured information.
A markup language is a mechanism to identify structures
in a document.
XML documents are self-describing, so XML provides a
platform-independent means of describing data and can
therefore carry data from one platform to another.
XML documents can be created and used by applications.
Data Integration
E-commerce applications use data from different sources,
and this data needs to be integrated. A mediated schema is
created to represent a particular application domain, and the
data sources are mapped as views over the mediated schema.
Why Business applications use XML
Business applications need to exchange data with one
another.
The data should be independent of its representation and
of the platform.
XML is also used when two or more organizations merge.
When organizations merge, their documents must
interoperate, which can be achieved through XML
integration.
XML Schema
XML Schema is the schema language recommended by the
W3C as the standard for validating documents.
XML Schema has greater expressive power than DTDs
for the purpose of data exchange and integration
from various sources of data.
Different ways to integrate XML data
• Integrating XML documents
• Mapping of local schemas to global/integrated
schema if the global schema is known, or Querying
the data to obtain the required global schema.
• Integrating XML Schemas
Extracting Schema from XML Documents
Minimal spanning graphs can be extracted from different
documents, and the schema can be constructed from these
graphs.
Heuristic rules are applied to the resulting spanning graphs
to construct the schema.
The paper “Re-engineering Structures from Web
Documents” by Chuang-Hue, Ee-Peng, and Wee-Keong deals
with constructing a DTD schema for given XML documents.
Complexities in integrating XML Documents
1. Need to extract the schema from the document.
2. Integrate the schemas obtained or perform mapping
from the individual schema documents to the global
schema if the global schema is already present.
3. Parse the XML documents and integrate the data
according to the global schema. Querying on XML
documents can be done to obtain the integrated
document.
Tukwila Data Integration System
The Tukwila data integration system uses a mediated schema
to integrate data from different sources.
The user poses a query over the mediated schema, and the
data integration system reformulates the query over the
data sources and executes it.
Tukwila uses a query reformulator and optimizer to
query large amounts of data efficiently. The MiniCon algorithm
is used to map the query from the mediated schema to the
data sources.
It uses an x-scan operator that can query streaming XML
data.
Tukwila x-scan operator
Query techniques such as XML-QL and XQL need the complete
XML document to be downloaded before it can be queried.
Tukwila x-scan operator contd..
Tukwila X-scan matches regular path expression patterns from
the query, returning results in pipelined fashion as the data
streams across the network.
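The pipelined behaviour described above can be illustrated with a rough sketch. This is not the Tukwila x-scan operator itself: it is a simplified analogy that uses Python's standard-library incremental parser and matches one fixed path rather than regular path expressions. The function name stream_matches and the sample data are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

def stream_matches(source, path):
    """Yield the text of each element whose path from the root equals `path`.

    Results are produced while the input is still being read, so the whole
    document never has to be available before the first match is returned.
    """
    stack = []  # tag names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == path:
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()  # drop content that is no longer needed

# Hypothetical sample input; BytesIO stands in for a network stream.
doc = b"""<gs_equipment>
  <machine type="tow_truck"><serial_number>123456145</serial_number></machine>
  <machine type="baggage_handler"><serial_number>FRD6754</serial_number></machine>
</gs_equipment>"""

serials = list(stream_matches(io.BytesIO(doc),
                              ["gs_equipment", "machine", "serial_number"]))
```

Because stream_matches is a generator, a consumer can start processing the first serial number before the rest of the stream has arrived.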
XML Schema Integration
The automated integration of XML schemas is beneficial to
both the traditional forms of view integration and database
integration.
An integrated schema forms the basis for a valid query
language over a particular set of XML documents.
Because the schemas to be integrated currently validate a set
of existing XML documents, data integrity and continued
document delivery are chief concerns of the integration
process.
XML Namespace
XML Schema requires the use of namespaces to uniquely
identify schema structures (elements, attributes,
datatypes, etc.).
The name of each structure is prefaced by a namespace
prefix that identifies the namespace in which the structure
is defined.
A practical example of schema integration is when two
companies merge.
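To illustrate how namespaces keep identically named structures from two companies apart, the sketch below parses two tiny documents and splits each element name into its namespace URI and local name. The helper split_qname and the inline documents are assumptions for illustration; the namespace URIs are the ones used in the sample documents.

```python
import xml.etree.ElementTree as ET

def split_qname(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

doc1 = '<gs_equipment xmlns="http://www.GSE1example.org"><machine/></gs_equipment>'
doc2 = '<gs_equipment xmlns="http://www.GSE2example.org"><machine/></gs_equipment>'

names1 = [split_qname(e.tag) for e in ET.fromstring(doc1).iter()]
names2 = [split_qname(e.tag) for e in ET.fromstring(doc2).iter()]

# Same local names, different namespaces: the two "machine" elements are
# distinct structures until the integration process relates them.
same_local = [n[1] for n in names1] == [n[1] for n in names2]
same_qualified = names1 == names2
```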
Documents and schemas of the companies that merge
<?xml version="1.0" ?>
<gs_equipment
    xmlns="http://www.GSE1example.org"
    xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
    xsi:schemaLocation="http://www.GSE1example.org GSE1.xsd">
  <machine type="baggage_handler">
    <supplier>Air to Ground</supplier>
    <serial_number>FRD6754</serial_number>
    <service_agreement>
      <expiry_date>01-01-2006</expiry_date>
    </service_agreement>
    <service_hours>345</service_hours>
  </machine>
  <location>
    <airport>Vancouver</airport>
    <terminal>6A</terminal>
  </location>
</gs_equipment>
<?xml version="1.0" ?>
<gs_equipment
    xmlns="http://www.GSE2example.org"
    xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance"
    xsi:schemaLocation="http://www.GSE2example.org GSE2.xsd">
  <placement>
    <airport>Winnipeg</airport>
    <terminal>main</terminal>
  </placement>
  <machine type="tow_truck">
    <serial_number>123456145</serial_number>
    <vendor>Quick as a Jet GSE</vendor>
    <service_agreement>QJ-TT-123456145September 2003</service_agreement>
    <service_hours>1090.75</service_hours>
  </machine>
</gs_equipment>
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    targetNamespace="http://www.GSE1example.org"
    elementFormDefault="qualified"
    xmlns:GSE1="http://www.GSE1example.org">
  <xsd:element name="gs_equipment">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="GSE1:machine" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="GSE1:location" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="machine">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="supplier" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="serial_number" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="GSE1:service_agreement" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="service_hours" type="xsd:integer" minOccurs="0" maxOccurs="1"/>
      </xsd:sequence>
      <xsd:attribute name="type" use="required">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="service_agreement">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="location">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    targetNamespace="http://www.GSE2example.org"
    elementFormDefault="qualified"
    xmlns:GSE2="http://www.GSE2example.org">
  <xsd:element name="gs_equipment">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="GSE2:placement" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="GSE2:machine" minOccurs="0" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="placement">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="machine">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="serial_number" type="xsd:positiveInteger" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="service_agreement" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </xsd:all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
An object-oriented data model called XSDM (XML Schema
Data Model) is defined.
A three-layered architecture consisting of pre-integration,
comparison and integration is used for the integration.
A global schema must meet the following criteria:
completeness, minimality and understandability.
Optionality of elements is expanded (for example, a required
element may become optional) to meet boundary restrictions.
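One way to read the expansion of optionality is sketched below. The rule used here (take the smallest minOccurs and the largest maxOccurs, and make an element optional when it is missing from one local schema) is an assumption that happens to agree with the sample global schema; it is not taken verbatim from the paper. The name merge_occurs is hypothetical.

```python
UNBOUNDED = float("inf")  # stands in for maxOccurs="unbounded"

def merge_occurs(a, b):
    """Combine two (minOccurs, maxOccurs) pairs into one that accepts both.

    Pass None for a schema in which the element does not occur at all.
    """
    present = [o for o in (a, b) if o is not None]
    if not present:
        raise ValueError("element is absent from both schemas")
    lo = min(o[0] for o in present)
    hi = max(o[1] for o in present)
    if len(present) < 2:
        lo = 0  # absent in one schema, so it cannot be required globally
    return lo, hi

supplier = merge_occurs((1, 1), None)         # only GSE1 declares supplier
serial_number = merge_occurs((1, 1), (0, 1))  # required in GSE1, optional in GSE2
```

Under this reading, every document that was valid against a local schema still satisfies the occurrence bounds of the global schema.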
Three Phases of integration
Pre-Integration: In this phase, element, attribute and datatype
definitions are extracted by parsing the actual schema
documents.
Comparison: In this phase, the correspondences between
elements and attributes are determined either through
semantic learning or through human interaction.
Integration: In this phase, conflicts that exist between the
corresponding elements and/or attributes, such as naming
conflicts, datatype conflicts and structural conflicts, are
resolved.
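As a rough illustration of the pre-integration phase, the sketch below parses a schema fragment and lists its element declarations. It is a minimal sketch under stated assumptions: the function extract_elements is hypothetical, it collects only element names and types, and it ignores attributes, facets and occurrence constraints that a full extraction would also need.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2000/10/XMLSchema"  # namespace used in the samples

schema_text = """<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema">
  <xsd:element name="location">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string"/>
        <xsd:element name="terminal" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>"""

def extract_elements(text):
    """Return (name, type) for every element declaration, in document order."""
    root = ET.fromstring(text)
    found = []
    for el in root.iter(f"{{{XSD}}}element"):
        name = el.get("name") or el.get("ref")
        found.append((name, el.get("type")))  # type is None for non-terminals
    return found

definitions = extract_elements(schema_text)
```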
XML Schema Data Model (XSDM)
Four basic structures are defined: Node Object, Child Object,
Datatype Object and Attribute Object.
Node Object: Represents an element, which may be either non-terminal
or terminal. Each node carries a set of structures that define
it: Name, Namespace, Attribute, Datatype, Substitution Group
Name, Child List and Node Type, which is one of six types: terminal,
sequence, choice, all, any or empty.
Child Object: Represents an element that is part of a child list. Each
child has structures that define it: Name, Namespace, Max
Occurrences and Min Occurrences.
XML Schema Data Model (XSDM) contd..
Datatype Object: Represents the datatype of elements and attributes.
The structures that define it are Name, Variety (atomic, union, list),
Kind (43 simple and derived datatypes) and Constraining Facets.
Attribute Object: Represents an attribute associated with a non-terminal
or terminal element. The structures that define an attribute are
Name, Namespace, Use, Datatype and Value (default value).
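The four XSDM structures can be pictured as plain record types. The sketch below is one possible rendering, not the authors' implementation: the class and field names are chosen here for illustration, and the defaults (for example, occurrence bounds of 1) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Datatype:
    name: str
    variety: str = "atomic"   # atomic, union or list
    kind: str = ""            # one of the simple or derived datatypes
    facets: dict = field(default_factory=dict)  # constraining facets

@dataclass
class Attribute:
    name: str
    namespace: str
    use: str = "optional"
    datatype: Optional[Datatype] = None
    value: Optional[str] = None  # default value

@dataclass
class Child:
    name: str
    namespace: str
    min_occurs: int = 1
    max_occurs: int = 1

@dataclass
class Node:
    name: str
    namespace: str
    node_type: str  # terminal, sequence, choice, all, any or empty
    attributes: List[Attribute] = field(default_factory=list)
    datatype: Optional[Datatype] = None
    substitution_group: Optional[str] = None
    children: List[Child] = field(default_factory=list)

# The GSE1 "location" element expressed in this notation.
GSE1 = "http://www.GSE1example.org"
location = Node("location", GSE1, "sequence",
                children=[Child("airport", GSE1), Child("terminal", GSE1)])
```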
Graphical Representation of XML Schemas
[Figure: graphical representation of the sample schema for GSE1]
[Figure: graphical representation of the sample schema for GSE2]
Conflict Resolution
Naming Conflicts:
Synonym naming conflict: Different names but the same definition. Solved using
substitution group names.
Homonym naming conflict: Same name but different structure. Homonym
conflicts at non-terminals are called structural conflicts, and those at terminals
are called datatype conflicts.
Conflict Resolution contd..
Datatype & scale differences:
Disjoint or incompatible datatypes – union
E.g. String, integer
Compatible datatypes – scale adjustment
E.g. Integer, float
Enumerated datatype – taking set of all the enumerations
E.g. {a,b}, {b,c} => {a,b,c}
Scale differences – constraint facet redefinition
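The datatype rules above can be sketched in a few lines. This is an illustration under assumptions: compatibility is approximated here by the XML Schema derivation chain (positiveInteger, nonNegativeInteger, integer, decimal) rather than by the scale adjustment the slide mentions, and the function names are hypothetical.

```python
# Part of the built-in derivation hierarchy: each type maps to its base type.
BASE = {
    "positiveInteger": "nonNegativeInteger",
    "nonNegativeInteger": "integer",
    "integer": "decimal",
}

def ancestors(t):
    """Return t followed by every type it is derived from."""
    chain = [t]
    while chain[-1] in BASE:
        chain.append(BASE[chain[-1]])
    return chain

def merge_datatypes(a, b):
    """Pick the wider type when one derives from the other, else a union."""
    if b in ancestors(a):
        return b
    if a in ancestors(b):
        return a
    return ("union", a, b)

def merge_enumerations(a, b):
    """Set union of two enumerations, keeping first-seen order."""
    merged = []
    for value in list(a) + list(b):
        if value not in merged:
            merged.append(value)
    return merged

service_hours = merge_datatypes("integer", "decimal")
serial_number = merge_datatypes("string", "positiveInteger")
machine_type = merge_enumerations(["baggage_handler", "boarding_stairs"],
                                  ["baggage_handler", "tow_truck"])
```

With the sample schemas as input, these results agree with the global schema: service_hours becomes decimal, serial_number becomes a union of string and positiveInteger, and the type attribute lists all three machine kinds.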
Conflict Resolution contd..
Structural Conflicts:
Type Conflicts: Terminal in one schema and non-terminal in another
schema – Add both to the global schema.
Key conflicts:
If both schemas have their individual keys, then the global schema’s key
should be a composite of both keys.
If an element is declared as a key in one schema and as a non-key in the other
schema, complete knowledge of the data present in the documents is
required.
If the same element is declared as a key in both schemas, a prefix can
be added to the keys to make the key elements globally unique.
Integration phase
1. Constructing correspondences table
2. Constructing dependencies table
The correspondences table contains information about
the corresponding elements/attributes.
An entry in the dependencies table denotes the
dependency of an element on other elements/attributes.
Elements/attributes are integrated only after their
dependencies are integrated.
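The rule that an element is integrated only after its dependencies amounts to a topological ordering of the dependencies table. The sketch below shows that idea with Python's standard graphlib module (Python 3.9 or later). The table contents are an assumption derived from the GSE1 sample schema, and representing the table as a dictionary is a choice made here, not the paper's format.

```python
from graphlib import TopologicalSorter

# Each entry maps an element to the elements it depends on.
dependencies = {
    "gs_equipment": {"machine", "location"},
    "machine": {"supplier", "serial_number", "service_agreement", "service_hours"},
    "service_agreement": {"expiry_date"},
    "location": {"airport", "terminal"},
}

# static_order() yields every element after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
position = {name: i for i, name in enumerate(order)}
```

Terminals such as expiry_date come first and gs_equipment comes last; the relative order of unrelated elements is not fixed by the dependencies.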
[Figure: graphical representation of the global schema obtained]
Construction of the Global schema Document
Once the integration process is completed, the global schema
in XSDM notation is used to construct the global XML schema
document.
The construction of the XML schema document is a straightforward process because all the data about the schema is
present in the XSDM notation.
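A minimal sketch of this construction step is given below. It assumes a much-reduced model in which a non-terminal is just a name plus a list of (name, type, minOccurs, maxOccurs) children combined as a sequence; the function build_element is hypothetical and covers none of the other node types, attributes or substitution groups.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2000/10/XMLSchema"  # namespace used in the samples
ET.register_namespace("xsd", XSD)            # serialize with the xsd: prefix

def build_element(name, children):
    """Build an element declaration whose content is a sequence of children."""
    el = ET.Element(f"{{{XSD}}}element", {"name": name})
    complex_type = ET.SubElement(el, f"{{{XSD}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XSD}}}sequence")
    for child_name, type_name, lo, hi in children:
        ET.SubElement(sequence, f"{{{XSD}}}element", {
            "name": child_name,
            "type": type_name,
            "minOccurs": str(lo),
            "maxOccurs": str(hi),
        })
    return el

location = build_element("location", [
    ("airport", "xsd:string", 1, 1),
    ("terminal", "xsd:string", 1, 1),
])
text = ET.tostring(location, encoding="unicode")
```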
Global schema document
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2000/10/XMLSchema"
    targetNamespace="http://www.GSEMexample.org"
    elementFormDefault="qualified"
    xmlns:GSEM="http://www.GSEMexample.org"
    xmlns:GSE2="http://www.GSE2example.org">
  <xsd:element name="gs_equipment">
    <xsd:complexType>
      <xsd:choice>
        <xsd:sequence>
          <xsd:element ref="GSEM:machine" minOccurs="1" maxOccurs="1"/>
          <xsd:element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
        </xsd:sequence>
        <xsd:sequence>
          <xsd:element ref="GSEM:location" minOccurs="1" maxOccurs="1"/>
          <xsd:element ref="GSEM:machine" minOccurs="0" maxOccurs="1"/>
        </xsd:sequence>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="machine">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="supplier" type="xsd:string" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="serial_number" type="GSEM:serial_number_type" minOccurs="0" maxOccurs="1"/>
        <xsd:element ref="GSEM:service_agreement" minOccurs="0" maxOccurs="1"/>
        <xsd:element ref="GSE2:service_agreement" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="service_hours" type="xsd:decimal" minOccurs="0" maxOccurs="1"/>
        <xsd:element name="vendor" type="xsd:string" minOccurs="0" maxOccurs="1"/>
      </xsd:all>
      <xsd:attribute name="type" use="optional">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:enumeration value="baggage_handler"/>
            <xsd:enumeration value="boarding_stairs"/>
            <xsd:enumeration value="tow_truck"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
  <xsd:simpleType name="serial_number_type">
    <xsd:union memberTypes="xsd:string xsd:positiveInteger"/>
  </xsd:simpleType>
  <xsd:element name="service_agreement">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="expiry_date" type="xsd:date" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="location" substitutionGroup="GSEM:placement">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="placement">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="airport" type="xsd:string" minOccurs="1" maxOccurs="1"/>
        <xsd:element name="terminal" type="xsd:string" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
Advantages
This method is useful when a required global schema is not
present.
The global XML schema obtained is complete, minimal and
understandable.
Human interaction is required only to a limited extent.
Even though local schemas are large and complex, the
global schema can be obtained efficiently.
Disadvantages
User interaction is required; the task cannot be done using
semantic learning alone.
Not all key conflicts can be resolved. Complete
knowledge of the data is required to resolve these.
The method has no cross-check on the user’s
input. The process may result in a non-minimal schema if
the user doesn’t recognize all the correspondences.
Conclusion
This method is successful in integrating schema
documents.
The method explained is implementable.