Web services

Matching and Reuse of XML Schemas
1
Sample XML Schema
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="car">
<xs:complexType>
<xs:sequence>
<xs:element name="make" type="xs:string"/>
<xs:element name="model" type="xs:string"/>
<xs:element name="year" type="xs:string"/>
<xs:element name="color" type="xs:string"/>
<xs:element name="driver">
<xs:complexType>
<xs:sequence>
<xs:element name="first" type="xs:string"/>
<xs:element name="last" type="xs:string"/>
<xs:element name="license" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
2
What is XML schema matching
 Matching – identifying the relations among the
corresponding elements of two schemas
 e.g. customer/firstName <==> client/name/first
 customer/name <==> concatenate(client/name/first, client/name/last)
 Calculate the distance between two Schemas
 E.g., distance between customer.xsd and client.xsd is 0.67.
3
Why XML Schema matching
 From data integration point of view:
 Purpose: Automatically identifying corresponding elements between two
schemas
 Relevant works:
 Database schema matching/mapping, e.g., A. Doan, et al., Reconciling schemas of
disparate data sources: A machine-learning approach. SIGMOD, 2001
 Generic schema mapping, e.g., J. Madhavan, P. A. Bernstein, E. Rahm. Generic schema
matching with Cupid. VLDB, 2001.
 XML Schema matching, e.g., H. Do, E. Rahm. COMA: A system for flexible combination of
schema matching approaches. VLDB, 2002.
 From web service composition point of view
 e.g., matching the output type of one service with the input of another in
sequential composition
 From software reuse point of view:
 Purpose: Build XML Schema categories and search engines;
 Relevant works:
 Software component search: A Mili, R Mili, RT Mittermeir, A survey of software reuse
libraries, Annals of Software Engineering, 1998.
 Agent and service matching: Katia Sycara, Jianguo Lu, Matthias Klusch, Interoperability
among Heterogeneous Software Agents on the Internet, Technical Report CMU-RI-TR-98-22, CMU.
4
What are the problems
 Modelling
 As graph
 As tree matching
 Node similarity
 Name, type, cardinality.
 Structure similarity
 Tree edit distance
 K. Zhang, D. Shasha. Simple fast algorithms for the editing distance
between trees and related problems. SIAM Journal of Computing, 1989.
5
Overview of our system
[Figure: system pipeline. Inputs: two XML Schemas. Stages: modelling; name similarity (name relations); node similarity (node relations); structural similarity (structural relations); results retrieval.]
6
Three similarities
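[Figure: the three similarities and the techniques behind them]
 Name similarity: node name, compared using WordNet, string matching, and the Hungarian method
 Node similarity: user-defined data type, built-in data type, and cardinality, compared using compatibility tables
 Structural similarity: hierarchical structure, compared using a tree matching algorithm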
7
Modelling
Model schemas as trees
<xs:element name="driver" type="driverType"/>
<xs:attribute name="license" type="xs:string"/>
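A minimal sketch of the tree model, assuming only nested xs:element and xs:attribute declarations matter and that wrappers such as xs:complexType and xs:sequence are skipped. The helper names (`child_nodes`, `to_tree`) and the shortened car schema are illustrative, not part of the system described in these slides. Type references, recursion and imports are not resolved here.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
NODE_TAGS = (XS + "element", XS + "attribute")

# Shortened variant of the sample car schema, with license as an attribute.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="car">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="make" type="xs:string"/>
        <xs:element name="driver">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="first" type="xs:string"/>
              <xs:element name="last" type="xs:string"/>
            </xs:sequence>
            <xs:attribute name="license" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def child_nodes(node):
    """Yield the nearest element/attribute descendants, skipping wrappers."""
    for child in node:
        if child.tag in NODE_TAGS:
            yield child
        else:
            yield from child_nodes(child)

def to_tree(node):
    """Return (name, [subtrees]) for an element or attribute declaration."""
    return (node.get("name"), [to_tree(c) for c in child_nodes(node)])

root = ET.fromstring(SCHEMA)
tree = to_tree(root.find(XS + "element"))
print(tree)
```

Elements and attributes both become plain tree nodes, so the matcher can treat `license` the same way whether a schema declares it as a child element or as an attribute.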
8
Modelling
[Figure: schema trees for customerOrder (shipping, billing, address, date, ship2Add, bill2Add, street, province, postcode, state, zip) and paper (title, author, contents, reference, refNo), illustrating three modelling cases: Reference, Recursion, and Importing and Inclusion (Address_ca.xsd, Address_us.xsd).]
9
Information excluded in Modelling
 Related to elements or attributes
 Default value, value range, unique, nullable…
 Related to structure
 Sequence
 All
 Choice
[Figure: two trees for name, one with children first, last and one with children last, first.]
10
Computing node similarity
 Computing name similarity with the help of:
 WordNet and its API
 String matching
 Hungarian method
 Add the similarity of other information
 Data type
 Minimum cardinality
 Maximum cardinality
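One way to combine these ingredients is a weighted sum. The sketch below is only an illustration: the weights, the compatibility table entries and the cardinality rule are hypothetical values, since the slides do not state the ones actually used.

```python
# Hypothetical weights; the slides do not give the actual values.
WEIGHTS = {"name": 0.6, "type": 0.2, "min": 0.1, "max": 0.1}

# Hypothetical excerpt of a data type compatibility table.
TYPE_COMPAT = {
    ("xs:int", "xs:integer"): 0.9,
    ("xs:int", "xs:string"): 0.3,
}

def type_similarity(t1, t2):
    """Look up the compatibility of two built-in types (symmetric)."""
    if t1 == t2:
        return 1.0
    return TYPE_COMPAT.get((t1, t2), TYPE_COMPAT.get((t2, t1), 0.0))

def cardinality_similarity(c1, c2):
    """Hypothetical rule: equal bounds match fully, others partially."""
    return 1.0 if c1 == c2 else 0.5

def node_similarity(n1, n2, name_sim):
    """Weighted sum of name, data type and cardinality similarity."""
    parts = {
        "name": name_sim,
        "type": type_similarity(n1["type"], n2["type"]),
        "min": cardinality_similarity(n1["min"], n2["min"]),
        "max": cardinality_similarity(n1["max"], n2["max"]),
    }
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)

year1 = {"type": "xs:int", "min": 1, "max": 1}
year2 = {"type": "xs:integer", "min": 0, "max": 1}
print(node_similarity(year1, year2, name_sim=1.0))
```

Keeping the name similarity as an input lets the same function be reused with any name matcher.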
11
Name similarity from token lists
 Tokenize names
 E.g. clientName -> client name
 submittedReports -> submit report
 Similarity between two token lists
 Using Hungarian method for Weighted Bipartite Graph Matching
(WBGM)
[Figure: weighted bipartite graph between the token lists of customerDeliveryAddress (customer, delivery, address) and clientRequiredShippingAddress (client, require, shipping, address); the edge between token i and token j carries the similarity sim i,j.]
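The token-list comparison can be sketched as follows. This is a simplified illustration under several assumptions: `difflib.SequenceMatcher` stands in for the WordNet-based token similarity, an exhaustive search over assignments stands in for the Hungarian method (acceptable only because names have few tokens), stemming (e.g. required -> require) is omitted, and the final normalisation is a guess rather than the formula used in the system.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations

def tokenize(name):
    """Split a camelCase name into lower-case tokens (no stemming)."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def token_sim(a, b):
    """String similarity in [0, 1]; a stand-in for a WordNet lookup."""
    return SequenceMatcher(None, a, b).ratio()

def name_similarity(name1, name2):
    """Best one-to-one token assignment, normalised by total token count."""
    t1, t2 = tokenize(name1), tokenize(name2)
    if len(t1) > len(t2):
        t1, t2 = t2, t1          # t1 is now the shorter list
    if not t1:
        return 0.0
    # Brute-force maximum weight bipartite matching (assumption: few tokens).
    best = max(
        sum(token_sim(a, b) for a, b in zip(t1, perm))
        for perm in permutations(t2, len(t1))
    )
    return 2 * best / (len(t1) + len(t2))

print(tokenize("customerDeliveryAddress"))
print(name_similarity("customerDeliveryAddress", "clientRequiredShippingAddress"))
```

For longer token lists the exhaustive search would be replaced by a polynomial-time assignment algorithm such as the Hungarian method named on the slide.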
12
Determine the structural relation
Structure similarity
[Figure: two example trees, Tree 1 and Tree 2.]
13
Common substructure
[Figure: two car schema trees with nodes car, make, model, year, color, driver and license; the driver's name fields are first and last in one tree, firstName and lastName in the other.]
14
Approximate Common Structure
[Figure: the same two car schema trees (car, make, model, year, color, driver, license; first and last versus firstName and lastName), marking their approximate common structure.]
15
Mappings in an ACS
[Figure: the car tree (make, model, year, color, driver, first (firstName), last (lastName), license) divided into two approximate common substructures, ACS1 and ACS2.]
mACS1 = {(s1.car, s2.car),
(s1.make, s2.make),
(s1.year, s2.year),
(s1.color, s2.color)}
mACS2 = {(s1.driver, s2.driver),
(s1.first, s2.firstName),
(s1.last, s2.lastName),
(s1.license, s2.license)}
16
Evaluation
 Criteria
 Matching outcomes
 Mappings
 Schema similarity
 Execution time
 Collected four groups of Schemas
 Purchase orders used in COMA (5)
 Large schemas from XML.org (86)
 Schemas on hospitality domain (95)
 Extract from WSDL (419)
17
Comparison with edit distance algorithm: element
mapping on data group 1
[Figure: bar chart of precision and recall (0.0 to 1.0) for method 1 and method 2 on Tasks 1 to 10, each a pair of schemas such as 1&2, plus the average.]
Method 1: our algorithm
Method 2: edit distance
18
Comparison with edit distance: schema similarity on data
groups 3 and 4
[Figure: bar chart of top-k precision (top-3 and top-5, 0.0 to 1.0) for method 1 and method 2 on schema groups 3 and 4.]
Method 1: our algorithm
Method 2: edit distance
19
Comparison with edit distance: performance
on data group 2
[Figure: average matching time in seconds (0 to 250) for method 1 and method 2 against input size (M*N), in buckets 0-1k, 1k-2k, 2k-4k, 4k-6k, 6k-8k, 8k-10k, 10k-12k, 12k-14k, 14k-16k and 16k-20k.]
Method 1: our algorithm
Method 2: edit distance
20
Comparison with COMA (Mapping)
             COMA – 'All'    COMA – 'All+SchemaM'    Our algorithm
Precision    about 0.95      about 0.93              0.88
Recall       about 0.78      about 0.89              0.87
Overall      0.73            0.82                    0.75
Overall is a measure that combines precision and recall. It
reflects the effort of removing incorrect mappings and adding
missing ones.
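The three measures can be computed from the set of mappings a matcher returns and the set of expected mappings. The sketch below assumes the definition used in the COMA evaluation, Overall = Recall * (2 - 1/Precision), written here in the equivalent form (correct - wrong) / expected. The sample mappings are made up for illustration.

```python
def evaluate(found, expected):
    """Return (precision, recall, overall) for two sets of mappings."""
    correct = len(found & expected)
    wrong = len(found - expected)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(expected)
    # Equivalent to recall * (2 - 1 / precision) when precision > 0.
    overall = (correct - wrong) / len(expected)
    return precision, recall, overall

# Hypothetical example: 8 expected mappings, 7 found correctly, 1 wrong.
expected = {("car", "car"), ("make", "make"), ("year", "year"),
            ("color", "color"), ("driver", "driver"),
            ("first", "firstName"), ("last", "lastName"),
            ("license", "license")}
found = (expected - {("last", "lastName")}) | {("last", "license")}

print(evaluate(found, expected))
```

Overall becomes negative once wrong mappings outnumber correct ones, which matches its reading as post-match effort: in that case fixing the result costs more than matching by hand.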
21
Conclusion
 Scalable schema matching
 Wang Lian, David W. Cheung, Nikos Mamoulis, and Siu-Ming Yiu,
An Efficient and Scalable Algorithm for Clustering XML Documents
by Structure, TKDE, 2005.
 Subtyping
 Apply to web service matching
22
Web service synthesis
23
Web Service Composition
 Composite web service: “service implemented by
combining the functionality provided by other web
services” –G. Alonso et al.
 Web service composition: the process of developing a
composite web service
 Approaches to web service composition:
 Conventional programming languages, such as Java, C#;
 Web service composition languages, such as BPEL;
 Workflow, pi-calculus, Petri net, automata…
 Web service synthesis.
24
Web Service Synthesis
 BPEL and the like are still programming languages
 They describe exactly how to compose the web services.
 Web service synthesis
 We describe what the service is, but not how to
implement it;
 We do not even know which component services are involved;
 The relevant services are discovered and invoked dynamically;
 The implementation is synthesized automatically from the web
service specification.
 Program synthesis has a long history.
25
Web Service Synthesis
[Figure: a web service (WS) has a service specification, made of a syntactic specification (WSDL) and a semantic specification (Datalog), and a service implementation. The composite service's implementation (BPEL) is built on component services WS1 and WS2.]
26
Synthesis Example
MetaSearchService
 Service specification
 Syntactic: interface definition defined by WSDL
 Semantic:
Q(ISBN, PRICE, TITLE, RATE) <- Chapters(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR),
Book2(ISBN, COMMENT, RATE).
 Implementation: ??
Chapters
 Syntactic specification: …
 Semantic specification:
chapters(ISBN, PRICE, TITLE, AUTHOR) <- Chapters(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR).
amazon
 Service specification
 Syntactic specification: WSDL file
 Semantic specification:
amazon(ISBN, PRICE, RATE, TITLE, AUTHOR) <- Amazon(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR),
Book2(ISBN, COMMENT, RATE).
 Service implementation: Java code, database
27
Generate the abstract implementation by query rewriting
MetaSearchService
 Service specification
 Syntactic: interface definition defined by WSDL
 Semantic:
Q(ISBN, PRICE, TITLE, RATE) <- Chapters(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR),
Book2(ISBN, COMMENT, RATE).
 Abstract implementation:
Q(ISBN, PRICE, TITLE, RATE) <- amazon(ISBN, PRICE, RATE, TITLE', AUTHOR'),
chapters(ISBN, PRICE0, TITLE, AUTHOR).
Chapters
 Syntactic specification: …
 Semantic specification:
chapters(ISBN, PRICE, TITLE, AUTHOR) <- Chapters(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR).
amazon
 Service specification
 Syntactic specification: WSDL file
 Semantic specification:
amazon(ISBN, PRICE, RATE, TITLE, AUTHOR) <- Amazon(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR),
Book2(ISBN, COMMENT, RATE).
 Service implementation: Java code, database
28
Generate the Concrete Implementation
MetaSearchService
 Service specification
 Syntactic: interface definition defined by WSDL
 Semantic:
Q(ISBN, PRICE, PRICE0, TITLE, RATE) <- …
 Abstract implementation:
Q(ISBN, PRICE, PRICE0, TITLE, RATE) <- amazon(ISBN, PRICE, RATE, TITLE', AUTHOR'),
chapters(ISBN, PRICE0, TITLE, AUTHOR).
 Concrete implementation:
 Invoke amazon;
 Invoke chapters;
 Combine the output;
Chapters
 Syntactic specification: …
 Semantic specification:
chapters(ISBN, PRICE, TITLE, AUTHOR) <- Chapters(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR).
amazon
 Service specification
 Syntactic specification: WSDL file
 Semantic specification:
amazon(ISBN, PRICE, RATE, TITLE, AUTHOR) <- Amazon(ISBN, PRICE),
Book1(TITLE, ISBN, AUTHOR),
Book2(ISBN, COMMENT, RATE).
 Service implementation: Java code, database
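The three steps of the concrete implementation (invoke amazon, invoke chapters, combine the output) can be sketched as a join on ISBN. The two `invoke_*` functions below are hypothetical stubs returning made-up rows; a real implementation would call the services through their WSDL interfaces.

```python
def invoke_amazon():
    """Stub for the amazon service: rows of (ISBN, PRICE, RATE, TITLE, AUTHOR)."""
    return [("111", 30.0, 4.5, "XML Basics", "Lee"),
            ("222", 45.0, 3.9, "Web Services", "Kim")]

def invoke_chapters():
    """Stub for the chapters service: rows of (ISBN, PRICE, TITLE, AUTHOR)."""
    return [("111", 28.0, "XML Basics", "Lee"),
            ("333", 20.0, "Datalog", "Park")]

def meta_search():
    """Q(ISBN, PRICE, PRICE0, TITLE, RATE) <- amazon(...), chapters(...)."""
    # Index the chapters rows by ISBN, the shared join variable.
    chapters_by_isbn = {}
    for isbn, price0, title, _author in invoke_chapters():
        chapters_by_isbn.setdefault(isbn, []).append((price0, title))
    # Combine the output: keep books returned by both services.
    result = []
    for isbn, price, rate, _title, _author in invoke_amazon():
        for price0, title in chapters_by_isbn.get(isbn, []):
            result.append((isbn, price, price0, title, rate))
    return result

print(meta_search())
```

Only ISBN is shared between the two body atoms of the rewriting, so the combination step is an equi-join on that column; TITLE is taken from chapters, as in the rule.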
29
It is a lightweight approach…
 Web services are restricted to be database queries or
functions that can be described by database queries or
Datalog;
 Semantic specification is Datalog instead of more powerful
specification mechanism employing ontology;
 Compositions are restricted to data composition instead of
full-blown process specification such as BPEL.
 All those choices are meant for the construction of a
practical web service synthesis system…
30
Mapping between Datalog and Web Services
 Database vendors also provide wrappers for web services
 Behind a web service there is a SQL query that corresponds to the
web service;
 SQL defines the semantics of the web service.
 Major database vendors support the mapping between SQL and
Web service;
 We experimented with DB2WS.
Malaika, S. et al. DB2 and Web Services. IBM Systems Journal, 41(4), pp. 666-685. 2002.
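To illustrate the idea that a SQL query defines the semantics of a service operation, here is a small sketch using Python's built-in `sqlite3` module as a stand-in. It does not use DB2 or its web service tooling; the table contents, the query and the function name `get_price` are made up for illustration.

```python
import sqlite3

# In-memory database standing in for the provider's data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Chapters (isbn TEXT PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO Chapters VALUES (?, ?)",
                 [("111", 28.0), ("333", 20.0)])

# The query behind the operation; it is the operation's semantic definition.
PRICE_QUERY = "SELECT isbn, price FROM Chapters WHERE isbn = ?"

def get_price(isbn):
    """Service operation: the input is bound to the query parameter."""
    return conn.execute(PRICE_QUERY, (isbn,)).fetchall()

print(get_price("111"))
```

Because the operation is nothing more than a parameterised query, its meaning can also be written as a Datalog rule over the same relation, which is what makes query rewriting applicable.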
31
Generate the Abstract Implementation by Query
Rewriting
Definition: Given a query Q and a set of views V, a
rewriting of Q using V is a query Q' such that Q = Q'
and Q' refers to one or more views in V.
Views:
V1 <- T1, T2.
V2 <- T2, T3.
Query:
Q <- T1, T2, T3.
Rewriting 1:
Q <- V1, T3.
Rewriting 2:
Q <- V1, V2.
Our query rewriting system
33
Limitations of our approach
 Focus on database web services;
 Datalog is not expressive enough.
 Query rewriting in Description Logic, or OWL.
 Assume the existence of global database schemas:
 Service providers need to provide the semantic definition of web
services in terms of a global database schema;
 New service specifications are also defined using the common schema;
 Schema matching
34
Other threads
 Web service collection and clustering
 From UDDI, crawlers, and search engines such as Google
 Master's thesis to be finished this summer
 Web service metrics
 Schema subtyping
 Based on regular tree grammar
 Master's thesis to be finished this summer
 Bottom up web service composition
 Semantic web service
35
Service Oriented Architecture
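[Figure: service-oriented architecture triangle. The provider publishes to the discovery agency, the requester finds services through the discovery agency, and the requester interacts with the provider.]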
36
Web service discovery
 Keyword search
 Based on IR techniques, such as vector space model
 Fast, but not accurate
 Signature matching
 Decide subtype relations between input and output of web services
 Used in service composition, to find composable web services
 Relaxed matching
 Approximate matching, allowing small deviations in both structure
and words/tags
 Semantic matching
 Matching functional requirements of web services
 Used in adaptive, autonomous systems
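As an illustration of the first approach, keyword search with a vector space model can be sketched with raw term frequencies and cosine similarity. The service names and descriptions below are made up, and real systems would normally add weighting such as tf-idf, stop-word removal and stemming.

```python
import math
from collections import Counter

# Hypothetical registry: service name -> textual description.
SERVICES = {
    "BookPrice": "return the price of a book given its isbn",
    "Weather": "return the weather forecast for a city",
}

def vectorize(text):
    """Bag-of-words term frequency vector."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine of the angle between two term frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in v1)   # Counter gives 0 for missing terms
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def search(query):
    """Rank service names by similarity of their description to the query."""
    q = vectorize(query)
    return sorted(SERVICES,
                  key=lambda name: cosine(q, vectorize(SERVICES[name])),
                  reverse=True)

print(search("book price"))
```

The ranking looks only at shared words, which is why this kind of search is fast but cannot tell whether the inputs and outputs of a service actually fit the request.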
37