Accommodating Data Heterogeneity
in Ultra-Large-Scale (ULS) Systems
Christopher Scaffidi
Mary Shaw
Carnegie Mellon University
Problem: Data heterogeneity
among software elements in ULS systems
• Software elements:
– Created by autonomous stakeholders
– Differing data formats
– May switch to new formats without prior notice
• End-user programmers:
– Create particularly unreliable software elements
– “Mash up” (integrate) software elements
problem  approach  proof-of-concept
Example: Exchanging person names
• Today: John Smith
• Tomorrow: Smith, John (unexpected format!)
– Unanticipated need for “glue code” to reformat
• Tomorrow: Lincolnshire MCC (questionable!)
– Need to validate data, maybe trigger fail-over
Similar issues for data from users,
external datasets, or the web.
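As a rough illustration of the glue code and validation this slide calls for, the sketch below accepts two name formats and flags anything else as questionable. The function name `to_first_last` and the deliberately narrow patterns are assumptions made for this example; real person names need far more permissive rules.

```python
import re

# Hypothetical, deliberately narrow notion of a "name word" (assumption).
_WORD = r"[A-Z][a-z]+"
FIRST_LAST = re.compile(rf"^({_WORD}) ({_WORD})$")   # e.g. John Smith
LAST_FIRST = re.compile(rf"^({_WORD}), ({_WORD})$")  # e.g. Smith, John


def to_first_last(name: str):
    """Return (name in 'First Last' form, True), or (name unchanged, False)
    when the input matches neither known format and should be treated as
    questionable."""
    if FIRST_LAST.match(name):
        return name, True
    m = LAST_FIRST.match(name)
    if m:
        return f"{m.group(2)} {m.group(1)}", True
    return name, False


for sample in ["John Smith", "Smith, John", "Lincolnshire MCC"]:
    print(sample, "->", to_first_last(sample))
```

A caller could use the boolean to decide whether to accept the value, ask for confirmation, or trigger fail-over to another data source.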
Other examples of
data format heterogeneity
• Room Numbers
– NSH 3103 vs Newell Simon Hall 3103
• Stocks
– GOOG vs Google vs Google Corporation
• Address Lines
– 101 Main St. vs 101 MAIN STREET vs 101 Main Str.
• Phone Numbers
– 888-800-2030 vs +1 888 800 2030 vs (888) 800-2030
• State Names
– California vs CA vs Calif.
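To make the phone number case concrete, here is a minimal sketch that maps the three formats listed above onto one canonical form. It assumes ten-digit North American numbers with an optional leading country code of 1; the function name `canonical_phone` is invented for this example.

```python
import re
from typing import Optional


def canonical_phone(text: str) -> Optional[str]:
    """Return the number as NNN-NNN-NNNN, or None if it does not look like
    a ten-digit number (optionally prefixed by country code 1)."""
    digits = re.sub(r"\D", "", text)          # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                    # drop the country code
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


for sample in ["888-800-2030", "+1 888 800 2030", "(888) 800-2030"]:
    print(sample, "->", canonical_phone(sample))
```

Stripping every non-digit is cruder than recognizing each format explicitly, since it would also accept oddly punctuated input; it is used here only to keep the sketch short.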
Insight: Exchange kinds of data
(rather than particular formats)
[Figure: contact records from different sources, each holding the same kinds of data in a different format]
RAY TILL | (404) 555-1203 | 2 PITT ST | PGH, Penna.
MR. ART COR | 282.303.4040 | 15 RED RUN RD. | pittsburgh PA
Doe, Jane | +1 717 292 3030 | 88 Brooke Lane | PITTSBURGH Pennsylvania
John Smith | 303-202-3030 | 101 Main St. | Pittsburgh, PA
JOHN SMITH | (303) 202-3030 | 101 MAIN ST | Pittsburgh, PA
Insight: Exchange kinds of data
(rather than particular formats)
• Needed: Metadata indicating a reusable abstraction for
validating and reformatting each kind of string-like data.
– “I am sending you a string that I call a ‘phone number’, and
here’s the code to validate it and reformat it”
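One way to picture such a reusable abstraction is as a named bundle of formats, each able to recognize a string and to render it. The sketch below is an assumption-laden illustration, not the authors' implementation: the `Format` and `Tope` classes, the format names, and the patterns are all invented here.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Format:
    parse: Callable[[str], Optional[str]]  # string -> canonical digits, or None
    render: Callable[[str], str]           # canonical digits -> string


@dataclass
class Tope:
    name: str
    formats: Dict[str, Format]

    def validate(self, text: str) -> bool:
        """True if any known format recognizes the string."""
        return any(f.parse(text) is not None for f in self.formats.values())

    def reformat(self, text: str, target: str) -> Optional[str]:
        """Render the string in the target format, or None if unrecognized."""
        for f in self.formats.values():
            canonical = f.parse(text)
            if canonical is not None:
                return self.formats[target].render(canonical)
        return None


def _parser(pattern: str) -> Callable[[str], Optional[str]]:
    rx = re.compile(pattern)

    def parse(text: str) -> Optional[str]:
        m = rx.fullmatch(text)
        return "".join(m.groups()) if m else None

    return parse


phone = Tope("phone number", {
    "dashed": Format(_parser(r"(\d{3})-(\d{3})-(\d{4})"),
                     lambda d: f"{d[:3]}-{d[3:6]}-{d[6:]}"),
    "dotted": Format(_parser(r"(\d{3})\.(\d{3})\.(\d{4})"),
                     lambda d: f"{d[:3]}.{d[3:6]}.{d[6:]}"),
    "parens": Format(_parser(r"\((\d{3})\) ?(\d{3})-(\d{4})"),
                     lambda d: f"({d[:3]}) {d[3:6]}-{d[6:]}"),
})

print(phone.reformat("282.303.4040", "parens"))
```

Because every format converts to and from one canonical value, adding a new format takes one new entry rather than a converter for every pair of formats.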
Proof of concept:
Exchanging XML and HTML
• Data providers label XML/HTML nodes with a “tope”
– “This node is what I call a ‘phone number’, and here’s
where you can find code to validate and reformat it.”
• Each tope’s implementation is stored at a published URL
• On receiving data, a system
– Downloads the tope implementation
– Executes it to validate and put data into desired format
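The receiving side's two steps could be sketched as follows. Everything specific here is an assumption: the slides do not say how a tope implementation is encoded, so this example uses a small JSON description (accepted patterns plus an output template), a made-up URL, and an in-memory dictionary standing in for the HTTP download so the sketch runs offline.

```python
import json
import re
from typing import Callable, Optional


def load_tope(url: str, fetch: Callable[[str], str]) -> dict:
    """Step 1: download and decode a tope description (JSON is assumed)."""
    return json.loads(fetch(url))


def apply_tope(tope: dict, text: str) -> Optional[str]:
    """Step 2: validate the text against the accepted patterns and return it
    in the tope's output format, or None if no pattern matches."""
    for pattern in tope["accepts"]:
        m = re.fullmatch(pattern, text)
        if m:
            return tope["output"].format(*m.groups())
    return None


# Hypothetical published tope; the dictionary lookup replaces an HTTP GET.
TOPE_URL = "http://example.org/topes/tel.json"
FAKE_WEB = {
    TOPE_URL: json.dumps({
        "accepts": [r"(\d{3})-(\d{3})-(\d{4})",
                    r"\((\d{3})\) ?(\d{3})-(\d{4})"],
        "output": "+1 {0} {1} {2}",
    })
}

tope = load_tope(TOPE_URL, FAKE_WEB.__getitem__)
print(apply_tope(tope, "(203)484-2030"))
```

A declarative description like this avoids running downloaded code directly; a system that really executes downloaded implementations would also need to decide how far it trusts the publisher.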
Sample code
XML
<!-- topesheet = http://softwaresurvey.cs.cmu.edu/topes.txt -->
<mydoc><whatever>
<tel>233-222-3040</tel><date>11-Jan-96</date>
<tel>(203)484-2030</tel><date>12/30/2007</date>
</whatever></mydoc>
TopeSheet
xpath:/mydoc/whatever/date{tope:url(http://www.w3c.org/topes/date_EN.xml);}
xpath:/mydoc/whatever/tel{tope:url(http://myserver.com/custom_tel.xml);}
Client Code
ItemLoader loader = ItemLoader.FromXml(xml);
ItemSet items = loader.Load("xpath:/*/tel");
List<String> values = items.FormatAs("+1 404 505 6060");
// overloaded methods let you override the topes and/or validate the data
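For readers who want to see how a topesheet might be resolved against the document, here is a sketch in Python using only the standard library. It is not the ItemLoader API shown above: the rule grammar is inferred from the two sample rules, and `parse_topesheet` and `select` are hypothetical helpers that handle only simple absolute paths.

```python
import re
import xml.etree.ElementTree as ET

# Grammar inferred from the sample rules: xpath:<path>{tope:url(<url>);}
RULE = re.compile(r"xpath:(?P<path>[^{]+)\{tope:url\((?P<url>[^)]+)\);\}")


def parse_topesheet(text: str) -> dict:
    """Map each path in the sheet to the URL of its tope."""
    return {m["path"].strip(): m["url"] for m in RULE.finditer(text)}


def select(root: ET.Element, path: str) -> list:
    """Resolve a simple absolute path such as /mydoc/whatever/tel."""
    head, _, rest = path.lstrip("/").partition("/")
    if head != root.tag:
        return []
    return root.findall(rest) if rest else [root]


XML = """<mydoc><whatever>
<tel>233-222-3040</tel><date>11-Jan-96</date>
<tel>(203)484-2030</tel><date>12/30/2007</date>
</whatever></mydoc>"""

SHEET = """
xpath:/mydoc/whatever/date{tope:url(http://www.w3c.org/topes/date_EN.xml);}
xpath:/mydoc/whatever/tel{tope:url(http://myserver.com/custom_tel.xml);}
"""

root = ET.fromstring(XML)
# For each tope URL, collect the text of the nodes it labels.
labels = {url: [e.text for e in select(root, path)]
          for path, url in parse_topesheet(SHEET).items()}
for url, values in labels.items():
    print(url, values)
```

The `select` helper exists because `ElementTree` does not accept absolute paths on an element; it checks the root tag itself and searches the remainder of the path relative to the root.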
Benefits of labeling strings with topes
• Systems can detect invalid inputs
• Software elements can use varying formats
– No explicit references to format identifiers
– No need for ontology consensus
• Topes are reusable for data in…
– XML nodes
– Database tuples
– HTML tags
– Webform fields
– Spreadsheet cells
– …and more
Thank You…
• To Jeff Magee, Betty Cheng, Barbara Ryder, Margaret
Burnett, and others at ICSE 2007 for early feedback
• To NSF for funding
• To ULSSIS for this opportunity to participate