MashQL: A Query-by

MashQL: A Query-by-Diagram Language
Technical Article
Marios D. Dikaiakos
Mustafa Jarrar
University of Cyprus
[email protected]
University of Cyprus
[email protected]
ABSTRACT
This article is motivated by the massively increasing structured
data on the web (data web), and the need for novel methods to
expose this data to its full potential. Building on the remarkable
success of Web 2.0 mashups, and specially Yahoo Pipes, we
generalize the idea of mashups and regard the Internet as a
database. Each internet data source is seen as a table, and a
mashup is seen as a query on these tables. We assume that web
data sources are represented in RDF, and SPARQL is the query
language.
We propose a mashup language, called MashQL, which allows
people to mash up data intuitively. In the background, MashQL
queries are translated into and executed as SPARQL queries. The
novelty of MashQL is that it allows one to query (and navigate)
an RDF graph without any prior knowledge about its structure or
technical details; as well as it supports pipelines and materialized
queries as built-in concepts. Users also do not need any
knowledge about RDF/SPARQL to get started.
Although we focus on RDF/SPARQL mashups, but our approach
can be easily reused for other data formats and query languages.
Keywords
Query-by-diagram, Query Formulation, Data Mashup, Semantic
Mashup, Semantic Web, Data Web, Linked Data, Web 2.0
Table of Contents
1. Background and Motivation ........................ 1 2. Related Work and Contributions ................. 3 3. The Basics of MashQL .................................. 4 4. Formal Definition of MashQL ...................... 5 5. Query Pipelines ........................................... 8 6. Implementation ........................................... 8 7. Use Cases................................................... 11 8. Discussion .................................................. 13 9. Evaluation .................................................. 14 Appendices ...................................................... 16 Remark: This is a live report, updated regularly (every week/month) to include our latest developments on all aspects related to
MashQL (its formal syntax and semantics, implementation, use cases, evaluation, etc.). The content of this article is not published
yet, thus if you wish to cite this work, please use the reference below. Any publication stemming from this report will be listed here.
Reference: Jarrar M, Dikaiakos M: MashQL: A query-by-diagram. Technical Article TAR200805. University of Cyprus, 2008
Download from: http://www.jarrar.info/mashql/TA/
1. Background and Motivation
The rapid growth of Web 2.0 content has created a high demand
for making this content more reusable. Companies are competing
not only on gathering more content and contributions from
people, but also on making their content available for using inside
their own websites. Many companies such as Google, Yahoo,
Microsoft, Amazon, eBay, LinkedIn, and Wikipedia, have made
their content publicly accessible through APIs. People are
encouraged to make their own applications and profit based on
others’ content. For example, one can build a program to access
the content of the Craigslist real-state database to find apartments
in certain area, mix this content with location information from
Google Maps, and provide a new web service that was not
originally provided by either source. Another web service can be
created to find the events happening in a city, by integrating the
content of several event databases (such as Upcoming, and
Google Base), mix the results with relevant photos from Flickr,
and render the final results on Yahoo Maps. Web applications that
consume content originated from third parties and retrieved via a
public interface or API are called Mashups.
To expose the massive amount of public content and to allow
people to build mashups easily, several mashup editors have been
launched, including Google Mashup, Microsoft’s Popfly, IBM’s
Smash, Yahoo Pipes, and few others. Yahoo Pipes have received
the greatest attention thanks to their simplicity. Yahoo Pipes
allow people to combine different data sources into mashups, in a
graphical and user-friendly way without having to write code.
Yahoo Pipes generalize the idea of the mashup, providing a drag
and drop editor that … can easily fetch data from any data source
providing an RSS, Atom or RDF feed, extract the data the user
wants, combine it with data from other sources, apply various
built-in filters, and have the output directed to a web page or to
other users’ pipes” [Ot]. Some people consider Yahoo Pipes to be
“a milestone in the history of the internet” [Ot]. The most
interesting part in this, is that a user does not really need to have
technical experience to get started with Yahoo Pipes.
However, the limitation of Yahoo Pipes and other mashup editors
is that they focus only on providing encapsulated access to some
public APIs or feeds. Web feeds are capable of only representing
news items; they are not capable of representing structured-data
retrieved from the so-called Data Web. Currently, the
development of data mashups requires extensive programming
skills.
1.1 Position (Mashups as a Data Retrieval Paradigm)
To build on the remarkable success of Web 2.0 mashups and
expose the data web to its full potential, we propose to regard the
Internet as a database; each data source is seen as a table, and
each mashup as a query. The average user should be able to build
mashups intuitively, and mashups should be reusable; so that
people can reuse and build on each other’s results. Mashups
should not be limited to sophisticated queries, but can also be
simple inquiries, e.g., “What books and articles Hacker wrote last
month”. In other words, we propose to also regard mashups as a
mechanism for general-purpose data retrieval.
In what follows we present our motivation of using RDF and
SPARQL in data mashups. The main challenges to building such
data mashups are presented at the end of this section.
1.2 Why RDF and SPARQL
Assuming the Internet data is represented in RDF, querying this
data can be done using SPARQL [P], the RDF query language.
SPARQL is a recent recommendation by the W3C. It allows one
to query remote RDF resources, in a manner similar to the
querying of databases using SQL. A SPARQL query is a set of
graph patterns; any data triple matching these patterns is added to
the query results. An example shown in Figure 1 retrieves
Hacker’s articles published after 2000, from two web locations.
http://Site1.com/RDF
:a1 :Title “Web 2.0”
:a1 :Author “Hacker B.”
:a1 :Year 2007
:a1 :Publisher “Springer”
:a2 :Title “Web 3.0”
:a2 :Author “Smith B.”
http://Site2.com/RDF
:4 :Title “Semantic Web”
:4 :Author “Tom Lara”
:4 :PubYear 2005
:5 :Title “Web services”
:5 :Author “Bob Hacker”
Query:
PREFIX S1: <http://site1.com/rdf>
PREFIX S2: <http://site1.com/rdf>
SELECT ? ArticleTitle
FROM <http://site1.com/rdf>
FROM <http://site2.com/rdf>
WHERE {
{{?X S1:Title ?ArticleTitle}UNION
{?X S2:Title ?ArticleTitle}}
{?X S1:Author ?X1} UNION {?X S2:Author ?X1}
{?X S1:PubYear ?X2} UNION {?X S2:Year ?X2}
FILTER regex(?X1, “^Hacker”)
FILTER (?X2 > 2000)}
Results:
ArticleTitle
Web 2.0
Figure 1. An example of a SPARQL query.
RDF is expressive in representing data and its semantics in a
portable and linked form [BHB], and other data formats can be
easily mapped into RDF [Wr]. Although RDF has been
standardized by W3C since 1999 ˗to play the role of a
semantically enabled metadata model˗ only recently has it
received a special attention from leading companies. For example,
Yahoo announced that the next generation of their search engine
will understand web semantics through RDF [Y]. Several models
of RDF (such as RDFa and eRDF, microformats, and standard
vocabularies) will also be supported by Yahoo. MySpace
announced that they are adopting the semantic web technology
and that they will use RDF for profile and data portability [O].
Upcoming is already publishing their content in microformats and
RDFa. Furthermore, Oracle 11g supports RDF storage and query.
Querying RDF in Oracle is done in a SPARQL-like style. As
shown by Oracle in [CDES], this implementation is scalable. For
example, a query with a medium size complexity over 80 Million
RDF triples (5.2 GB) takes one or few seconds. This support from
the leading companies is indeed accelerating the adoption of RDF
as the main metadata language. Therefore, we believe that RDF
and SPARQL mashups will be an important trend of web
applications in the near future.
In particular, we expect a wide spread of RDFa -embedding RDF
triples inside XHTML. For example, when searching (Yahoo
Local, Scholar, Jobs.ac.uk, etc.), the results will be represented in
RDFa, so that both humans and machines can understand these
results. In this way, the web is seen as set of RDF sources that can
be queried and processed as structured and linked data, or the socalled data web1.
1
We are not strict that the vocabulary in an RDF source is grounded to an
agreed and shared ontology. Hence, we prefer (as O'Reilly [M]) to call a
web of RDF sources as data web, rather than semantic web. Ontologies
will be learnt and used as a next step.
1.3 Challenges to Building Data Mashups
The problem is that building data mashups requires high
programming skills and intensive efforts. There is no yet an
approach to easy access and consume structured data on the web.
in the following we specify two challenges. Although we focus on
RDF/SPARQL mashups, but other data formats and query
languages do face similar challenges.
The schema challenge. Understanding the structure of an RDF
source (in order to formulate a query about it) is a challenging
task indeed. Before a user formulates a query, she needs to know
how the data is structured, and what are the labels of the data
elements, i.e., the schema. Web users are not expected to
investigate “what is the schema” each time they search or filter
structured information. This issue is particularly more difficult in
case of RDF. RDF data may come without a schema, or the
schema is mixed up with the data. In addition, as RDF data is a
graph, one have to navigate this graph in order to formulate a
query about it. Imagine a large RDF source with diverse content,
how you would manage to understand the data structure, interrelationships, namespaces, and the unwieldy labels of the data
elements. In short, formulating queries in open environments,
where data structures and vocabularies are unknown in advance,
is a hard challenge, and may hamper building data mashups by
non-IT people, i.e., regarding data mashups as a mechanism for
general-purpose data retrieval.
The pipelining challenge. It is very likely that the output of a
mashup is used as input to another. For example, one may create a
mashup by querying Upcoming to retrieve all events happening in
Paris (Q1); another person may create another mashup based on
Q1 to only retrieve the scientific events (Q2); a third person may
mash up the results of Q2 with the events retrieved from the ACM
Conferences (Q3); and so on. Mashups that connect to each other
in this way are called pipes. Although the pipelining concept is
very useful indeed, because it enables people to collaborate and
build on each others’ efforts (which we call social collaboration
and innovation), however, allowing queries to be pipelined in an
open environment may lead to poor performance and
complexities. For example, it is possible to end up with loops; the
results of mashup A is used as input in mashup C, and the results
of C is used by A. Such loops may lead to empty results after
some iterations, and may cause computational overheads.
Furthermore, when querying a set of remote RDF data sources,
these sources must be downloaded and stored locally (at the
query-engine side) before executing the query [MPTL]. The time
complexity of the query itself is not the problem (as queries on
local data are scalable [CERE]), however, the bottleneck will be
the downloading time. The more queries call each other, the
worse the performance may be. The time required to execute a
pipe of queries depends on the pipeline depth, the amount of
transferred data, and the transferring bandwidth.
This article proposes a mashup language called MashQL, which
uses SPARQL as a backend query language. It encapsulates the
complexity of SPARQL and allows people to query RDF sources
intuitively (see Figure 2). In the background, MashQL queries are
translated into and executed as SPARQL queries. The novelty of
MashQL is that it allows one to formulate a query over a data
source(s) without any prior knowledge about its schema. MashQL
does not also assume its users to have any general knowledge
about RDF or SPARQL to get started. Hence, the average internet
user can use MashQL to develop data mashups easily.
Furthermore, MashQL supports pipelines and materialized queries
as built-in concepts.
2. Related Work and Contributions
In this section we overview different approaches to query
formulation, focusing on the usability of these approaches for
non-IT people.
Accessing and consuming structured data is a task that is usually
limited to professional programmers. This is indeed a general
challenge and is not only limited to RDF and SPARQL. Non-IT
professionals who cannot write queries in a technical language
(like SQL) are limited to accessing databases using forms.
Query-by-form is an old practice; users can fill in and submit a
form, where all fields in this form are seen as query variables.
This way of data access is simple; however, it is neither flexible
nor expressive. For each query, a form needs to be developed, and
any change to the query implies changing the form.
Query-by-example allows users to formulate their queries as
filling a table [Z]. The names of the queried relations and fields
are selected first; then users can enter their keywords. Although
this approach is claimed to be easy to learn by non-IT people,
however, it was not used by such people. In our opinion, this is
because users are still required to understand the relational
structure, which is difficult for non-IT people.
Conceptual query languages are an alternative approach to
query formulation. As many databases are modeled conceptually
using EER or ORM diagrams, one can also query these databases
starting from those diagrams. Users can select some concepts
from a given conceptual diagram, and their selection is
automatically translated into SQL queries. This scenario was
implemented by several EER-based [CERE,PS] and ORM-based
[DMV, HPW] approaches. ConQuer [BH] is another ORM-based
language, but it has some nice features indeed. Instead of starting
from a conceptual diagram that may not exist, it starts from the
logical schema and converts it into lists of concepts and relations.
Users can then drag-drop from these lists to formulate their
queries. What users drag-drop become a tree of facts, and this tree
is seen as a query. Although this drag-drop scenario is not simple,
however structuring a query as a tree-pattern looks intuitive
indeed.
Although conceptual query languages received a considerable
amount of research, but none of these languages was used in
practice. In our opinion, this is because formulating a query
starting from a conceptual diagram is still a difficult task for nonIT people. In addition, the need to query databases at the
conceptual level is not an important issue, because a database is a
single enterprise’s project, and the world of its developers and
users is closed. In open worlds such as the Web, structured data is
being created and consumed by different users, and the need for a
mechanism to mash up and consume this distributed and
heterogeneous data easily is a real demand.
Furthermore, some techniques start to emerge for filtering streams
of information (see e.g. Outlook, Yahoo Pipes, Google Base). For
example one can block/permit an item (e.g., email, feed, job) if it
contains (or less than, same as...) a certain value.
In the recent years, we are observing some advanced techniques
start to emerge for filtering streams of information, which we call
Query-by-Filter. For example, Yahoo Pipes does not support any
query language; however, the Filter module has some general
concepts, which allow people to permit/block items according to a
certain set of conditions. One may block any web feed that
contains (/doesn’t contain/the same as/ LessThan…) a certain
keyword in the title of a feed. This way of expressing filters is
also used in most email applications for filtering and organizing
emails. Google Base allows one to search information in a
mixture of query-by-form and query-by-filter manner. Although
the flexibility and expressivity of such filters are very limited, if
compared to query languages; however, this approach is well
understood and being successfully used by non-IT people.
Furthermore, the need for simplified query techniques is receiving
a high importance within the semantic web community. Most
approaches are proposing to Visualize Triple Patterns, see
GRQL [ACK], iSPARQL [iS], NITELIGHT [RSBS] and
RDFAuthor [Ra]. The idea is to represent triple patterns
graphically as ellipses connected with arrows, so that one would
need less programming skills to formulate a query. Other
semantic web approaches are suggesting to use visual scripting
languages, such as SPARQLMotion [Sm] and Deri Pipes [TPM].
Their idea is to allow users to use visual box and lines, however,
queries are written in a textual form. We found all of these
approaches assume advanced knowledge of RDF and SPARQL,
thus cannot be used by the casual user.
Please refer to [KB] about a usability study on what casual users
prefer, which concludes that a query language should be close to
natural language and graphically intuitive.
MashQL is a generalization and extension to many aspects of the
above approaches, yielding a formal and expressive but yet
simple, query-by-diagram language. MashQL inherits some
aspects from conceptual queries and query-by-filters. Similar to
ConQuer (and somehow LISA-D), MashQL queries are
represented as trees, which makes queries easy to understand.
Tree branches in MashQL are similar to filtering rules, which
makes query formulation as simple as building filters. The lookand-feel of the MashQL is inspired from Yahoo Pipes.
The difference between MashQL and the query-by-filter
approaches is that MashQL is a general language for querying any
structured data, not only filtering a specific structure of a data
stream, as in Yahoo Pipes. In addition, unlike conceptual queries
that start from a conceptual or logical database schema, MashQL
is fundamentally different as it assumes that it is not necessary for
the queried data sources to have a schema at all.
3. The Basics of MashQL
MashQL is a query-by-diagram language. The idea of MashQL is
to allow people to query, mash up, and pipeline structured data
intuitively. In the background MashQL queries are automatically
translated into and executed as SPARQL queries. The novelty of
MashQL is that it allows one to query (and navigate) an RDF
graph without any prior knowledge about its structure or
technical details; as well as it supports query pipelines as a builtin concept. Figure 2 shows a MashQL query that is equivalent to
the SPARQL query in Figure 1. The first module specifies the
query input, while the second MashQL module specifies the
query body. The output of this query can be piped into a third
module (not shown here), which renders the results into a certain
format (such as HTML or XML), or as RDF input to other
MashQL queries.
Figure 2. An example of MashQL query.
3.1 The Intuition of MashQL
The intuition of MashQL is described as the following: Each
MashQL query is seen as a tree. The root of this tree is called the
query subject (e.g. Article), which is the subject matter being
inquired. Each branch of the tree is called a query restriction and
is used to restrict a certain property of the query subject. Branches
can be expanded to allow sub trees (called query paths), which
enable one to navigate the underlying data sources (see Figure 3).
This query retrieves the recent articles from Cyprus, i.e. every
article that is written by an author, who has an address, this
address has a country called Cyprus, and the article is published
after 2000.
PREFIX …
SELECT ?ArticleTitle, ?Institute
FROM …
WHERE {
?Article :Title ?ArticleTitle
?Article :Author ?X1
?X1 :Address ?X2
?X2 :Country ?X3
OPTIONAL{?X1 :Affiliation ?Institute}
?Article :Year ?X4
FILTER (?X3 = “Cyprus”)
FILTER (?X4 > 2000)}
Figure 3. Query paths (/sub trees) in MashQL.
3.2 The Query Formulation Protocol
Formulating MashQL queries is designed to be an interactive
process, in which the complexity and the responsibility of
understanding data structures are moved from the user to the
query editor. Users only use drop-down lists to express their
queries. While interacting with the query editor, the editor
performs some background queries and dynamically generates
these lists. In what follows, we describe these background queries.
• After a user selects the dataset in the RDF Input module,
formulating a MashQL query is done by first selecting the
query subject, which is offered through a drop-down list
generated from (the union of all subject and object identifiers2
in the RDF dataset), no matter whether an identifier represents
an instance or a type. Users can also choose not to select from
the list and introduce their own subject label. In this case, the
subject is seen as a variable and displayed in italic, which
means any data item3.
• To add a restriction on the chosen subject, a (list of the
possible properties for this subject) is dynamically generated.
For example, given the data in Figure 1, if a user chooses a1
as a subject, the list of the a1’s properties will be {Title,
Author, Publisher, Year}; if the subject is a variable, the list
will be the set of all properties in the dataset. Users can also
choose whether this property is required, optional, or
unbound. If a property is prefixed with “maybe” this property
is considered optional (see Figure 4), if it is prefixed with
“without” it is considered unbound, and if it is not prefixed
then it is required.
• Users may then choose an object filter such as (Equals,
Contains, Doesn’t contain, OneOf, Between, MoreThan, Not,
etc.), or may select an object identifier from a list, which is
generated from (the set of the possible objects, depending on
the previously chosen subject and predicate). Furthermore,
users can also click on the restriction icon to expand the tree,
i.e., declare a query path as shown in Figure 3. The symbol 5
can be used before subject, property, or object variables to
indicate that this variable will be returned in the results. For
example, the results of the above query are a one-column
table that contains the list of all retrieved titles4.
3.3 MashQL Intuitiveness.
The trade-off between expressivity and simplicity in MashQL is
achieved by making technical variables and namespaces to be
implicit, and through the tree structure of MashQL queries, which
is close to the intuition people use in their natural language
communication. For example, the query path shown in Figure 3
means, retrieve the article that has an Author x1, and x1 has an
address x2, and x2 has a country x4, and x4 equals “Cyprus”.
Furthermore, suppose you would like to ask; “Give me the list of
2
3
People not familiar with the RDF terminology may glance section 4.1.
The default value for a subject (in case a user does not select from the
offered list or introduce his own label) is the variable “Everything”.
4
MashQL supports some novel interface issues that are lengthy or difficult
to illustrate here, especially the edit-and-verbalize modes. For example,
when a user clicks on a restriction, it gets the editing mode and all other
restrictions get the verbalize mode (i.e., all boxes and lists are made
invisible, but the verbalization of their content is generated and displayed
instead, as shown in all figures). This does not only make the query
formulation process even easier and joyful, but more importantly, from a
methodology viewpoint it makes the readability of the queries closer to
natural language, by which users are guided to achieve what they intended
to query. Similarly, when two predicates originating from different sources
have the same label, their namespaces are hidden and one of them is
displayed, unless the user decided not so.
all stores that sell parts of the iPhone mobile, and that are located
in Brussels”; or, “Which cinemas are located in Brussels, offer a
movie called ‘Fahrenheit’ and will be played between 20:00 and
23:00”. Apart from some terms (such as: give me the list of all,
which, who, that are), all of these inquiries can be directly
converted into MashQL queries. Hence, MashQL can be used by
both the average internet users and IT professionals to create data
mashups intuitively and as expressive as SPARQL.
4. The Formal Definition of MashQL
Before presenting the formal definition of MashQL, we define the
structure of a data source.
4.1 Dataset
MashQL assumes the queried dataset is structured as (or mapped
into) a directed labeled graph, i.e. similar to but not necessarily
the same as RDF. For example, suppose you have a person with
the firstname Bob, surname Hacker, working at a company called
IBM and located in the USA. This information can be represented
as primitive facts forming the graph: {<P1,Firstname,“Bob”>,
<P1,Surname,“Hacker”>,<C1,Name,”IBM”>,<C1,Country,”USA”>,<P1,WorkAt,C1>,<C1,T
ype,Company>,<P1,Type,Person>}.
Notice that the last two triples declare
P1 as an instance of a person and C1 as an instance of a
company5. A MashQL dataset D is a set of data triples <Subject,
Predicate, Object> forming a directed labeled graph. The value of
a subject and a predicate can only be an identifier I, i.e. a URL or
a key; the value of an object can be an identifier I or a literal L.
Def.1 (Dataset): A dataset D is a set of triples, each triple t is
formed as <S, P, O>, where S ∈ I, P ∈ I, and O ∈ I ∪ L.
The only difference between this definition and the RDF model is
that we allow an identifier to be any form of a key (i.e. not
necessarily a URL). Allowing an identifier to be a database key,
for example, would enable MashQL to be used for querying
databases. Relational databases (and XML) can be converted to
this primitive data structure. For example, the primary key of a
table can be seen as a subject, a column label as a predicate, and
the data-entry in that column as an object. Foreign keys represent
relationships between data elements across tables.
Each object literal must have a datatype. If an object value does
not have an explicit datatype, it can be implicitly assumed by
taking advantage of XML and RDF conventions: in RDF and
XML, the general syntax for literals is a String, enclosed in either
double or single quotes; Integers can be written directly without
quotes or explicit datatype; Boolean literals are written as true or
false; and so on. In RDF and XML Schema, stating a datatype
explicitly is done using namespaces, such as: "1"^^xsd:integer,
"2004-12-06"^^xsd:date.
Def.2 (Typed Literals): Every object literal must have a datatype
D: If O ∈ L then O ∈ D.
Object literals may also have a language tag Lt (such as “En”,
“Gr”, “It”) to indicate to which language the object value belongs.
5
The “Type” predicate (which somehow represents the schema of an RDF
source) might not be declared at all, i.e. poorly schematized data. This is
another reason why query formulation in MashQL does not start from
data schemes, if compared with the conceptual query languages (See our
discussion in section 2).
In the RDF best practice, language tags are expressed using @
followed by a language tag, such as “Person”@En, “Ατομο”@Gr.
Def.3 (Language Tags): An object literal (O ∈ L) may have a
language tag Lt.
4.2 The MashQL Basics
This section presents the formal definition and the meaning of
MashQL constructs. The SPARQL interpretation of each MashQL
construct is presented as a mapping rule after each definition.
A MashQL query is seen as a tree. The root of this tree is called
the query subject, and each branch is called a restriction. A
MashQL query can be defined on a one, or a merge6 of several
data sources, called the dataset. All triples in the dataset matching
the query restrictions are retrieved.
Def. 4 (Query): A Query Q with a subject S, denoted by Q(S), is
a set of restrictions on S. Q(S) = R1 AND … AND Rn.
Def. 5 (Subject): A subject S ∈ (I ∪ V), where I is an identifier
and V is a variable.
Def. 6 (Restriction): A restriction R = <Rx , P, Of>, where Rx is
an optional restriction prefix that can be (maybe | without), P is a
predicate (P ∈ I ∪ V), and Of is an object filter (which shall be
defined shortly).
The meaning of the MashQL restrictions is described as the
following: When evaluating a query Q(S) against a dataset D,
only the triples that satisfy all restrictions are retrieved, such that:
First, if a restriction does not have a prefix (R=< , P, Of>), the
truth-evaluation of the restriction is considered true if the subject,
the predicate, and the object-filter are matched (see the first two
restrictions in Figure 4). This case is mapped into a normal graph
pattern in SPARQL (see rule-3). Second, if a restriction is
prefixed with “maybe” (R=<maybe, P, Of>), its truth-evaluation
is considered always true (see the third restriction in Figure 4).
This case is mapped into an optional graph pattern in SPARQL
(see rule 4). Third, if a restriction is prefixed with “without”
(R=<without, P, Of>), it is truth-evaluation is considered true if
the subject S and the predicate P do not appear together in a triple
(see the last restriction in Figure 4). In SPARQL, this restriction
means that the object O should not be bound (see rule 5).
The following rules map the MashQL constructs into SPARQL:
Rule-1: The symbol 5 before a variable means that it will be returned in
the results; i.e., included in the SELECT part of in SPARQL.
If the output of the query is input to another, use CONSTRUCT *
Rule-2: In any of the following rules, if a subject, predicate, or object is
italicized: it is seen as a SPARQL variable, i.e. prefixed with “?”.
Rule-3: If S is a subject and R = < , P, Of>, the mapping is:
{S P O}.
Rule-4: If S is a subject and R = <maybe, P, Of>, the mapping is:
{OPTIONAL{S P O}}.
Rule-5: If S is a subject and R = < without, P, Of>, the mapping is:
{S P O. FILTER (!bound(?O))}.
Example. The query in Figure 4 means: retrieve everything (call
this thing as a Person) that: has a name (call it as PersonName);
works at Yahoo; possibly studying somewhere (call it as
University); and, does not have any salary. In other words, when
evaluating this query, we retrieve the triples that have a subject
that i) appears in a triple with the predicate Name, ii) appears in a
6
In this paper we do not discuss how several RDF sources can be merged
into one graph, which is part of the W3C standard. Basically, all triples
are put together, but prefixed with respective namespaces.
triple with the predicate WorkFor and the object Yahoo, iii) may
appear in a triple with the predicate StudyAt, and iv) does not
appear in a triple with the predicate Salary.
SELECT ?PersonName, ?University
WHERE {
?Person :Name ?PersonName.
?Person :WorkFor :Yahoo.
OPTIONAL{?Person :StudyAt ?University}
?Person :Salary ?X1.
FILTER (!Bound(?X1))}
}
Figure 4. A MashQL query and its mapping into SPARQL.
The above definitions specify the simple forms of restrictions, i.e.
without object filters Of. In what follows, we define nine forms of
object filters, which allow further restrictions based on the value
of an object (such as Equals, Between, OneOf, Not, etc). The last
definition (Def.7.9) specifies how to allow sub-queries (i.e. query
paths) based on object values.
Def.7 (Object Filter): An object filter Of = <O, f>, where O is an
object and f is a filtering function. An object filter can have one of
the following nine forms:
1. Of = <O>, where O is an object, O ∈ V ∪ I. This is the
simplest object filter, i.e., it does not add any restriction on the
object value of the retrieved triples, as shown in Figure 4.
2. Of = <O, Equals(X, T, Lt)>, where X can be a variable or a
constant, T is a datatype, and Lt is a language tag. This filter
restricts the retrieved results, such that, the object value O
should be equal to X, with datatype T, and with language Lt.
See the restriction (Country “Cyprus”) in Figure 3. This object
filter is mapped into a simple SPARQL filter (see rule-6).
3. Of = <O, Contains(X, T, Lt)>, where O is an object variable, X
is a regex literal, T is a data type, and Lt is a language. This
filter restricts the retrieved results, such that, the object value O
should be equal to regex(X), with datatype T, and with language
Lt. A regex literal is a literal that contains a regular expression
matching pattern. See the restriction (Author “^Hacker”) in Figure
2. This object filter is mapped into a SPARQL filter with a
regex function (see rule-7).
4. Of = <O, MoreThan(X, T)>, where O is an object variable, X is
a variable or a constant, T is a datatype. This filter restricts the
retrieved results, such that, the object value O should be more
than X and with datatype T. See the restriction (Year > 2000) in
Figure 3, the general SPARQL mapping rule-8.
5. Of = <O, LessThan(X, T)>, where O is an object variable, X is
a variable or a constant, T is a datatype identifier. This filter
restricts the retrieved results, such that, the object value O
should be less than X and with datatype T (see rule-9).
6. Of = <O, Between(X, Y, T)>, where X and Y are variables or
constants, T is a datatype identifier. This filter restricts the
retrieved results, such that, the object value O should be more
than or equals X, less than or equals Y, and with datatype T.
7. Of = <O, OneOf(V)>, where O is an object variable, and V is a
set of values {v1, ... , vn}, vi is a variable or constant. This filter
restricts the retrieved results, such that, the object value O
should be equal to one of the values in V. See Figure 13. Notice
that this construct is not supported in SPARQL, but can be
emulated using filters (see rule-12).
8. Of = <O, Not(f)>, where f is one of the functions defined above.
This filter extends all of the above functions with simple
negation. As an example, see the query restriction (-Lifestyle Not
in Figure 15. The filter is same as the Equals
filter but with negation, i.e., Not Equal.
9. Of = <O, Qi(O)>, where O is an object (O ∈ V ∪ I), and Qi(O)
is a sub-query with O being the query subject. The restrictions
defined in the sub-query Qi(O) should be satisfied as well (e.g.
see Figure 3). Notice that this definition is recursive; however,
this does not mean the query itself is recursive.
A
SELECT ?Person
WHERE {
?Person :WorkFor :Google
UNION
?Person WorkFor :Yahoo}
B
SELECT ?FName
WHERE {
?Person :Surname ?FName
UNION
?Person :Firstname ?FName}
“Vitamins in Take“)
Rule 6. If Of = <O, Equals(X, T, Lt)>:
Append the mapping with: FILTER(?O = X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O)=T)
If Lt ≠ Null: Append the mapping with: FILTER(lang(?O) = Lt)
Rule 7. If Of = Contains(X, T, Lt)>:
Append the mapping with: FILTER regex(?O, X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O)=T)
If Lt ≠ Null: Append the mapping with: FILTER(lang(?O) = Lt)
Rule 8. If Of = <O, MoreThan(X, T)>:
Append the mapping with: FILTER(?O > X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O=T)
Rule 9. If Of = <O, LessThan(X, T)>:
Append the mapping with: FILTER(?O < X)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O=T)
Rule 10. If Of = <O, Between(X, Y, T)>:
Append the mapping with: FILTER(?O >=X)&& FILTER(?O<=Y)
If T ≠ Null: Append the mapping with: FILTER(datatype(?O)=T)
Rule 11. If Of = <O, OneOf (V)>:
Append the mapping with:
{FILTER(?O = V1)|| . . . || FILTER(?O = Vn)}
If Vi is a regex-ed literal, the ith filter above should be replaced with:
C
D
SELECT ?AgentName, ?AgentPhone
WHERE {
{?Person rdf:type :Person.
?Person :Name ?AgentName.
?Person :Phone ?AgentPhone}
UNION
{?Company rdf:type :Company.
?Company :Name ?AgentName.
?Company :Phone ?AgentPhone}}
SELECT ?CustName,
WHERE {
?Person :Name ?CustName.
UNION
{?Company :Title ?CustName.
?Company :City ?X1.
FILTER regex(?X1, “Paris”)}}
Figure 6. MashQL’s support of unions.
Rule 16. Given On , If n >1 and Oi ∈ I : The mapping in rules 3-4 will be:
{{S P :O1} UNION . . . UNION {S P :On}}
Rule 17. Given Pn , If n >1 and Pi ∈ I : The mapping in rules 3-4 will be:
{{S :P1 O} UNION . . . UNION {S :Pn O}}
Rule 18. Given Sn , If n >1 and Si ∈ I : Regenerate the query n times,
each time with Si as a root, and with a UNION between the queries.
Rule 19. Given Qn , If n >1 : Add UNION between the n queries.
FILTER Regex(?O, Vi)
Rule 12. If Of = <O, Not(f)>:
The f filter will be generated as above, but with a negation.
Rule 13. If Of = <O, Qi(O)>:
Repeat all mapping rules to generate Qi(O).
4.3 Other Constructs
There are other constructs that are not presented in this article for
the sake of simplicity. This section highlights some of them.
Def.8 (Types): A subject (S ∈ I) or an object (O ∈ I) can be
prefixed with “a” or “an” to mean the instances of this
subject/object type, instead of the subject/object itself.
Def.10 (Reverse): <~P> indicates the reverse of the predicate P.
Let R1 be a restriction on S such that <S P O>, and R2 be <O ~P
S>, R1 and R2 have the same meaning.
This construct allows concise queries. For example, given a
dataset D = {<:C1, :name, “IBM”>, <:P1, :name, Bob>, <:P1,
:WorkFor, :C1>}. In order to retrieve the company names that
people work for (see the right-side query in Figure 7), which
might be not intuitive for some people, one would use the reverse
construct to retrieve the same results as shown in the left-side.
The following query retrieves every instance of the concept
Person that works for an instance of the concept Company.
SELECT ?Person, ?Company
WHERE {
?Person rdf:type :Person
?Person :WorkFor ?Company
?Company rdf:type :Company }
Figure 5. MashQL’s support of types.
Rule 14. If a subject S is prefixed with “a” or “an”:
Append the mapping with: {?S rdf:type :S}
Rule 15. If an object O is prefixed with “a” or “an”:
Append the mapping with: {?O rdf:type :O}
Def.9 (Union): A union can be declared between objects,
predicates, subjects and/or queries, in the following forms:
1. On = <O1\O2 \ . . . \On>, to indicate unions between objects,
where Oi ∈ I. See Figure 6.A.
2. Pn = <P1\P2 \ . . . \Pn>, to indicate unions between predicates,
where Pi ∈ I. See Figure 6.B.
3. Sn = <S1\S2 \ . . . \Sn>, to indicate unions between subjects,
where Si ∈ I. See Figure 6.C.
4. Qn = <Q1\Q2 \ . . . \Qn>, to indicate unions between queries,
See Figure 6.D.
Figure 7. MashQL’s support of reverse predicates.
Mapping the reverse construct into SPARQL is done by simply
swapping the positions of the subject and the object (see rule-13).
Rule 20. If S is a subject and R = <~P, O>, the mapping is:
{O P S}.
5. Query Pipelines in MashQL
As discussed earlier, allowing people to pipeline queries (i.e.,
build on each other’s results) is an important requirement for
social collaboration and innovation. Similar to Yahoo Pipes,
MashQL supports query pipelines as a built-in concept, rather
than only an interface issue. For example, depending on the
pipeline structure, MashQL generates either SELECT or
CONSTRUCT queries (see Figure 13). In addition, to overcome
the pipelining challenges described earlier, MashQL allows query
results to be materialized, and enforce pipelines to be acyclic. The
result of a query (called derived source) is stored physically and
deployed as a concrete RDF source. Primal input sources (called
base sources) are also cached for performance purposes.
Dif 11: Given a query Q over a set of base or derived sources
{D1,…,Dm}, the results of this query is denoted as D =
Q(D1,…,Dm), and D ∉ {D1,…,Dm}.
We define a Pipe as an acyclic chain of SPARQL queries, where
the results of a query is an input to the next query (see Figure 8).
Dif 12: Given a derived RDF source D, the chain of the queries
that derives D is acyclic, and denoted as the pipe P(D).
For example, the pipe P(D10) in Figure 8 is the chain
Q5(Q3(Q1(D1,D2),Q2(D2,D3,D4)),D8), and the pipe P(D3) is the
chain Q3(Q1(D1,D2),Q2(D2,D3,D4)). A pipe can also be defined
recursively, such as P(D10) = Q5(P(D7), P(D8)), and P(D3) =
Q3(P(D5), P(D6)). If D is a base source, then P(D) = D. In other
words, a pipe to a base resource leads to the resource itself. As we
shall discuss shortly, refreshing a pipe P(D) implies re-executing
all of the queries in this pipe, while refreshing Q(D) implies reexecuting only the query that derives it.
time stamp of its previous successful execution RQT (such as
RQ317052007,13:04:53); and an auto-refresh time interval (such as
RQ32592000). Each query will be automatically executed if its autorefresh interval expires and one of its input sources is updated.
Notice that the auto-refresh boundaries might be enforced by the
system in order to limit heavy and undesired automatic
executions.
Dif 14: (Query Auto-Refresh) Let Qi be a query over a set of
sources {D1,…,Dm}, and T is a given time. Qi will be executed if
(RQiT + RQiA) ≤ T and (RQiT < RDjT), where 1 ≤ j ≤ m.
Similarly, a pipe P(D) will be automatically refreshed if RDA
expires. Refreshing a pipe implies executing the chain of queries
in this pipe, starting from the bottom to the topmost.
Dif 14: (Pipe Auto-Refresh) Let P(D) be a pipe, D = Qn
(D1,…,Dm), and T is a given time. If (RDT + RDA) ≤ T, each ith
query in P(D) will be executed if (RQiT < RDjT ), where 1 ≤ j ≤ m
for Qi, and 1 ≤ i ≤ n. Queries in P(D) are executed from the
bottom to the topmost, or recursively as P(P(D1),…,P(Dm)).
Although these updating strategies are simple and might be
sufficient in many cases in practice, however, as shown in the
data warehousing literature, incremental updates strategies (e.g.,
[AD, ZGHW]) are indeed the efficient. The idea is that if a base
source receives a new transaction (insert or delete), only these
transactions are transformed and the affected queries are
refreshed. We would like to remark that incremental updating is
still an open research issue for RDF; especially that, unlike data
warehouses, RDF sources and their queries live in an open world.
5.2 Performance-based Materialization
In this section, we present an algorithm to detect which queries
are most important to materialize. Based on the log of the
pervious execution of a query, and based on how many times the
results of this query are used, the algorithm finds the most
expensive and important queries, and suggest whether a query
shuld to be materialized or not.
As our implementation prototype is implemented using Oracle
11g, we extend the XXX table, which is a system table that
Oracle generates to keep a log of the execution of queries. This
table contains …. We extend this table with …. To be continued
soon
Figure 8. Depict of Semantic Web Pipes.
We call the problem of keeping a pipe up-to-date, the pipes
consistency.
Dif 13: (Consistency) Let D be the results of a query Q over a set
of sources {D1,…,Dm}. Let T be the latest time the set {D1,…,Dm}
has been changed. Then, D is consistent at T, if D= Q(D1,…,Dm).
5.1 Updating Strategies
To maintain the pipes consistency, we propose two simple
updating strategies: Query auto-refresh and Pipe auto-refresh. Let
us first introduce the notion of source registry RD and the notion
of query registry RQ, as the following: Each derived or base
source D has a time stamp of its last update RDT (such as
RD113042008,23:44:23); and an auto-refresh time interval RDA (such as
RD12592000 , i.e. to be refreshed every month). Each query has a
6. Implementation Issues
MashQL can generally be implemented by online mashup editors
(similar to or as an extension to Yahoo Pipes), or as a query
interface for online RDF datasets (e.g., Freebase, DBpedia, or
DBLP). It can be implemented also as a query plug-in to offline
RDF stores (e.g., AllegroGraph or Oracle); or it can be used to
filter metadata streams in, e.g., iTunes, jobs.ac.uk, eBay, or
Upcoming.
We are currently developing a MashQL prototype in a Yahoo
Pipes style. This prototype is an AJAX web-based plug-in to
Oracle 11g. Choosing Oracle is not only because of its scalability,
but also because Oracle’s SPARQL inherits all functionalities of
SQL, including aggregation and grouping functions, which are not
supported in the standard SPARQL. Our implementation
prototype does not only generate the W3C’s standard SPARQL,
but being extended to also generate Oracle’s SPARQL [CDES].
certain set or a standard of metadata elements. Any element can
be included by specifying its "name" and "content", see Figure 9.
Although this article focuses on using MashQL for querying RDF
data sources using SPARQL, however, MashQL can be similarly
used for querying relational databases or XML documents. In this
case, one needs to either develop a stylesheet that translates
MashQL markups into SQL, XQuery, or any preferred backend
query language; or maybe map the dataset (e.g. using views) into
an RDF-like model.
The RDFInput section represents RDF sources. Each RDFInput
element represents an RDFInput module in a pipe, which can be
zero or many. Each module represents the URL of the RDF
sources, the latest upload, caching locations, and order in the
module.
6.1 MashQL Markup
Based on the formal definition of MashQL that we have presented
in section 4, this section presents the technical specification of
MashQL queries, called the MashQL markup. The goal of this
markup is to serialize graphical MashQL queries in a textual and
interchangeable format. In this way, tools will be able to save,
load, process, and exchange queries easily.
Example. Figure 9 illustrates the markup of the MashQL pipes
shown in Figure 2, which is also equivalent to the SPARQL query
in Figure 1. In appendix A2 we also illustrate the markup of the
job-seeking use case shown in Figure 13.
<Pipe ID="0" xsi:noNamespaceSchemaLocation="MashQL.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Meta Content="Title" Name="Bob's Articles"/><Meta Content="Creator" Name="Bob Hacker"/>
<RDFInput ID="1" X1="10" Y1="10" X2="50" Y2="110">
<Source Order="1">
<Ref>http://www.site1.com/rdf</Ref>
<LastUpload>2008-06-02T09:30:47.0Z</LastUpload></Source>
<Source Order="2">
<Ref>http://www.site2.com/rdf</Ref>
<LastUpload>2008-06-02T09:32:47.0Z</LastUpload></Source>
</RDFInput>
<Query ID="0" X1="0" X2="0" Y1="xs:integer" Y2="0" InputModule="1">
<Header> <Meta Content="Title" Name=""/>
<Prefix Tag="S1" Ref="http://www.site1.com/rdf"/>
<Prefix Tag="S2" Ref="http://www.site2.com/rdf"/>
</Header>
<Body>
<Subject Name="Article" Type="Variable" isRetrun="false"/>
<Restriction Prefix="">
<Predicate Name="S1:Title" Type="Identifier" isReturn="false" />
<Predicate Name="S2:Title" Type="Identifier" isReturn="false" />
<Object Name="ArticleTitle" Type="Variable" isReturn="true" /> </Restriction>
<Restriction Prefix="">
<Predicate Name="S1:Author" Type="Identifier" isReturn="false" />
<Predicate Name="S2:Author" Type="Identifier" isReturn="false" />
<Object Name="X1" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="Contains" Value="^Hacker" Type="Variable" Language="" DataType="" />
</Restriction>
<Restriction Prefix="">
<Predicate Name="S2;Year" Type="Identifier" isReturn="false" />
<Predicate Name="S2:PubYear" Type="Identifier" isReturn="false" />
<Object Name="X2" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="MoreThan" Value="2000" Type="Constant" DataType="" /> </Restriction>
</Body>
<Footer>
<Order><Variable Name="" Direction=""/></Order>
<Modifier Limit="" Offset="" Duplication=""/>
<Output Stylesheet="" Format=""/> </Footer> </Query> </Pipe>
The UserInput element represents UserInput modules, which can
be zero or many in a pipe. Each UserInput element consists of a
set of DataItems. Each Item is described by its Value, Type
(which can be: Constant literal, Regex-Constant literal, Variable,
or Identifier), Datatype, and language tag.
The Query element serializes MashQL query modules in a pipe. A
query consists of three main parts: Header, Body, and Footer. The
Header serializes the prefixes, and some metadata about the
query, such as its title, last execution, execution time cost, among
others. The Body element serializes mainly a query subject and its
conditions. As shown in Figure 10, a restriction consists of three
elements predicate, object, and an object filter. The ObjectFilter is
defined as an abstract element (an XML datatype) that can be one
of the following complex elements {Contains, Doesn’t Contain,
Equals, Not Equal, LessThan, Not LessThan, MoreThan, Not
MoreThan, Between, Not Between, OneOf, Not OneOf,
SubQuery}.
Finally, the Footer element serializes result modifiers (such as
limit, offset, and duplication parameters), ordering preferences,
and output styles. Notice that these parameters are usually not
displayed in the main window of a MashQL query, but are
configured as "query options".
Based on this XML schema we have developed a MashQL parser,
in order to read and write MashQL queries, see the UML class
Diagram in Figure 11.
Figure 9. An example of a MashQL query and its markup.
The XML schema of the complete MashQL markup can be found
in appendix A1. This schema is used as the reference MashQL
grammar. Any markup of a MashQL query should be valid
according to this XML-Schema. Notice that the MashQL markup
is not meant to be written by hand or interpreted by humans. It is
meant to be implemented for example, as a "save as" or "export
to" functionality in MashQL tools. As we will explain in the next
section, we use this schema to automatically translate MashQL
into SPARQL. The translator takes the markup of a MashQL
query as input and generates its SPARQL equivalence. Figure 10
presents a high-level depiction of the XML schema. As shown in
this figure, a MashQL pipe consists of four sections: Meta,
RDFInput, UserInput, and Query.
The Meta section is used to represent metadata about the pipe
itself, such as its title, creator, owner, last execution, publication
status, etc. These elements are used for management and
discovery of pipes. Notice that our design is not restricted to a
Figure 10. A high-level depiction of the MashQL XML schema.
Figure 11. The Class diagram of the MashQL-markup parser.
6.2 MashQL to SPARQL Translator
As explained earlier, MashQL is used for building data mashups
diagrammatically, however, in the background MashQL is
translated into and executed as SPARQL queries. Translating
MashQL into SPARQL is done according to the mapping rules
defined in section 4. These rules are implemented by the
MashQL-to-SPARQL translator, which is a java program. This
translator takes the markup of a MashQL query and generates its
SPARQL equivalent according to these rules. In Figure 12, we
present the use case diagram of the translator. When a user
executes or debugs a MashQL query, this query is translated into
SPARQL first, then the generated SPARQL is executed.
Figure 12. Use Case diagram of the translator.
7. Use Cases
This section presents five use cases of using MashQL for building
semantic data mashups. The goal of these use cases is to
demonstrate the utility of MashQL, and evaluate how much
MashQL is able to solve some real-life problems for non-IT
experts. These use cases also suggest novel and useful Semantic
Web applications in areas like job-seeking, data integration, web
information systems, life-sciences research, and business data
auditing.
7.1 Use case: Job Seeking
Bob has a PhD in bioinformatics. He is looking for a full-time,
well paid, and research-oriented job in some European countries.
He spent an enormous amount of time searching different job
portals, each time trying many keywords and filters. Instead, Bob
used MashQL to find the job that meets his specific preferences.
Figure 13 shows Bob’s queries on Google Base and on
Jobs.ac.uk. First, he visited Google Base and performed a
keyword search (bioinformatics OR "computational biology" OR
"systems biology" OR e-health); he copied the link of the retrieved
results from Google into the RDFInput module7; and then created
a MashQL query on these results. He performed a similar task to
query Jobs.ac.uk. The third MashQL module in Figure 13, mixes
the results of the above two queries and filters them based on
location preferences (provided in the UserInput module). The
SPRQAL equivalent to Bob’s MashQL query is shown in Figure
14.
…
CONSTRUCT *
WHERE {
{{?Job :Category :Health}UNION
{?Job :Category :Medicine}}
?Job :Role ?X1.
?Job :Salary ?X2.
?X2 :Currency :UPK.
?X2 :Minimun ?X3.
FILTER(?X1=“Research” ||
?X1=”Academic”)
FILTER (?X3 > 50000) }
…
CONSTRUCT *
WHERE{?Job :JobIndustry ?X1.
?Job :Type ?X2.
?Job :Currency ?X3.
?Job :Salary ?X4.
FILTER(?X1=“Education”||
?X1=“HealthCare”)
FILTER(?X2=“Full-Time”||
?X2=“Fulltime”)||
?X2=“Contract”)
FILTER(?X3=“^Euro”||
?X3=“^€”)
FILTER(?X4>=75000||
?X4<=120000)}
…
SELECT ?Job ?Firm
WHERE {?Job :Location ?X1. ?X1 :Country ?X2.
FILTER (?X2=“Italy”||?X2=“Spain”)||
?X2=“Greece”||?X2=“Cyprus”)}
OPTIONAL{{?job :Organization ?Firm}UNION
{?job :Employer ?Firm}}
Figure 14. The SPARQL translation of the MashQL queries in
Figure 13.
7.2 Use case: eHealth Research
Alice is a PhD student in biology, she wants to search an online
eHealth database, to know what causes prostate cancer most.
First, she wrote a simple query to find all patients that certainly
have a prostate cancer i.e. every person whose prostate biopsy test
is positive. Suppose she found 3500 cases. Then she started to add
and remove restrictions to this query. Her goal was to reach the
closest number to 3500, with a maximum number of relevant
restrictions. Alice found that the restrictions shown in Figure 15
are the indicators for 90% of the people who had cancer.
Notice that we translate MashQL queries that pipe each other
using "CONSTRUCT *", which is not part of the current SPARQL
standard. However, this construct is one of the top proposed
extensions to SPARQL (see [Se]). Otherwise, if this construct will
not be included in the next version of SPARQL, our translation
will return every triple involved in the query patterns.
Figure 15. Alice’s research query.
7.3 Use case: Car Rental
This use case demonstrates a different and an offline use of
MashQL for declaring business rules in an auditing application.
Figure 13. Bob’s MashQL Queries.
7
We assume that both Google and Jobs.ac.uk render their search results in
RDFa (i.e. the RDF triples are embedded in HTML), as many companies
started to do nowadays. However, Bob can also use a third party’s
service (e.g. triplify.org) to extract triples from HTML pages.
The government usually audits whether car rental companies
comply with the local regulations. All companies have to open
their databases for auditing. The auditors visit companies
irregularly, and run a set of queries (called auditing queries) to
discover violations. As each company has different database
design and vocabulary, the auditors have to analyze and
understand the database schema, and write the auditing queries
each time from scratch. The government decided to introduce the
new semantic technology. They built a wrapper that connects to
any database and automatically converts this into one large RDF
table, this takes few minutes to complete. See a sample of such
data in Figure 16. The auditors then use the MashQL editor to
write and execute their set of auditing queries. Figure 17 shows
three examples of such auditing queries.
The first query checks whether there is a car rental that occurred
during a period that this car was not insured. The result was
{<:Rental1>} as the car was not insured during the first 3 days of
the rental. The second query checks whether there is a car rental
occurred for a customer who does not have a driving license; the
result is empty. The third query checks whether there is a car
rental occurred for a customer whose driving license does not
authorize him to drive that type of car ; the result is {<:Rental2>},
because the license of Customer2 is B, while the category of the
car he rented requires license C or higher.
<http://localhost/Company1>
:Vehicle1 rdf:type :Vehicle
:Vehicle1 :PlateNumber “ABC323”
:Vehicle1 :VehicleCategory “C”
:Customer1 :LicenseNumber “8723”
:Vehicle1 :Insurance :InsContract1
:Customer1 :LicenseType “B”
:Vehicle2 rdf:type :Vehicle
:Customer2 rdf:type :Customer
:Vehicle2 :PlateNumber “BDC987”
:Customer2 :CustomerID “3456”
:Vehicle2 :VehicleCategory “B”
:Customer2 :LicenseNumber “8723”
:Vehicle2 :Insurance :InsContract2
:Customer2:LicenseType “B”
:InsContract1 rdf:type :VehicleInsurance
:Rental1 rdf:type :Rental
:InsContract1 :InsuranceType “Full”
:Rental1 :Vehicle :Vehicle2
:InsContract1 :InsuranceSDate 23-11-2007 :Rental1 :Customer :Customer1
:InsContract1 :InsuranceEDate 22-11-2008 :Rental1 :StartOn 20-11-2007
:InsContract2 rdf:type :VehicleInsurance
:Rental1 :EndsOn 25-11-2007
:InsContract2 :InsuranceType “Full”
:Rental2 rdf:type :Rental
:InsContract2 :InsuranceSDate 10-04-2007 :Rental2 :Vehicle :Vehicle1
:InsContract2 :InsuranceEDate 11-04-2008 :Rental2 :Customer :Customer2
:Customer1 rdf:type :Customer
:Rental2 :StartsOn 15-02-2008
:Customer1 :CustomerID “6783”
:Rental2 :EndsOn 16-02-2008
Figure 16. Sample of RDF for a car rental company.
an author that contains "Bob Hacker" or "Hacker B.". Figure 21
shows the result of this query.
http://scholar.google.com/scholar?q=bob+Hacker
<g:3> :Title “Prostate Cancer”
<g:3> :Author “Hacker B., Hacker A.”
<g:4> :Title “Best and Worst Lifestyles”
<g:4> :Atuhor “Bob Hacker”
<g:4> :Cites <g:3>
<g:7> :Title “Protein Categories”
<g:7> :Atuhor “Bob Smith”
<g:7> :Cites <g:3>
<g:7> :Cites <g:4>
<g:8> :Title “Cancer Vaccines”
<g:8> :Atuhor “Alice Hacker”
<g:8> :Cites <g:3>
http://www.citeseer.com/search?s=“Bob Hacker”
_:1 :Title “Prostate Cancer”
_:1 :Author “Hacker B., Hacker A.”
_:2 :Title “Protocols in Molecular Biology”
_:2 :Atuhor “Bob Hacker”
_:2 :ArticleCited _:1
_:3 :Title “Cancer Vaccines”
_:3 :Atuhor “Eve Lee, Bob Hacker”
_:4 :Title “Overview about Systems Biology”
_:4 :Atuhor “Tom Lara”
_:4 :ArticleCited _:1
_:4 :ArticleCited _:2
Figure 18. Sample of RDF data about Bob’s articles.
Figure 19. A MashQL query to mashup citation from different sites.
PREFIX s1: http://scholar.google.com/scholar?q=bob+Hacker
PREFIX s2: http://www.citeseer.com/search?s=“Bob Hacker
SELECT CitingArticle? ?CitedArticle
From <http://scholar.google.com/scholar?q=bob+Hacker>
From <http://www.citeseer.com/search?s=“Bob Hacker”>
WHERE {
{{?X1 s1:Title ?CitingArticle} UNION {?X1 s2:Title ?CitingArticle}}
{{?X1 s1:Author ?X2} UNION {?X1 s2:Author ?X2}}
{{?X1 s1:Cites ?X3} UNION {?X1 s2:ArticleCited ?X3}}
{{?X3 S1:Title ?CitedArticle} Union {?X3 S2:Title ?CitedArticle}
{{?X3 s1:Author ?X4} UNION {?X3 s2:Author ?X4}}
FILTER (regex(?X2,”^Bob Hacker”) || regex(?X2,”^Hacker B.”))}
FILTER Not(regex(?X4,”^Bob Hacker”) || regex(?X4,”^Hacker B.”)) }
Figure 20. The SPARQL translation.
(b)
(a)
CitingArticle
CitedArticle
Protein Categories
Prostate Cancer
Protein Categories
Best and Worst Lifestyles
Cancer Vaccines
Prostate Cancer
Overview about Systems Biology
Prostate Cancer
Overview about Systems Biology
Protocols in Molecular Biology
Figure 21. The query results.
7.5 Use case: Fnac retailer
(c)
Figure 17. Examples of Auditing Queries.
7.4 Use case: Citations List
Bob would like to compile the list of articles that cited his articles
(excluding what he cited himself). He built a mashup using
MashQL to mix his citations retrieved from both Google Scholar
and CiteSeer, and then filter out the self-citations. First, he
performed a keyword search (“Bob Hacker”) on both Google
Scholar and CiteSeer8. Figure 18 shows a sample of the extracted
RDF triples. Bob’s MashQL query is shown in Figure 19, and its
SPARQL equivalent in Figure 20. In this query, Bob wrote:
retrieve every article that has a title (call it CitingArticle), has an
author that does not contain "Bob Hacker" or "Hacker B.", and
cites another article that has a title (call it CitedArticle), and has
8
Similar to the previous use case, we also assume that Google Scholar’s
and CiteSeer’s retrieved results are formatted in RDFa.
Fnac is a large retailer of cultural and consumer electronics
products. When a new product arrives to Fnac, it has to be entered
to the inventory database. This is usually done by scanning the
barcode on each product, and then manually filling the product
specifications. Furthermore, as Fnac trades in many countries,
their product specifications have to be translated into several
languages. To save time entering and translating information
manually, Fnac decided to reuse the product data specifications
(and their translation) that are produced at the factory side. For
example, suppose Fnac received three packages from Cannon,
Alfred, and IMDB. Fnac would like to scan the barcode of the
received products and then get their specifications directly from
the online catalogues of those suppliers. In Figure 22 we show
samples of online product catalogues of the three suppliers (we
assume they are published in RDFa). Figure 23 illustrates a query
that Fnac built to look up the multilingual titles of three products.
This query is a mashup of three RDF data sources with a userinput of three barcode numbers. The query takes each of these
barcodes and finds the English and French titles. Notice that Fnac
assumed that short titles provided by Cannon are in English, thus,
they are joined with the other titles that are tagged with "@en".
See the retrieved results in Figure 25. In this same way, a barcode
reader could be connected with user-input module, to retrieve the
specifications (which could be stored at the supplier side) each
time a product is scanned.
http:www.cannon/products/rdf
_:P1 :ShortName “CanScan 4400F”
_:P1 :FullName “Canon CanoScan
4400F Color Image Scanner”
_:P1 :Producer “Canon”
_:P1 :ShippingWeight> “4 pounds”
_:P1 :Barcode 9780133557022
_:P2 :ShortName “PowerShot SD100”
_:P2 :FullName “Canon PowerShot
SD10007.1MP Camera 3x Zoom”
_:P2 :Producer “Canon”
_:P2 :ShippingWeight> “2 pounds”
_:P2 :Barcode 9781143557532
http://www.alfred.com/books
<:B1> :Type <:Book>
<:B1> :Title “The Prophet”@en
<:B1> :Title “Le prophète”@fr
<:B1> :BCode 8765422097653
<:B1> :Authors “Kahlil Gibran”
<:B1> :ISBN-10 0394404289
<:B3> :Type <:Book>
<:B3> :Title “Alfred Nobel”@en
<:B3> :Title “Alfred Nobel”@fr
<:B3> :BCode 75639898123
<:B3> :Authors “Kenne Fant”
<:B3> :ISBN- 0531123286
http://www.imdb.com/movies
_:1 rdf:Type <:Movie>
_:1 :Title “All about my mother”@en
_:1 :Title “Tout sur ma mère”@fr
_:1 :ProdCode 3248765355133
_:1 :NumberOfDiscs: 1
_:2 rdf:Type <:Movie>
_:2 :Title “Lords of the rings”@en
_:2 :Title “Seigneur des anneaux”@f
_:2 : ProdCode 4852834058083
_:2 :NumberOfDiscs: 3
Figure 22. Sample of RDF data about products
Figure 23. A query to mashup product titles from different resources.
PREFIX s1: <http:www.cannon/products/rdf>
PREFIX s2: <http://www.alfred.com/books>
PREFIX s3: <http://www.imdb.com/movies>
SELECT ?Barcode ?EnglishTitle ?FrenchTitle
FROM <http:www.cannon/products/rdf>
FROM <http://www.alfred.com/books>
FROM <http://www.imdb.com/movies>
WHERE{
{{?x s1:Barcode ?Barcode}UNION
{?x s2:Bcode ?Barcode}UNION
{?x s3:Prodcode ?Barcode}}
FILTER (regex(?Barcode, “9781143557532”) ||
regex(?Barcode, “8765422097653”) ||
regex(?Barcode, “3248765355133)”).
{OPTIONAL {?x s1:ShortName ?EnglishTitle}}UNION
{{OPTIONAL {?x s1:Title ?EnglishTitle}} UNION
{OPTIONAL {?x s2:Title ?EnglishTitle}}
FILTER (lang(?EnglishName) = ”en”)}
{{OPTIONAL {?x s1:Title ?FrenchTitle}} UNION
{OPTIONAL {?x s2:Title ?FrenchTitle}}
FILTER (lang(?FrenchTitle) = ”fr”)}}
Figure 24. The SPARQL equivalent of Figure 23.
EnglishTitle
FrenchTitle
Barcode
9781143557532
CanScan 4400F
8765422097653
The Prophet
Le prophète
3248765355133
All about my mother
Tout sur ma mère
Figure 25. Retrieved product titles.
8. Discussion
8.1 Coverage and Elegance
We have learned from these use cases that some MashQL
constructs (especially the OneOf, Between, and the Union “\”) are
useful to have in the language, because they were used in most
cases. On the other hand, we found that some important constructs
are missing, specially the aggregation functions. Thus, we decided
to extend our approach and use Oracle’s SPARQL, instead of
only the W3C’s SPARQL standard. Oracle’s SPARQL inherits all
functionalities of SQL, including aggregation functions, grouping,
among others.
All SPARQL constructs are somehow supported by MashQL; and
all MashQL constructs can be easily expressed in SPARQL. Some
constructs are mapped directly, but some other constructs require
lengthy emulation. This shows indeed that MashQL constructs are
more concise and intuitive than SPARQL. For example, SPARQL
query scripts containing the union operator require almost twice
the size than their MashQL equivalents. The OneOf operator,
which is represented as a set of constants in MashQL, is emulated
by long SPARQL filters.
The main limitation of MashQL’s elegancy is that in case an RDF
data source contains unwieldy terms, this may yield inelegant
MashQL queries. For example, suppose you want to query this
dataset: {<:C1,:eno,AB12345> <:C1,:eid,987665> <:C1,:efname,Bob>}. The
technical labels of predicates ˗which are difficult to understand by
people that did not create them˗ may yield inelegant or technical
queries. However, we believe that if an RDF source is made
public for external users, predicate vocabularies will be clear and
mostly based on standard RDF vocabularies (such as Dublin Core,
FOAF, GeoRSS, etc.), or based on a shared ontology [J05].
Furthermore, the use cases also taught us that representing queries
visually (as modules connected through pipes) helps non-IT users
to visualize and understand the information flow in their queries.
In other words, this visualization facilitates users to organize and
modularize their queries at a high level of abstraction. For
example, one may notice that Bob can create only one query and
get the same results of the three queries shown in Figure 13.
However, instead of building such a complex query, Bob
preferred to modularize his queries in this way, as it is easier for
him to build and maintain. In short, modularizing queries and
piping them in this way is easier to abstract for non-IT experts.
8.2 Performance and Complexity
Suppose one asks how long it takes to execute the query shown in
Figure 2. The answer is: exactly the same time the query shown in
Figure 1 would take, which is its SPARQL translation. In other
words, although users build and view their mashups using
MashQL, but MashQL queries are not executed themselves, their
SPARQL translation that is executed. The time complexity of
translating a MashQL query into SPARQL is neglectable as it
takes less than a second. Thus, the complexity of executing a
MashQL query is bounded to the complexity and performance of
its backend query language, which is SPARQL in our case.
Executing a SPARQL query over a huge RDF dataset is indeed
very fast. As shown by Oracle in [CDES] and AllegroGraph [All],
a query with a medium size complexity over hundreds of millions
of triples takes only one or few seconds. Oracle has proven indeed
that both querying and bulk-loading of RDF data is scalable
[CDES, DCWAS]. However, a problem may arise in case of
querying remote RDF sources. When a user defines a query over
a set of remote sources, as shown in Figure 1, these sources must
be transferred and stored locally before executing the query
[TPM]. Hence, the time required to execute a query depends
mainly on the amount of the transferred data and the transferring
bandwidth. The performance of using SPARQL (and thus
MashQL) to mash up large remote sources might be unacceptable
in case these sources are very large. However, in most application
scenarios in practice (see our use cases), people would build
mashups on acceptable sizes of data sources. For example,
querying HTML pages that contain RDFa, small RDF files, or
retrieved results from online RDF stores that support SPARQL
endpoints, among many other scenarios. As this topic is related to
the performance of SPARQL, rather than MashQL, we tackle it
separately; see the future work section.
Furthermore, recall that when formulating a MashQL query, there
are some background queries that are used to generate drop-down
lists (see section 3). These queries should be fast enough to allow
efficient query formulation. For example, after specifying the
RDF sources to be queried, users can select the query subject,
which is offered through a drop-down list that is dynamically
generated from the union of the subject and object identifiers in
the data sources. Similarly, predicates and objects are selected
from dynamic lists (see section 3). In general, for a query with n
restrictions, there are at most (2n+1) queries that will be
performed in the background, i.e. during the query formulation
process. The performance of these queries should be fast: as soon
as a user selects from a list and moves the mouse to select from
another list, this list should be ready. As discussed above, in case
the queried sources are stored locally, the performance of the
background queries is indeed fast, as each query takes only one
second. However, in case of remote input sources, these sources
need to be transferred and stored locally, before the query
formulation process starts. Therefore, query formulation in case
of large remote sources might be unacceptable. In section 7, we
shall discuss our future plans to overcome this challenge, which a
SPARQL rather than a MashQL issue.
9. Evaluation
Coming Soon
10. Conclusions and Future Work
SPARQL. Furthermore, as querying large remote sources using
SPARQL matters (see our discussion in the section 5.2), we are
developing a framework called Semantic Web Pipes. This
framework extends MashQL and allows caching remote sources,
materializing query result, distributing queries, publishing, and
discovery of queries, among other issues. Last but yet important,
we also plan to extract schemas from RDF sources and depict
these schemas using ORM and based on [J07a, J07b], which
would be an extra offline support for MashQL users.
Acknowledgement
We are indebted to George Pallis for his valuable comments and
feedback on the early drafts of this paper. This research is
partially supported by the SEARCHiN project (FP6-042467,
Marie Curie Actions).
References
[ACK] Athanasis, N., Christophides, V., and Kotzinos, D.: Generating On
the Fly Queries for the Semantic Web. In Proceedings of the ISWC.
LNCS, Springer. 2004.
[AD] Abiteboul S, Duschkal O: Complexity of Answering Queries Using
Materialized Views. ACM SIGACT-SIGMOD-SIGART. (1998)
[All] AllegroGraph http://www.franz.com/products/allegrograph (May
2008)
[BH] Bloesch, A. and Halpin, T.: Conceptual Queries using ConQuer–II.
In Proceedings of the ER. LNCS, Springer. 1997.
[BHB] Bizer C, Heath T, Berners-Lee T:Linked Data: Principles and State
of the Art. WWW. 2008
[CDES] Chong, E., Das, S., Eadon, G., and Srinivasan, J.: An efficient
SQL-based RDF querying scheme. In Proceedings of VLDB. LNCS,
Springer. 2005
[DCWAS] Das, S., Chong, E., Wu, Z., Annamalai, M., and Srinivasan, J.:
A Scalable Scheme for Bulk Loading Large RDF Graphs into Oracle.
In Proceedings of the ICDE. ACM. 2008.
[CERE] Czejdo, B., Elmasri, R., Rusinkiewicz, M., and Embley, D.: An
algebraic language for graphical query formulation using an EER
model. In Proceedings of the ACM Conference on Computer Science.
1987.
In this article, we have proposed a query-by-diagram language in
order to allow building data mashups intuitively. Not only
MashQL is user-friendly for non-IT people, but also it allows
querying (and navigating) RDF data sources without having to
know the schema or the technical details of these sources. We
have demonstrated the use of MashQL using different use cases,
and discussed the lessons learned regarding the elegancy,
coverage, and performance of MashQL. As we have discussed
and demonstrated, MashQL is not merely an interface of
SPARQL. Although it can be used as such, but it can be used also
as a general query language by its own. In addition, MashQL can
be used also for filtering metadata streams. In this article, we
focus on using the W3C’s SPARQL standard as the backend
query language.
[DMV] De Troyer, O., Meersman, R., and Verlinden, P.: RIDL on the
CRIS Case: A Workbench for NIAM. In Proceedings of the IFIP.8.1
Conference. 1988.
However, we plan to also translate MashQL into Oracle’s
SPARQL. This translation is being implemented as an AJAX
web-based plug-in to Oracle 11g. Choosing Oracle is not only
because of its scalability, but also because Oracle’s SPARQL
inherits all functionalities of SQL, including aggregation and
grouping functions, which are not supported in the standard
[J07b] Jarrar, M.: Mapping ORM into the SHOIN/OWL Description Logic
-Towards a Methodological and Expressive Graphical Notation for
Ontology Engineering. In Proceedings of the OTM’07 Workshops.
LNCS, Springer. 2007
[HPW] Hofstede, A., Proper, H., and van der Weide, T.: Computer
Supported Query Formulation in an Evolving Context. In Proceedings
of the ADC. 1995.
[I] Iskold, A.: Semantic Web: Difficulties with the Classic Approach. The
ReadWriteWeb magazine. Sep. 19, 2007
[iS] iSPARQL http://demo.openlinksw.com/isparql (May 2008)
[J05] Jarrar, M.: Towards Methodological Principles for Ontology
Engineering. PhD Thesis. Vrije Universiteit Brussel. May 2005
[J07a] Jarrar, M.: Towards Automated Reasoning on ORM Schemes Mapping ORM into the DLRidf Description Logic. In Proceedings of
the ER. LNCS, Springer. 2007.
[JD08] Jarrar, M., and Dikaiakos, M. D.: MashQL: A Query-By-Diagram
Language for Data Mashups. Technical Article (No. TAR200805).
University
of
Cyprus,
http://www.jarrar.info/publications/JD08.pdf
2008
[KB] Kaufmann, E., and Bernstein, A.: How Useful Are Natural Language
Interfaces to the Semantic Web for Casual End-Users. In Proceedings
of the ISWC. LNCS, Springer. 2007.
[M] Mathieson SA: Spread the word, and join it up. The Guardian. April
05 2006.
[MPTL] Morbidoni C, Polleres A, Tummarello G, LePhuoc D: Semantic
Web Pipes. Technical Report. DERI (2007)
[O] O'Donoghue, J.: MySpace joins the 'semantic' web. The Web User
online magazine. May 9, 2008
[Ot] O'Reilly, T.: http://radar.oreilly.com/archives/ 2007/02/pipes-andfilters-for-the-inte.html (May 2008)
[PAG] Perez, J., Arenas, M., and Gutierrez, C.: Semantics and Complexity
of SPARQL. In Proceedings of the ISWC. LNCS, Springer. 2006
[P] Prud’hommeaux, E. (ed.): SPARQL Query Language for RDF, W3C
Working Draft, 4 Oct. 2006
[P07] Polleres, A.: From SPARQL to rules (and back). In proceedings of
the WWW. ACM. 2007
[PS] Parent, C., and Spaccapietra, S.: About Complex Entities, Complex
Objects and Object-Oriented Data Models. In Proceedings of the IFIP
8.1 conference. 1989
[Ra] RFAuthor (May 2008)
http://rdfweb.org/people/damian/2001/10/RDFAuthor
[RSBS] Russell, A., Smart, R., Braines, D., and Shadbolt, R.:
NITELIGHT: A Graphical Tool for Semantic Query Construction. In
Proceedings of the Semantic Web User Interaction Workshop
(SWUI). 2008.
[Se]
SPARQL
Extensions
(March
2008)
http://esw.w3.org/topic/SPARQL/Extensions?highlight=%28sparql%29
[Sm] SPARQLMotion http://www.topquadrant.com/sparqlmotion (May
2008)
[TPM] Tummarello, G., Polleres, A., and Morbidoni, C.: Who the FOAF
knows Alice? A needed step toward Semantic Web Pipes. In
Proceedings of the ISWC Workshops. 2007
[Wr] W3C RDB2RDF Incubator Group (Aug. 08)
http://www.w3.org/2005/Incubator/rdb2rdf/
[Y] Yahoo Blog (Mar.08) http://www.ysearchblog.com/ archives/
000527.html
[Z] Zloof, M.: Query-by-Example: A Data Base Language. IBM Systems
Journal, 16(4). 1977
[ZGHW] Zhuge Y, Garcia-Molina H, Hammer J, Widom J: View
Maintenance in a Warehousing Environment. SIGMOD Conf. (1995)
Appendices
A1. MashQL XML Schema
<?xml version="1.0" encoding="UTF-8"?>
</xs:extension>
<xs:element name="Output">
<xs:schema
xmlns:xs="http://www.w3.org/2001/XMLSchema" </xs:complexContent>
<xs:complexType>
elementFormDefault="qualified" attributeFormDefault="unqualified">
</xs:complexType>
<xs:attribute name="Format"/>
<xs:complexType name="ObjectFilter" abstract="true"/>
<xs:complexType name="NotOneOf">
<xs:attribute name="Stylesheet"/>
<xs:complexType name="NotContains">
<xs:complexContent>
</xs:complexType>
<xs:complexContent>
<xs:extension base="ObjectFilter">
</xs:element> </xs:sequence> </xs:complexType>
<xs:extension base="ObjectFilter">
<xs:choice>
<xs:complexType name="Restriction">
<xs:attribute name="Value"/>
<xs:element name="DataItem" maxOccurs="unbounded">
<xs:sequence>
<xs:attribute name="Type"/>
<xs:complexType>
<xs:element name="Predicate" maxOccurs="unbounded">
<xs:attribute name="DataType"/>
<xs:attribute name="Value"/>
<xs:complexType>
<xs:attribute name="Language"/>
<xs:attribute name="Type"/>
<xs:attribute name="Name"/>
</xs:extension>
<xs:attribute name="DataType"/>
<xs:attribute name="Type"/>
</xs:complexContent>
<xs:attribute name="Language"/>
<xs:attribute name="isReturn" type="xs:boolean"/>
</xs:complexType>
</xs:complexType>
</xs:complexType>
<xs:complexType name="Contains">
</xs:element>
</xs:element>
<xs:complexContent>
<xs:element name="UserInput">
<xs:element name="Object" maxOccurs="unbounded">
<xs:extension base="ObjectFilter">
<xs:complexType>
<xs:complexType>
<xs:attribute name="Value"/>
<xs:attribute name="ModuleID"/>
<xs:attribute name="Name"/>
<xs:attribute name="Type"/>
</xs:complexType>
<xs:attribute name="Type"/>
<xs:attribute name="DataType"/>
</xs:element>
<xs:attribute name="isReturn" type="xs:boolean"/>
<xs:attribute name="Language"/>
</xs:choice>
</xs:complexType>
</xs:extension>
</xs:extension>
</xs:element>
</xs:complexContent>
</xs:complexContent>
<xs:element
name="ObjectFilter"
type="ObjectFilter"
</xs:complexType>
</xs:complexType>
minOccurs="0"/>
<xs:complexType name="NotEquals">
<xs:complexType name="OneOf">
</xs:sequence>
<xs:complexContent>
<xs:complexContent>
<xs:attribute name="Prefix"/>
<xs:extension base="ObjectFilter">
<xs:extension base="ObjectFilter">
</xs:complexType>
<xs:attribute name="Value"/>
<xs:choice>
<xs:element name="Pipe">
<xs:attribute name="Type"/>
<xs:element name="DataItem" maxOccurs="unbounded">
<xs:complexType>
<xs:attribute name="DataType"/>
<xs:complexType>
<xs:sequence>
<xs:attribute name="Language"/>
<xs:attribute name="Value"/>
<xs:element
name="Meta"
type="Meta"
minOccurs="0"
</xs:extension>
<xs:attribute name="Type"/>
maxOccurs="unbounded"/>
</xs:complexContent>
<xs:attribute name="DataType"/>
<xs:element name="RDFInput" type="RDFInput" minOccurs="0"
</xs:complexType>
<xs:attribute name="Language"/>
maxOccurs="unbounded"/>
<xs:complexType name="Equals">
</xs:complexType>
<xs:element name="UserInput" type="UserInput" minOccurs="0"
<xs:complexContent>
</xs:element>
maxOccurs="unbounded"/>
<xs:extension base="ObjectFilter">
<xs:element name="UserInput">
<xs:element
name="Query"
type="Query"
minOccurs="0"
<xs:attribute name="Value"/>
<xs:complexType>
maxOccurs="unbounded"/>
<xs:attribute name="Type"/>
<xs:attribute name="ModuleID"/>
</xs:sequence>
<xs:attribute name="DataType"/>
</xs:complexType>
<xs:attribute name="ID" type="xs:integer" use="required"/>
<xs:attribute name="Language"/>
</xs:element>
</xs:complexType>
</xs:extension>
</xs:choice>
</xs:element>
</xs:complexContent>
</xs:extension>
<xs:complexType name="Meta">
</xs:complexType>
</xs:complexContent>
<xs:attribute name="Name"/>
<xs:complexType name="NotLessThan">
</xs:complexType>
<xs:attribute name="Content"/>
<xs:complexContent>
<xs:complexType name="SubQuery">
</xs:complexType>
<xs:extension base="ObjectFilter">
<xs:sequence>
<xs:complexType name="RDFInput">
<xs:attribute name="Value"/>
<xs:element name="Restrictions" type="Restriction"/>
<xs:sequence>
<xs:attribute name="Type"/>
</xs:sequence>
<xs:element name="Source" maxOccurs="unbounded">
<xs:attribute name="DataType"/>
</xs:complexType>
<xs:complexType>
</xs:extension>
<xs:complexType name="Body">
<xs:sequence>
</xs:complexContent>
<xs:sequence>
<xs:element name="Ref" type="xs:anyURI"/>
</xs:complexType>
<xs:element name="Subject">
<xs:element name="LastUpload" type="xs:dateTime"/>
<xs:complexType name="LessThan">
<xs:complexType>
<xs:element name="CashedAt" minOccurs="0"/>
<xs:complexContent>
<xs:attribute name="Name" use="required"/>
</xs:sequence>
<xs:extension base="ObjectFilter">
<xs:attribute name="Type" use="required"/>
<xs:attribute name="Order" type="xs:integer"/>
<xs:attribute name="Value"/>
<xs:attribute name="isRetrun" type="xs:boolean" use="required"/>
</xs:complexType>
</xs:element> </xs:sequence>
<xs:attribute name="Type"/>
</xs:complexType>
<xs:attribute name="ID" type="xs:integer" use="required"/>
<xs:attribute name="DataType"/>
</xs:element>
<xs:attribute name="Y1" type="xs:integer"/>
</xs:extension>
<xs:element
name="Restriction
"
type="Restriction" <xs:attribute name="X1" type="xs:integer"/>
</xs:complexContent>
maxOccurs="unbounded"/>
<xs:attribute name="X2" type="xs:integer"/>
</xs:complexType>
</xs:sequence>
<xs:attribute name="Y2" type="xs:integer"/>
<xs:complexType name="NotBetween">
</xs:complexType>
</xs:complexType>
<xs:complexContent>
<xs:complexType name="Header">
<xs:complexType name="UserInput">
<xs:extension base="ObjectFilter">
<xs:sequence>
<xs:sequence>
<xs:attribute name="minValue"/>
<xs:element name="Meta" type="Meta" minOccurs="0"/>
<xs:element name="DataItem" maxOccurs="unbounded">
<xs:attribute name="maxValue"/>
<xs:element
name="Prefix"
minOccurs="0"
<xs:complexType>
<xs:attribute name="Type"/>
maxOccurs="unbounded">
<xs:attribute name="Value"/>
<xs:attribute name="DataType"/>
<xs:complexType>
<xs:attribute name="Type"/>
</xs:extension> </xs:complexContent> </xs:complexType>
<xs:attribute name="Tag" use="required"/>
<xs:attribute name="DataType"/>
<xs:complexType name="Between">
<xs:attribute name="Ref" use="required"/>
<xs:attribute name="Langauge"/>
<xs:complexContent>
</xs:complexType>
</xs:complexType>
</xs:element> </xs:sequence>
<xs:extension base="ObjectFilter">
</xs:element>
<xs:attribute name="ID" type="xs:integer" use="required"/>
<xs:attribute name="minValue"/>
</xs:sequence> </xs:complexType>
<xs:attribute name="X1" type="xs:integer"/>
<xs:attribute name="maxValue"/>
<xs:complexType name="Footer">
<xs:attribute name="Y1" type="xs:integer"/>
<xs:attribute name="Type"/>
<xs:sequence>
<xs:attribute name="X2" type="xs:integer"/>
<xs:attribute name="DataType"/>
<xs:element name="Order">
<xs:attribute name="Y2" type="xs:integer"/>
</xs:extension> </xs:complexContent> </xs:complexType>
<xs:complexType>
</xs:complexType>
<xs:complexType name="NotMoreThan">
<xs:sequence>
<xs:complexType name="Query">
<xs:complexContent>
<xs:element name="Variable">
<xs:sequence>
<xs:extension base="ObjectFilter">
<xs:complexType>
<xs:element name="Header" type="Header" minOccurs="0"/>
<xs:attribute name="Value"/>
<xs:attribute name="Name"/>
<xs:element name="Body" type="Body"/>
<xs:attribute name="Type"/>
<xs:attribute name="Direction"/>
<xs:element name="Footer" type="Footer" minOccurs="0"/>
<xs:attribute name="DataType"/>
</xs:complexType>
</xs:element>
</xs:sequence>
</xs:extension> </xs:complexContent> </xs:complexType>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:attribute name="ID" type="xs:integer" use="required"/>
<xs:complexType name="MoreThan">
<xs:element name="Modifier">
<xs:attribute name="InputModule" type="xs:integer" use="required"/>
<xs:complexContent>
<xs:complexType>
<xs:attribute name="X1" type="xs:integer"/>
<xs:extension base="ObjectFilter">
<xs:attribute name="Duplication"/>
<xs:attribute name="Y1" default="xs:integer"/>
<xs:attribute name="Value"/>
<xs:attribute name="Limit"/>
<xs:attribute name="X2" type="xs:integer"/>
<xs:attribute name="Type"/>
<xs:attribute name="Offset"/>
<xs:attribute name="Y2" type="xs:integer"/>
<xs:attribute name="DataType"/>
</xs:complexType>
</xs:complexType>
</xs:element>
</xs:schema>
A2. The markup of the Job-Seeking use case
<?xml version="1.0" encoding="UTF-8"?>
<Pipe ID="0" xsi:noNamespaceSchemaLocation="MashQL.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Meta Content="Title" Name="Bob's Jobs"/>
<Meta Content="Creator" Name="Bob Hacker"/>
<Meta Content="DateCreated" Name="2008-05-17T09:30:47.0Z"/>
<Meta Content="DateExecuted" Name="2008-05-18T08:22:10.0Z"/>
<Meta Content="UserID" Name="012345"/>
<RDFInput ID="1" X1="10" Y1="10" X2="110" Y2="60">
<Source Order="0">
<Ref>http://base.google.com/jobs/rdf</Ref>
<LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
</Source>
</RDFInput>
<RDFInput ID="2" X1="210" Y1="10" X2="210" Y2="60">
<Source Order="0">
<Ref>http://jobs.ac.uk/rdf</Ref>
<LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
</Source>
</RDFInput>
<RDFInput ID="6" X1="210" Y1="10" X2="210" Y2="60">
<Source Order="0">
<Ref>:4</Ref>
<LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
</Source>
<Source Order="0">
<Ref>:5</Ref>
<LastUpload>2001-12-17T09:30:47.0Z</LastUpload>
</Source>
</RDFInput>
<UserInput ID="3" X1="230" Y1="160" X2="300" Y2="280">
<DataItem Value="UK" Type="Constant" DataType="" Langauge=""/>
<DataItem Value="Belgium" Type="Constant" DataType="" Langauge=""/>
<DataItem Value="Germany" Type="Constant" DataType="" Langauge=""/>
<DataItem Value="Austria" Type="Constant" DataType="" Langauge=""/>
<DataItem Value="Holland" Type="Constant" DataType="" Langauge=""/>
</UserInput>
<Query ID="4" X1="0" Y1="70" X2="150" Y2="170" InputModule="1">
<Header>
<Meta Content="Title" Name="Google Jobs"/>
<Prefix Ref="" Tag=""/>
</Header>
<Body>
<Subject Name="Job" Type="Variable" isRetrun="false"/>
<Restriction Prefix=" ">
<Predicate Name="JobIndustry" Type="Identifier" isReturn="false" />
<Object Name="X1" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="OneOf" >
<DataItem Value="Education" Type="Constant" DataType="" Language="" />
<DataItem Value="HealthCare" Type="Constant" DataType="" Language="" />
</ObjectFilter>
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="Type" Type="Identifier" isReturn="false" />
<Object Name="X2" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="OneOf" >
<DataItem Value="Full-Time" Type="Constant" DataType="" Language="" />
<DataItem Value="FullTime" Type="Constant" DataType="" Language="" />
<DataItem Value="Contract" Type="Constant" DataType="" Language="" />
</ObjectFilter>
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="Currency" Type="Identifier" isReturn="false" />
<Object Name="X3" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="OneOf" >
<DataItem Value="^Euro" Type="Constant" DataType="" Language="" />
<DataItem Value="^€" Type="RegexConstant" DataType="" Language="" />
</ObjectFilter>
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="Salary" Type="Identifier" isReturn="false" />
<Object Name="X4" Type="Variable" isReturn="true" />
<ObjectFilter xsi:type="Between" minValue="75000" maxValue="120000"
Type="Constant" DataType="" />
</Restriction>
</Body>
<Footer>
<Order>
<Variable Name="" Direction=""/>
</Order>
<Modifier Limit="" Offset="" Duplication=""/>
<Output Stylesheet="" Format=""/>
</Footer>
</Query>
<Query ID="5" X1="160" Y1="70" X2="300" Y2="170" InputModule="2">
<Header>
<Meta Content="Title" Name="Jobs.uk"/>
<Prefix Ref="" Tag=""/>
</Header>
<Body>
<Subject Name="Job" Type="Variable" isRetrun="false"/>
<Restriction Prefix=" ">
<Predicate Name="Category" Type="Identifier" isReturn="false" />
<Object Name="X5" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="OneOf" >
<DataItem Value="Health" Type="Constant" DataType="" Language="" />
<DataItem Value="BioSciences" Type="Constant" DataType="" Language="" />
</ObjectFilter>
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="Type" Type="Identifier" isReturn="false" />
<Object Name="X2" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="Equals" Value="Research/Academic" Type="
Constant" DataType="" Language="" />
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="SalaryCurrency" Type="Identifier" isReturn="false" />
<Object Name="X6" Type="Variable" isReturn="false" />
<ObjectFilter xsi:type="Equals" Value="UPK" Type="Constant" DataType="" Language="" />
</Restriction>
<Restriction Prefix=" ">
<Predicate Name="Salary" Type="Identifier" isReturn="false" />
<Object Name="X4" Type="Variable" isReturn="true" />
<ObjectFilter xsi:type="MoreThan" Value="50000" Type="Constant" DataType="" />
</Restriction>
</Body>
<Footer>
<Order>
<Variable Name="" Direction=""/>
</Order>
<Modifier Limit="" Offset="" Duplication=""/>
<Output Stylesheet="" Format=""/>
</Footer>
</Query>
<Query ID="7" X1="80" Y1="180" X2="200" Y2="250" InputModule="6">
<Header>
<Meta Content="Title" Name="Job Seeking"/>
</Header>
<Body>
<Subject Name="Job" Type="Variable" isRetrun="true"/>
<Restriction Prefix="">
<Predicate Name="Location" Type="Identifier" isReturn="false" />
<Object Name="X9" Type="Variable" isReturn="false"/>
<ObjectFilter xsi:type="OneOf" >
<UserInput ModuleID="3"/>
</ObjectFilter>
</Restriction>
</Body>
<Footer>
<Order>
<Variable Name="" Direction=""/>
</Order>
<Modifier Limit="" Offset="" Duplication=""/>
<Output Stylesheet="" Format=""/>
</Footer>
</Query>
</Pipe>