A Layered Graph Model for Internet Descriptions

by A. R. Kakebeen; Radboud University of Nijmegen 2004
1. Introduction
As the Internet grows at a frightening rate, the need for order in this chaos
becomes greater. The field of Information Retrieval attempts to index huge amounts
of data and to define efficient ways to search it. The problem with the Internet is
that it is very loosely structured. We can find both professionally made and amateur
websites, in various styles and layouts, covering any kind of information imaginable. But the
Internet does not have to be complete chaos. In the next evolutionary step of the
Internet, the “Semantic Web” [5], huge amounts of data will be made available along
with descriptive information called metadata. It will be easier to automatically process
available Web content and services through better knowledge about the meaning,
usage, accessibility and quality of available Web resources.
The Resource Description Framework (RDF) [6] enables the creation and exchange of
resource metadata just like any other Web content. It provides three things:
1. a Standard Representation Language for metadata based on directed labeled
graphs in which nodes are called resources and edges are called properties;
2. a Schema Definition Language (RDFS) for creating vocabularies of labels for
these graph nodes and edges. These are called classes and property types,
respectively;
3. an XML syntax for expressing metadata and schemas in a form both readable
and understandable by humans and machines.
G. Karvounarakis et al. [2] describe a declarative language for querying
these RDF resource descriptions and related schemas. This paper is inspired by that
article, but not heavily based on it. Instead of following every step made in the
article, we will delve into the basis of RDF schemas: graph theory. We will try
to answer the question: how can internet data be described using graph
theory?
In chapter 2, we will look at pure graph theory, along with a few examples that
illustrate its use. It suffices for solving certain problems, but describing something as
complex as the Internet requires a richer model.
Chapter 3 expands on the first model, introducing directed, labeled graphs. This is
close to the theory described by P. van Bommel in his Elementary coordination and
modelling of semistructured databases [3].
Next, we gently introduce typing in chapter 4. We expand our graph model with
several new graphs called type graphs, useful for defining various different outlooks
on the same data. That can be very handy if people with different backgrounds and
objectives need to use the information system.
Finally, chapter 5 shows several options for further improving the type system defined
in chapter 4. We will briefly look at these options and offer suggestions to define a
final model.
2. Graphs
2.1 Königsberg Bridge Problem
Graphs can be used to mathematically model and solve various concrete, practical
problems. An example is the famous Königsberg Bridge Problem, which was
imaginatively solved by the Swiss mathematician Leonhard Euler (1707-1783). It
involved seven bridges spanning a river in the capital city of East Prussia, nearly three
hundred years ago. The problem was to find a route that crossed each bridge
exactly once (without resorting to swimming of course – the nobles touring town
wouldn’t want to get their expensive clothing all wet) and ended at the starting
position.
Figure 1: Königsberg Bridge Problem
The first step in solving this problem is representing the complex city of Königsberg
as just a few dots and lines, where each dot is a land mass and each line is a bridge
between two land masses. The second step is finding a path (a so-called Eulerian circuit in
this case) that meets all requirements or, as Euler did, proving that it can’t be done!
Figure 2: Graph for the Königsberg Bridge Problem
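The impossibility proof can be checked mechanically. The sketch below assumes the standard historical layout of the seven bridges and uses Euler's criterion: a connected multigraph has an Eulerian circuit exactly when every vertex has even degree. The single-letter land mass names and the function names are illustrative, not taken from the figures.

```python
from collections import Counter

# Land masses of Königsberg (illustrative names):
# A = the island, B and C = the two river banks, D = the eastern land mass.
BRIDGES = [
    ("A", "B"), ("A", "B"),   # two bridges from the island to one bank
    ("A", "C"), ("A", "C"),   # two bridges from the island to the other bank
    ("A", "D"),               # one bridge between the island and the east
    ("B", "D"), ("C", "D"),   # one bridge from each bank to the east
]

def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    count = Counter()
    for v, w in edges:
        count[v] += 1
        count[w] += 1
    return count

def has_eulerian_circuit(edges):
    """Euler's criterion, assuming the multigraph is connected."""
    return all(d % 2 == 0 for d in degrees(edges).values())

print(dict(degrees(BRIDGES)))
print(has_eulerian_circuit(BRIDGES))
```

Every land mass touches an odd number of bridges, so the check fails and no round trip crossing each bridge exactly once exists.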
2.2 Graphs Defined
There are many more problems that can be solved using graphs, but that is not the
focus of this paper. We will be using graphs to represent searchable data on the web.
Before we do that, let’s start with the basic definition of a graph [1].
Definition 1: Graph
G1 = <V, E, L>
A Graph G1 consists of three sets V, E and L.
V is the nonempty set of vertices and can consist of any number of labels from
the set L: V ⊆ L.
E is the set of edges, with each element being an unordered pair of distinct
elements of V: E ⊆ {{v, w} | v ≠ w, v, w ∈ V}.
L is the set of labels. You could leave out L from the graph definition, because
V and L will usually be equivalent. However, for some problems it may be
useful to include more labels in the set L, not all of them used for vertices.
Thus, if e is an edge, then e is a set of the form e = {v, w}, where v and w are different
elements of V. Often such an edge is simply written as vw, which is the same as wv.
With a formal definition of a Graph, we can draw a graphical (hence the name!)
representation of it. Let’s take for example graph G with:
V = {belgium, france, germany, luxembourg, netherlands},
E = {{belgium, france}, {belgium, germany}, {belgium, luxembourg}, {belgium,
netherlands}, {france, germany}, {france, luxembourg}, {germany, luxembourg},
{germany, netherlands}},
L=V
This may look quite confusing, but a simple drawing fixes that:
Figure 3: Europe Graph
What we see is a map of a part of Europe. If a vertex is connected to another vertex
by an edge, the corresponding countries share a border. We can only see positions relative to
each other – it is impossible to tell, for instance, whether it is faster to drive from
the Netherlands to France via Belgium or via Germany.
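As an illustration, the Europe graph can be written down almost literally in Python. This is only a sketch: unordered pairs are modelled as frozensets, and the helper names is_graph and neighbours are our own.

```python
V = {"belgium", "france", "germany", "luxembourg", "netherlands"}
E = {frozenset(pair) for pair in [
    ("belgium", "france"), ("belgium", "germany"), ("belgium", "luxembourg"),
    ("belgium", "netherlands"), ("france", "germany"), ("france", "luxembourg"),
    ("germany", "luxembourg"), ("germany", "netherlands"),
]}
L = set(V)  # L = V, as in the example

def is_graph(vertices, edges, labels):
    """Check Definition 1: V nonempty, V a subset of L, and every edge an
    unordered pair of two distinct elements of V."""
    return (
        len(vertices) > 0
        and vertices <= labels
        and all(len(e) == 2 and e <= vertices for e in edges)
    )

def neighbours(v, edges):
    """All vertices that share an edge with v, in alphabetical order."""
    return sorted(w for e in edges if v in e for w in e if w != v)

print(is_graph(V, E, L))
print(neighbours("netherlands", E))
```

Because an edge is a set, {belgium, france} and {france, belgium} are the same element of E, which matches the remark that vw is the same as wv.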
In general, we can use these very simply defined graphs to represent some
information. Each vertex stands for an object. An edge between two vertices means these
vertices are somehow interrelated – in what way needs to be defined separately.
3. Directed, labeled graphs
3.1 Refining the graph model
The model described above has several limitations, the most blatant being the total lack of
distinction between edge types. In our Europe example, that’s not so bad – every
vertex is a country, which may or may not be bordering another country. But what if
we want to model something a little more complex? Take for example an average
information system about some subject, like a library [8]. It obviously has books as
objects, but we’ll want to store a little more information than just which book is
adjacent to which. A book has an author, publisher, page count, etc. In a library, books
can be available or out on loan to customers. Some information about customers
needs to be stored as well.
All of this is data we need to model in order to search it effectively later. To this
end, we’ll introduce two new concepts in our graph definition G2: directed edges and
labeled edges.
In our first definition we represented an edge simply by an unordered pair of vertices.
For our next definition, we simply cannot do that anymore. Instead, we’ll use an
abstract set E = {e1, e2, e3, …} containing all edges used. A special function called
the incidence function ψ maps edges to ordered pairs of vertices: ψ(e) = <v,w>.
Likewise, vertices are no longer simply labels, since edges may also be labeled. There
will be a labeling function λ that maps vertices and edges to labels: λ(x) = l, where x
can be a vertex v or an edge e.
Instead of a web of equally related objects, we create some structure with objects and
their properties. The library for instance has several properties: a collection of books,
articles and customers, but most likely also a name and address. Vertices will
represent objects, while edges represent properties.
Definition 2: Directed, labeled graph
G2 = <V, E, L, ψ, λ>
V is a set of vertices corresponding to abstract or concrete data items.
E is a set of directed edges.
L is a set of labels describing the nature of relations represented by the
corresponding edges, or giving value to concrete data items.
ψ is the incidence function connecting vertices with edges: E -> V x V
λ is the labeling function assigning labels to edges or vertices: E ∪ V -> L
If ψ(e) = <v, w>, then there is an edge e from v to w, meaning w is a property of v.
The graphical representation is an arrow from v to w. Edge labels describe what kind
of property they represent. Vertex labels give actual value to the property and are
object names. Labels for edges and vertices do not have to be unique (e.g. there can be
more than just one book with an author).
Figure 4: Example Graph “The Library”
3.2 Example: The Library
Let’s look at an example, the library. Libraries have a large collection of books,
papers and magazines, and often provide numerous other services related to
information retrieval. Most modern libraries also have a website which customers can
use to browse through the available collection, or maybe it provides articles only
available online. Thus, the library is a good example for internet description graphs.
Figure 4 shows a very small part of what such a graph could look like. There is a top node
v1 which represents the library. It has several properties, of which “location” and
“collection” are shown. Note that v1 has two outgoing edges, one for “location” and
one for “collection”. Other properties our library could have are customers and
employees. Each of those four properties can have more properties in turn. The library
has a location, consisting of a city and street address. Although not shown in the
example graph, it would probably also have a country, one or more telephone
numbers and perhaps even another location for multi-building libraries.
The collection consists of many books and probably also newspapers, articles and
magazines, each with properties like title, one or more authors, publication date,
version, etc. Every ‘end-node’ in the example graph has a value label: there is a book
called “The Dreamcatcher” written by “Stephen King”, for instance.
3.3 Constraints on the directed labeled graph model
This leads us to a number of constraints to be added to definition 2. The data model
isn’t just some directed, labeled graph; it is a rooted tree of objects and properties.
The top vertex (‘root’) represents the data domain – in the example, the library. All
vertices at the bottom (‘leaves’), which have no outgoing edges, are concrete objects in the
data domain: titles of books, authors, cities, etc. Such a vertex will usually have a
label with its value, although this is not a requirement. You might have a data model
under construction, for example, and not want to complete everything yet.
All vertices in between are abstract objects which do not have labels. Only abstract
objects (including the root) may have properties: edges from this vertex to another.
These edges should always have a label noting what kind of property they represent.
It would be pretty much pointless to have a property without knowing what it is,
wouldn’t it?
Definition 3: Tree Graph based on G2
G3 = <V, E, L, ψ, λ, r>
V is a set of vertices corresponding to abstract or concrete data items.
E is a set of directed edges.
L is a set of labels describing the nature of relations represented by the
corresponding edges, or giving value to concrete data items.
ψ is the incidence function connecting vertices with edges: E -> V x V
λ is the labeling function assigning labels to edges or vertices: E ∪ V -> L
Properties must be named, meaning any edge e must be labeled: ψ(e) =
<v, w> => ∃l∈L [λ(e) = l]
Only leaves may have value labels, meaning any vertex v with a label l
has no outgoing edges e: ∃l∈L [λ(v) = l] => ¬∃e∈E,w∈V [ψ(e) = <v,
w>]
r∈V is the root of the tree.
A root has no incoming edges: ¬∃e∈E,v∈V. [ψ(e) = <v, r>]
There is only one root. All other vertices have exactly one incoming
edge: ∀v∈V-{r} ∃!e∈E,w∈V. [ψ(e) = <w, v>]
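The constraints of definition 3 translate directly into checks on ψ and λ. The sketch below is one possible encoding, not a prescribed implementation: ψ and λ are plain dictionaries (λ being partial, since abstract vertices carry no label), and the function name check_definition_3 and the three-vertex sample are our own. It tests only the constraints listed in the definition; it does not separately verify that every vertex is reachable from the root.

```python
from collections import Counter

def check_definition_3(V, E, psi, lam, r):
    """Return the list of violated constraints (empty when all hold).

    psi maps each edge to an ordered pair (v, w); lam is a dict used as a
    partial function from edges and vertices to labels; r is the root.
    """
    problems = []
    incoming = Counter(psi[e][1] for e in E)   # incoming edges per vertex
    has_outgoing = {psi[e][0] for e in E}      # vertices with outgoing edges

    if any(e not in lam for e in E):
        problems.append("unlabeled edge")
    if any(v in lam and v in has_outgoing for v in V):
        problems.append("labeled vertex with outgoing edge")
    if r not in V or incoming[r] != 0:
        problems.append("invalid root")
    if any(incoming[v] != 1 for v in V if v != r):
        problems.append("non-root vertex without exactly one incoming edge")
    return problems

# A tiny sample: library -> location -> city "Ede".
V = {"v1", "v2", "v3"}
E = {"e1", "e2"}
psi = {"e1": ("v1", "v2"), "e2": ("v2", "v3")}
lam = {"e1": "location", "e2": "city", "v3": "Ede"}

print(check_definition_3(V, E, psi, lam, "v1"))
```

Returning a list of violations rather than a single boolean makes it easier to see which constraint a half-finished data model still breaks.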
3.4 Example definition
Let’s go back to the library example described above, and try to write down the
definition for it. To avoid confusion, we’ll name this graph GL.
GL = <V, E, L, ψ, λ, r>
There are twelve numbered vertices:
V = {v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12}.
The top vertex is the root:
r = v1.
Now we should define the set of edges E. This may be a little tricky. In the picture
above all edges are labeled (“collection”, “book”, etc), but those aren’t the edges!
Like vertices V, edges E are abstract:
E = {e1,e2,e3,e4,e5,e6,e7,e8,e9,e10,e11}.
Knowing that, it is easy to write the set of labels:
L = {“location”, “collection”, “city”, “address”, “book”, “title”, “author”,
“Ede”, “Stadspoort 2”, “The Dreamcatcher”, “Stephen King”, “Dragons of a
Lost Moon”, “Margaret Weis”, “Tracy Hickman”}.
The functions ψ and λ do the most interesting work, actually creating a tree from the
building blocks defined above, and giving it a meaning.
ψ(e1) = <v1,v2>, ψ(e2) = <v1,v3>, ψ(e3) = <v2,v4>, ψ(e4) = <v2,v5>, ψ(e5) =
<v3,v6>, ψ(e6) = <v3,v7>, ψ(e7) = <v6,v8>, ψ(e8) = <v6,v9>, ψ(e9) = <v7,v10>,
ψ(e10) = <v7,v11>, ψ(e11) = <v7,v12>.
λ(e1) = “location”, λ(e2) = “collection”, λ(e3) = “city”, λ(e4) = “address” , λ(e5)
= λ(e6) = “book” , λ(e7) = λ(e9) = “title” , λ(e8) = λ(e10) = λ(e11) = “author”,
λ(v4) = “Ede” , λ(v5) = “Stadspoort 2” , λ(v8) = “The Dreamcatcher” , λ(v9) =
“Stephen King” , λ(v10) = “Dragons of a Lost Moon” , λ(v11) = “Margaret
Weis” , λ(v12) = “Tracy Hickman”.
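To see that ψ and λ really carry all the information, the definition of GL can be transcribed into Python and walked from the root. This is a minimal sketch: the dictionaries copy the values given above, while the helper names properties and books are our own.

```python
psi = {
    "e1": ("v1", "v2"), "e2": ("v1", "v3"), "e3": ("v2", "v4"),
    "e4": ("v2", "v5"), "e5": ("v3", "v6"), "e6": ("v3", "v7"),
    "e7": ("v6", "v8"), "e8": ("v6", "v9"), "e9": ("v7", "v10"),
    "e10": ("v7", "v11"), "e11": ("v7", "v12"),
}
lam = {
    "e1": "location", "e2": "collection", "e3": "city", "e4": "address",
    "e5": "book", "e6": "book", "e7": "title", "e9": "title",
    "e8": "author", "e10": "author", "e11": "author",
    "v4": "Ede", "v5": "Stadspoort 2", "v8": "The Dreamcatcher",
    "v9": "Stephen King", "v10": "Dragons of a Lost Moon",
    "v11": "Margaret Weis", "v12": "Tracy Hickman",
}
r = "v1"

def properties(v, name):
    """Targets of all edges labeled `name` that leave vertex v, in edge order."""
    return [w for e, (src, w) in psi.items() if src == v and lam[e] == name]

def books():
    """Walk root -> collection -> book and read each book's leaf labels."""
    result = []
    for collection in properties(r, "collection"):
        for book in properties(collection, "book"):
            title = lam[properties(book, "title")[0]]
            authors = [lam[a] for a in properties(book, "author")]
            result.append((title, authors))
    return result

for title, authors in books():
    print(title, "by", ", ".join(authors))
```

Note that the abstract vertices v1, v2, v3, v6 and v7 never appear in lam, in line with the constraint that only leaves carry value labels.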
4. Typed graphs
4.1 Semi-structured data models
In the previous chapter we looked at an extension to the graph model described in
chapter 2. It resembles the model P. van Bommel describes in [3], an object graph for
semi-structured data. This method is well suited for modelling web-based resources,
as this author has done in [4]. P. van Bommel shows various ways to analyze object
graphs for semi-structured data. These may be interesting to incorporate in our
models, but that is beyond the scope of this article.
4.2 Why typing?
However, there are still some ways to further extend this model. In [2], G.
Karvounarakis et al. show how a data model can consist of several graphs: one object
graph and one or more type graphs. The latter are used to present only relevant
information to specific user groups. For instance, in the library, customers will be
mainly interested in the library’s collection, while the system administrator needs to
know file sizes and metadata. The article has an excellent example graph for an art
portal website.
The article’s example describes a portal website for various art museums. These museums
exhibit many different art objects, from paintings to sculptures and ancient Incan
jewelry. We can make an object graph as normal, with vertices for each object and for
their properties. Apart from that, the graph also displays where art objects are
physically displayed and on what web resource they appear. Web resources have
various metadata like file size, file name and last modified date.
The example also lists two type graphs: one for visitors (who are generally interested
in the art itself) and one for system administrators (who are generally interested in
meta data). Type graphs are related to the object graph. We’ll see how that works in
the next section, expanding the library example.
Typed graphs are also a good control mechanism for system administrators or
programmers: you can more easily see if something is wrong. No more “Stephen
King” as a book’s publishing date!
4.3 Typing the library
Figure 5: Expanded Library graph
In figure 5, we have changed the library graph to better work with typing. There is
one important change from the previous model: it is no longer rooted. It is now a web
of information, with multiple dependencies between objects.
In the example, we see a book called “The Hobbit”, of which the “Stadspoort” library
department has two copies. One is currently available, but the other is out with a
customer. Another book, “Lord of the Rings”, can be found at two different
departments: “Bennekom” and “Centrale.” Both books are written by the same author.
This change isn’t strictly necessary - it is possible, although visually unattractive, to
incorporate a hierarchy and make it a rooted tree again. These steps may be taken in a
later model.
The example also lists a CD by “Enya”, with a new property called “Duration.” Books
never have a duration (although they in turn have a page count, not listed in this small
example). Apparently there is a difference between these objects. We will show this
in a type graph.
For this, the object graph from figure 5 should change a little more. Currently we
show properties like “Title”, “Status” and “Location.” These are actually the types of
the end nodes of each edge. Each edge should instead carry a label that describes the
relation between the two vertices. For instance, where we previously had an edge called
“Author”, we can instead use “writes” or “is-written-by”, depending on which way the edge
is directed. The figure above did not use directed edges for simplicity’s sake. In the
example that is not confusing, but in other cases it surely will be!
Figure 6: Library Object graph
Figure 7: Library Type graph
Figures 6 and 7 together form our newly made ‘graph’. In this example the relation
between the two graphs should be obvious. Otherwise the two need to be connected
somehow, making a single, rather messy, graph. The next chapter will discuss ways to
do this.
The object graph displayed as figure 6 should be pretty much self-explanatory when
related to figure 5. We have basically only changed the edge names. Figure 7 is a bit
more interesting. Every vertex has a type - its name starts with a capital. Most
‘end-nodes’ or ‘leaves’ of the web are of type String, with the only exception being “Time”
for duration. These types could conceivably be refined more, including expanding the
graph itself. For example, each department is located in some city. Now who says that
city must be denoted by a string only? We could link our system with another
database containing information about the city - possibly even the city’s website
itself. Linking such data is exactly why standardization is so important, as briefly
discussed in [2]. Chapter 5 of this paper goes a little further into this subject.
The example uses only one type graph, covering all data in the object graph. As
discussed before, there can be more than one type graph, usually for various user
types of the system. Each of these type graphs will always be a subgraph of the total type
graph depicted in figure 7.
4.4 Definitions
Definition 4: Typed graphs
The model consists of two or more graphs. One is the object graph, similar to
the one in definition 2. All the others are type graphs.
G4 = <Go, Gt1, ..., Gtn, τ>
Go is the object graph, defined as follows:
Go = <Vo, Eo, Lo, ψo, λo>, similar to definition 2
Gt1, ..., Gtn are type graphs
Gti = <Vti, Eti, Lti, ψti, λti>, for every 1 <= i <= n
τ is the typing relation connecting object vertices to type vertices. This cannot
be a function, because an object vertex can be connected to multiple type
vertices; one in every type graph. However, we could split the τ-relation into
multiple functions: τ1, ..., τn. Each τi can be a partial function.
τi(x) = t, where x ∈ Vo and t ∈ Vti.
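A small sketch may make the split of the typing relation into partial functions concrete. Everything named here is an assumption for illustration: the object vertex names, the two type graphs "visitor" and "admin", their type vertices, and the name tau for the typing relation. Each partial function is stored as a dictionary, so an object vertex that is absent from a dictionary simply has no type in that type graph.

```python
# Object vertices (hypothetical names, loosely following the library example).
Vo = {"hobbit", "hobbit_title", "enya_cd", "enya_duration"}

# Vertex sets of two type graphs: one for visitors, one for administrators.
Vt = {
    "visitor": {"Book", "CD", "String", "Time"},
    "admin": {"Resource"},
}

# One partial typing function per type graph.
tau = {
    "visitor": {"hobbit": "Book", "hobbit_title": "String",
                "enya_cd": "CD", "enya_duration": "Time"},
    "admin": {"hobbit": "Resource", "enya_cd": "Resource"},
}

def is_well_typed(Vo, Vt, tau):
    """Each tau_i must map object vertices to vertices of its own type graph."""
    return all(
        x in Vo and t in Vt[i]
        for i, tau_i in tau.items()
        for x, t in tau_i.items()
    )

def types_of(x):
    """All (type graph, type) pairs for x: the typing relation seen as a set."""
    return {(i, tau_i[x]) for i, tau_i in tau.items() if x in tau_i}

print(is_well_typed(Vo, Vt, tau))
print(types_of("hobbit"))
```

The vertex "hobbit" has one type per type graph, which is exactly why the combined relation cannot be a single function.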
5. Further ways to improve the model
There are various ways to improve this model. We will discuss expanded typing,
query design and graph optimization.
5.1 Expanded typing
5.1.1 Subtyping
The current model allows us to define a type for every object. A book has a title
which is a String. A department is in a city which is also a String. It is very well
possible for a library in “Ede” to own a book called “Ede.” But are these the same?
Not quite. Depending on the context of the information system, it may be desirable to
further define subtypes. This way we can limit what labels go with what vertex.
Remember that vertices are already typed this way: the label of the incoming edge and the
type of the vertex together define its type. When an information system is edited or updated a lot
by various users, one can limit the types of input accepted by type checking. Instead
of hardcoding it into the user interface, it can be a part of the data model.
To implement this, the set of labels L is divided into several subsets; one for each
type, and every type is further divided into subtypes: Ltype,subtype. Every element of L
must be in at least one Ltype,subtype. Every element of every Ltype,subtype must be in L.
Type corresponds to the vertex’s type, such as String or Time. Subtype corresponds to
the edge’s label, such as city or duration.
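The division of L into subsets can be sketched as a dictionary keyed by (type, subtype). The label values, including the duration string, and the helper names covers and accepts are illustrative assumptions, not part of the model itself.

```python
# The full label set L (sample values only).
L = {"Ede", "Bennekom", "The Hobbit", "01:02:03"}

# L_sub[(type, subtype)]: type is the vertex type, subtype is the edge label.
L_sub = {
    ("String", "city"): {"Ede", "Bennekom"},
    ("String", "title"): {"The Hobbit", "Ede"},
    ("Time", "duration"): {"01:02:03"},
}

def covers(L, L_sub):
    """Both conditions from the text at once: every label of L is in some
    subset, and every subset contains only labels of L."""
    return set().union(*L_sub.values()) == L

def accepts(vertex_type, edge_label, label):
    """Type check: may `label` be used on a vertex of this type and subtype?"""
    return label in L_sub.get((vertex_type, edge_label), set())

print(covers(L, L_sub))
print(accepts("String", "title", "Ede"))
print(accepts("String", "city", "The Hobbit"))
```

The label "Ede" sits in two subsets, once as a city and once as a book title, so the two uses can be told apart even though the strings are equal.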
5.1.2 Supertyping
The model described in chapter 4 consists of two or more separate non-rooted graphs.
It is possible to change our model to make it one rooted graph. The RDF Schema
specification [6] incorporates this by adding metatypes, serving as supertypes for the
objects in the graphs, and introducing diverse edge types. Although we do not want to
copy RDF, we can still add these concepts to the definition.
The first step is to add a class hierarchy, similar to what object-oriented programming
languages like Java [7] use, to the complete type graph. In the library example, there
are several artifacts available for lending: books, papers, CDs, comics, etc. Figure 8
shows a way to organize all the different artifacts available at the library into a
hierarchy.
Figure 8: Library type hierarchy
Add vertices to the complete type graph for every relevant class identified. Next, we
define a new type of edge to connect these vertices. This new edge means that one
vertex is a subtype of another vertex (or the latter is a supertype of the former). Just as
vertices can be subtypes of other vertices, edges can be subtypes of other edges. For
instance, an edge ‘author’ and an edge ‘composer’ are both subtypes of ‘artist’. We
can introduce a third type of edge to graphically represent such relations.
Now all vertices in the object graph can be connected to the type graph, using the
typing relation τ to determine which should be connected to which. The relation τ is
replaced in the definition by the new set of edges E.
The final step is to add a root like ‘Object’ or ‘Library’ to the graph, as was used in
definition 3. Now we have a directed, labeled, typed and rooted graph.
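The subtype edges can be sketched as a child-to-parent mapping. The hierarchy below is hypothetical, since it does not reproduce figure 8; only the idea of artifacts such as books, papers, CDs and comics, and of ‘author’ being a kind of ‘artist’, comes from the text.

```python
# child -> parent; an assumed hierarchy for illustration only.
SUPERTYPE = {
    "Book": "Artifact", "Paper": "Artifact", "CD": "Artifact",
    "Comic": "Book", "Artifact": "Object",
    # Edge labels can be subtyped in the same way as vertex types.
    "author": "artist", "composer": "artist",
}

def supertypes(t):
    """All proper supertypes of t, nearest first."""
    chain = []
    while t in SUPERTYPE:
        t = SUPERTYPE[t]
        chain.append(t)
    return chain

def is_subtype(t, u):
    """True when t equals u or u is reachable from t via subtype edges."""
    return t == u or u in supertypes(t)

print(supertypes("Comic"))
print(is_subtype("author", "artist"))
```

With a single parent per type, following the subtype edges from any vertex type ends at the root ‘Object’, which is what makes the combined graph rooted again.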
5.2 Query design
In his “Elementary coordination and modelling of semistructured databases” [3], P.
van Bommel describes many mathematical tools for working with semistructured
data. One interesting tool is path expressions. With proper typing in place, like in
definition 4, a path expression is a powerful way of extracting information from the
data model and representing the data graphically to the user. The library’s visitors
often consult a digital catalogue, sometimes on site and sometimes from the web, to
find objects interesting to them. They now do this by searching on title, author,
subject or keyword. Path expressions give them an understandable way to use more
refined queries, perhaps even with natural language.
Figure 9: "Tolkien writes Hobbit" path expression
Figure 9 shows the subgraph returned when querying “Tolkien writes Hobbit.”
Because labels are properly typed, the system recognizes “Tolkien” as part of the
author J.R.R. Tolkien, “Hobbit” as part of the book title “The Hobbit” and “writes” as
a property label. It matches the path expression with the graph and returns figure 9.
The graphical user interface (GUI) can then use this information to tell the visitor
“Library department Stadspoort has two copies of this book, and one is available.” It
could then give instructions of where exactly the book can be found (not included in
example graph in this paper).
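A very reduced version of such matching can be sketched as follows. This is not the path expression formalism of [3], only an assumed simplification: the object graph is flattened to (source label, edge label, target label) triples, the edge label "has-copy-of" is hypothetical, and a query is limited to three words.

```python
# (source label, edge label, target label); an assumed slice of the
# library object graph.
EDGES = [
    ("J.R.R. Tolkien", "writes", "The Hobbit"),
    ("J.R.R. Tolkien", "writes", "Lord of the Rings"),
    ("Stadspoort", "has-copy-of", "The Hobbit"),
    ("Bennekom", "has-copy-of", "Lord of the Rings"),
]

def match(query):
    """Match a three-word query 'subject property object' against EDGES.

    The subject and object words match any vertex label that contains them;
    the property word must equal the edge label exactly.
    """
    subject, prop, obj = query.split()
    return [
        (s, p, o) for s, p, o in EDGES
        if subject in s and p == prop and obj in o
    ]

print(match("Tolkien writes Hobbit"))
```

A fuller version would also return the edges adjacent to the match, such as the copies held by each department, so that the interface can report availability.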
A lot more can be said about this subject, but it is beyond the scope of this paper. G.
Karvounarakis et al. [2] specifically propose a querying language for RDF. P. van
Bommel [3] elaborates on path expressions.
5.3 Graph optimization
Lastly, we will look at ways to optimize and use the graphical representation of our
model. Section 5.1.2 already covers combining the multiple graphs into one single tree
representation. This might not be the most convenient way to show data to the user,
though. With all the added edges and vertices for supertypes, it quickly starts looking
chaotic - and chaos is what we were trying to avoid in the first place! That is why one
should probably never show the complete graph at once to a single user.
Instead, show only what is necessary, either by user type or query. Figure 9’s path
expression illustrates this well. It shows information directly related to the visitor’s
query. He or she might want to know more about the subject or author, and can do
this by formulating a new query, either by hand or through the user interface. Figuring
out what an information system’s GUI should show to the user is an entirely separate
subject, only barely related to Computing Science.
6. Conclusions
We have discussed the evolution of the web and its dissolution into chaos. In four
steps, we have built a model using graph theory to describe internet data. Or actually,
we have just laid down the basics - the paper does not discuss extracting such data
from real web pages, whether manually or automatically, nor actual implementation
of a system based on this graph theory.
The article [2] builds further on the model by designing a query language for RDF
description bases. We leave that exploration to other researchers and limit ourselves to
this exercise.
While at first this article started as a straightforward exercise of building and
expanding models upon one another, it has also touched many other subjects worth
exploring. We have seen problem-solving through graphical representation,
semistructured data models, path expressions, querying and information retrieval,
typing, object-oriented design, and even human-machine interfaces. We haven’t even
touched representing these data models as a real database, with time and space
constraints for real applications.
There is one subject sticking out of the rest, one that this author would like to briefly
note in the conclusions section: linking multiple graphs for querying search engines.
As it stands now, we have a data model that allows us to search information of a
limited scope. But why should it be limited to that? Our library might get a visitor
looking for a certain book, which happens to be available in Ede... and then he wants
to know more about the city Ede. The library’s information system probably does not
contain anything about this matter, but the city’s own website likely does. Thus, it
should be possible to link these systems, even for cross-system queries (“I am looking
for a pizza restaurant with a view on the Dutch mills”). This is an exciting subject and
should be possible in the not-too-distant future.
7. References
[1] E.G. Goodaire, M.M. Parmenter; Discrete Mathematics with Graph Theory;
1998 Prentice Hall; ISBN 0-13-602079-8
[2] G. Karvounarakis, A. Magkanaraki, S. Alexaki, V. Christophides, D.
Plexousakis, M. Scholl, K. Tolle; Querying the Semantic Web with RQL; in:
Computer Networks 42 (2003), pp 617-640
[3] P. van Bommel; Elementary coordination and modelling of semistructured
databases; University of Nijmegen 04-28-2004
[4] A.R. Kakebeen; Semi-gestructureerde data: websites (in Dutch); University of
Nijmegen 06-12-2002
[5] T. Berners-Lee, J. Hendler, O. Lassila; The Semantic Web; Scientific
American, 2001
[6] D. Brickley, R.V. Guha; Resource Description Framework (RDF) Schema
Specification 1.0; W3C Candidate Recommendation, Technical report, 2000
[7] T. Budd; Understanding Object-Oriented Programming with Java; 2000
Addison-Wesley; ISBN 0-201-61273-9
[8] http://www.bibl-ede.nl; De Openbare Bibliotheek Ede (in Dutch)
