InterDataNet Naming System: a Scalable Architecture
for Managing URIs of Heterogeneous and Distributed
Data with Rich Semantics
Davide Chini, Franco Pirri, Maria Chiara Pettenati, Samuele Innocenti, Lucia Ciofi
Electronics and Telecommunications Department University of Florence Via Santa Marta, 3
50139 Florence, Italy
Abstract. Establishing equivalence links between (semantic) resources, as is
the case in the Linked Data approach, implies a permanent search, analysis and
alignment of new (semantic) data in a rapidly changing environment. Moreover,
the distributed management of data brings non-negligible requirements
regarding their authorship, update, versioning and replica management. Instead of
providing solutions for the above issues at the application level, our approach
relies on the adoption of a common layered infrastructure: InterDataNet (IDN).
The core of the IDN architecture is the Naming System, aimed at providing a
scalable and open service to support consistent reuse of entities and their
identifiers, enabling a global reference and addressing mechanism for
convenient retrieval of resources. The IDN architecture also provides basic
collaboration-oriented functions for (semantic) data, featuring authorship
control, versioning and replica management through its stack layers.
Keywords: interoperability, infrastructure, architecture, scalability, naming
system, URI resolution, Web of Data, collaboration
1. Introduction
The main vision of the future Web takes the Semantic Web as its final goal: a “global
space for the seamless integration of knowledge bases into a global, open,
decentralized and scalable knowledge space” (Hellman, 2009a). However, it has been
understood that the realization of the Semantic Web requires a preliminary step: the
so-called Web of Data (Hendler et al., 2008).
Within the context of the Web of Data, the creation, access, integration, and
dissemination of (semantic) data are pivotal. In recent times, Linked Data, “an
emerging meme deeply rooted in Web architecture, has emerged as a viable and
powerful vehicle for applying the essence of the Web (URIs)” (Idehen, 2009) to the
pursuit of the availability of a large amount of semantic data for building Web-wide
semantic applications. Linked Data is thus a way of publishing data in the direction
of the Web of Data where a great importance has been given to the concept of
resource identification.
However, several issues are still open in the realization of a Web of Data/Semantic
Web. These issues stem primarily from the well-known problem of co-reference.
Co-reference on the Semantic Web can occur in two ways: the first is when a single URI
identifies more than one resource and the second is when multiple URIs identify the
same resource. Both situations occur frequently in Linked Data applications
(Jaffri, 2008). The URI disambiguation solutions currently adopted within the Linked
Data community rely heavily on an "ex-post approach": establishing links between
resources that are considered “equivalent”. More specifically, an owl:sameAs
statement is created between the different URIs denoting the same entity. Indeed,
owl:sameAs interlinking leads to the creation of an unconstrained graph of URIs,
because when a new link is created, only a partial view of the
pre-existing graph of URIs may be available.
Such an approach entails two main unwanted consequences:
1) in a highly dynamic and rapidly growing environment, the permanent
search, analysis and alignment of new data is an extremely hard task;
2) data management and/or reasoning in a distributed environment that contains
owl:sameAs relations is a non-horizontally-scalable task, because of its computational
complexity (Bouquet, 2008). This is one of the open issues which delay the shift from
many “local” semantic webs to one “global” Semantic Web.
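As a rough illustration of the ex-post approach, the following Python sketch groups URIs connected by owl:sameAs statements into equivalence classes with a union-find structure. The URIs and the function name are hypothetical and belong neither to IDN nor to any Linked Data toolkit. The sketch only shows that each newly discovered link can merge groups that were previously separate, so the alignment has to be recomputed as new data arrives.

```python
from collections import defaultdict


def same_as_classes(links):
    """Group URIs into the equivalence classes implied by owl:sameAs links."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        parent[find(left)] = find(right)

    classes = defaultdict(set)
    for uri in list(parent):
        classes[find(uri)].add(uri)
    return sorted(sorted(group) for group in classes.values())


# Hypothetical owl:sameAs statements published by independent sources.
links = [
    ("http://a.example.org/Berlin", "http://b.example.org/berlin"),
    ("http://c.example.org/city/1", "http://d.example.org/Berlino"),
]
before = same_as_classes(links)

# One additional statement merges the two previously separate groups.
bridge = ("http://b.example.org/berlin", "http://c.example.org/city/1")
after = same_as_classes(links + [bridge])
```

Here `before` contains two groups of two URIs each, while `after` contains a single group of four.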
Starting from these assumptions, the InterDataNet (IDN) architecture presented in this
work, which follows an original path of research within the context of the Web of
Data, offers some features to help the development of the future Semantic
Web. The IDN infrastructure as a whole fulfils two main functions:
1) providing a scalable and open service to support a consistent reuse of entities
and their identifiers, that is a global reference and addressing mechanism for
locating and retrieving resources in a collaborative environment;
2) providing basic collaboration-oriented functions, namely authorship control,
versioning and replica management.
Just as TCP/IP and layered internetworking solutions allowed the Web of Documents to
come true, the Future Internet vision, in which data tend to be active and smart
entities supporting applications living in the network and contents are generated by
end-users, would be realized more easily, and a huge graph of interlinked data would
be integrated much faster, if we could count on an "interdataworking" infrastructure.
We define "interdataworking" as the ability to create, connect, distribute, integrate
and query data across different sources on a web-wide scale.
In this paper we present InterDataNet (Pettenati, Innocenti, Chini, Parlanti and
Pirri, 2008), (Innocenti, 2008), an infrastructural solution supporting a decentralized
and scalable publication space for the Web of Data. IDN sustains global
addressability of concepts and resources as well as basic collaboration-oriented
services (authorship control, versioning and replica management) for distributed and
heterogeneous (semantic) data management, thus allowing the needed consistent reuse
and mapping of entity identifiers. The IDN layered middleware aims to provide an
architectural solution in the direction of an interdataworking vision.
The IDN framework
To obtain a scalable linked data system we first of all have to provide a shared
Information Model (Prass, 2001) to enable data interoperability. We observe that an
Information Model is effective when it is supported by a reference Service
Architecture that handles it with global data addressability. We have designed this
approach as a service-oriented middleware named IDN (InterDataNet). The adopted
approach distributes the information properties and characteristics over layers
that address their representations at different levels of abstraction. Each layer
is assigned a basic service task for data handling and linking.
Layering is the architectural pattern to pursue scalability and legacy data
integration (Avgeriou, 2005) at infrastructural level, designing an open integrated
environment to distribute and to enrich knowledge around data (Melnik, 2000).
Analogously to the Web-style approach, we pursue a "good-enough" solution to this
problem because it is at present the only way to obtain scalability in a Web-wide
scenario. IDN exposes an API set to transparently facilitate data handling at a higher
level. We represent the information in layers, from a physical view (at the IDN
bottom layer) to a logical-abstract one (at the IDN top layer).
We hence use the following set of conceptual and technological design paradigms:
• the design of a layered (Zweben, 1995) middleware, following a service-oriented
  architecture (SOA) approach (OASIS, 2006); this will allow us to develop
  loosely coupled and interoperable services which can be combined into more
  complex systems;
• the use of REST-style (Representational State Transfer) services, to make
  InterDataNet an explicitly resource-centric infrastructure. As a consequence,
  IDN aims to be fully compliant with the following architectural requirements
  (Richardson, 2007):
    ◦ communication should be stateless: each request must contain
      all the information required for it to be completely understood;
    ◦ resources have to be cacheable;
    ◦ the system has to expose a uniform interface. In other
      words, each resource has to be globally addressable through URIs
      and:
        ▪ the system handles resources through their representations
          (resources are logical entities, whereas representations are
          physical descriptions of them; each resource can have one
          or more representations and is decoupled from them);
        ▪ messages handled by the system are self-descriptive
          because they contain meta-data (meta-data can be about
          the connection, such as authentication data, about the
          resource representations, such as their content type, and so
          on);
        ▪ resource representations can contain links to browse
          through the application states (for example, a request which
          creates a resource should return a link to a representation
          of that resource);
    ◦ finally, the system has to be layered.
The IDN framework is described through the ensemble of concepts, models and
technologies pertaining to the following two views.
IDN-IM (InterDataNet Information Model). It is the shared information model
representing a generic document model which is independent from specific contexts
and technologies. It defines the requirements, desirable properties, principles and
structure of the document to be managed by IDN.
IDN-SA (InterDataNet Service Architecture). It is the architectural layered
model handling IDN-IM documents (it manages the IDN-IM concrete instances
allowing the users to “act” on pieces of information and documents). The IDN-SA
implements the reference functionalities defining subsystems, protocols and interfaces
for IDN document collaborative management. The IDN-SA exposes an IDN-API
(Application Programming Interface) on top of which IDN-compliant
Applications can be developed.
The IDN reference Information Model
An Information Model can be defined as a universal representation of the entities
in a managed environment, together with their properties, operations and relationships. It
is independent from any specific repository, application, protocol or platform (Prass,
2001). The adoption of an Information Model thus implies the capability to support a
number of concrete Data Models. This capability enables scalability and adaptability
of the model in different contexts. Generic information modeled in IDN is formalized
as an aggregation of elementary data units, named Primitive Information Units (PIUs).
Each Primitive Information Unit contains generic data and metadata (see figure 1a); at
a formal level, a Primitive Information Unit is a node in a directed acyclic graph
(DAG) (see figure 1b). It is worth recalling that a (rooted) tree structure is a specific
case of DAG in which each node has at most one parent.
All data and metadata are handled, or simply stored, by the Service Architecture.
An IDN-document structures information units and is composed of nodes related to
each other through directed “links”. Moreover, IDN-documents can be inter-linked, so
two main link types are defined in the Information Model:
▪ aggregation links, to express relations among nodes inside an IDN-document;
▪ reference links, to express relations between distinct IDN-documents.
Fig. 1. Example of IDN-IM primitive information units and documents
Each PIU belonging to the document can also be addressed as a document root
node, increasing information granularity and reuse.
IDN-IM documents express data contents and relations between contents. These
information elements are structured and specialized inside each node, complying with a
formal XML schema description. Data and metadata are structured following the
name-value representation and embedded inside the node. The IDN architecture can hand
IDN-IM documents to higher-level applications not only in the specific IDN format but also
in RDF format, to offer full compatibility with Semantic Web applications.
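The information model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `PIU`, its field names and the helper functions are hypothetical and do not reproduce the IDN XML schema. The node name `miller_mail` is borrowed from the example URI used later in the paper, and every other value is invented.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so nodes can be stored in sets
class PIU:
    """Primitive Information Unit: a node carrying name-value data and metadata."""
    name: str
    data: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    aggregation: list = field(default_factory=list)  # child PIUs of this node
    references: list = field(default_factory=list)   # URIs of other documents


def document_nodes(root):
    """Return every PIU reachable from root through aggregation links."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(node.aggregation)
    return seen


def add_aggregation(parent, child):
    """Add an aggregation link, refusing any link that would close a cycle."""
    if parent in document_nodes(child):
        raise ValueError("aggregation links must keep the graph acyclic")
    parent.aggregation.append(child)


person = PIU("person", metadata={"author": "alice"})
mail = PIU("miller_mail", data={"value": "miller@example.com"})
address = PIU("address")
add_aggregation(person, mail)
add_aggregation(person, address)

# In a DAG the same PIU may be aggregated by more than one parent.
contacts = PIU("contacts")
add_aggregation(contacts, mail)

# A reference link points to a node of a distinct document.
person.references.append("http://idn-nodes.example.com/nodes/other_document")
```

Calling `document_nodes(mail)` treats the `mail` node as the root of its own (single-node) document, which mirrors the remark that each PIU can also be addressed as a document root.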
The three-layer IDN Naming System
In accordance with the Linked Data approach, the IDN naming system adopts a URI-based
naming convention to address IDN-nodes (Pettenati et al., 2008). The IDN architecture
envisages a three-layer naming system (see figure 2):
▪ in the upper layer, Logical Resource Identifiers (LRI) are used to allow
IDN-applications to identify IDN-nodes. Each IDN-node can be referred to through a
globally unique canonical name and one or more "aliases";
▪ in the second layer, Persistent Resource Identifiers (PRI) are used to
unambiguously, univocally and persistently identify the resources within
the IDN-middleware environment, independently of their physical locations;
▪ in the lower layer, Uniform Resource Locators (URL) are used to identify resource
replicas as well as to access them. Each resource can be replicated many times and
therefore many URLs will correspond to one PRI.
Resolution processes are required to access a resource starting from its canonical
name or from an alias. As LRIs, PRIs and URLs are sub-classes of URIs, they are
hierarchical and their direct and inverse resolution is possible using the DNS (Domain
Name System) (Mockapetris, 1987) and a REST-based approach.
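A minimal sketch of the three naming layers follows, using plain Python dictionaries in place of the distributed services. All identifiers are assumptions made for illustration: the alias URI, the `urn:idn:example:` PRI syntax and the storage hosts are invented, since the paper does not specify a concrete PRI format.

```python
CANONICAL = "http://idn-nodes.example.com/nodes/miller_mail"

# Upper layer: aliases point to a canonical LRI (hypothetical alias).
ALIASES = {"http://idn-nodes.example.com/nodes/miller_email": CANONICAL}

# Second layer: each canonical LRI maps to one persistent identifier.
LRI_TO_PRI = {CANONICAL: "urn:idn:example:0001"}

# Lower layer: one PRI corresponds to many replica URLs.
PRI_TO_URLS = {
    "urn:idn:example:0001": [
        "http://storage-1.example.com/r/0001",
        "http://storage-2.example.com/r/0001",
    ]
}


def resolve(lri):
    """Direct resolution: LRI (canonical name or alias) -> PRI -> replica URLs."""
    canonical = ALIASES.get(lri, lri)
    pri = LRI_TO_PRI[canonical]
    return pri, PRI_TO_URLS[pri]


def inverse_resolve(url):
    """Inverse resolution: replica URL -> PRI -> canonical LRI."""
    pri = next(p for p, urls in PRI_TO_URLS.items() if url in urls)
    lri = next(name for name, p in LRI_TO_PRI.items() if p == pri)
    return pri, lri


pri, urls = resolve("http://idn-nodes.example.com/nodes/miller_email")
back = inverse_resolve(urls[1])
```

The one-to-many relation between a PRI and its URLs is what lets replicas move or multiply without changing the names that applications use.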
Fig. 2. Three layers IDN naming system
The sequence of events involved in the resolution process is detailed as
follows:
▪ a generic application needs to fetch a resource and sends a GET request to its URI,
for example: http://idn-nodes.example.com/nodes/miller_mail
▪ at a lower level, the operating system running the application is in charge of
resolving “idn-nodes.example.com” through DNS (the application ignores this
step and, theoretically, the whole IDN system can ignore it as well) and provides
the application with a TCP connection to the resolved host;
▪ as soon as the connection is available, the application sends the GET operation to
the IDN stack upper layer (VR, Virtual Repository, see the IDN Service Architecture
section), which is authoritative on the whole name. As the host is authoritative over
the name, it can access the whole metadata set related to this name. This
mechanism is highly scalable; indeed, it is possible to replicate the hostname at
DNS level and split the computational load among different servers and/or
to use reverse proxies to spread this interaction over more servers in a
hierarchical way;
▪ the IDN system (specifically the VR instance to which the application is connected)
hides the PRI name from the application, continuing the process (next steps)
autonomously;
▪ hence, the VR instance performs a GET operation using the PRI on the authoritative
host of the PRI itself (an instance of IH/RM/LS, also described in the IDN Service
Architecture section), in which the associations “PRI → URLs” are stored;
▪ the IDN stack central layers (described in the IDN Service Architecture section) handle,
as needed, the node versioning (Information History layer) and replication
(Replica Management layer) to access the IDN stack lower layer, the Storage
Interface (described in the IDN Service Architecture section), and bring back the
requested information;
▪ the instance of the IDN stack central layers provides the response to the VR layer;
▪ the VR instance provides the response to the application.
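The steps above can be mimicked with four small in-memory classes, one per layer. This is a sketch under strong assumptions and not the IDN implementation: the DNS lookup and the HTTP exchange are left out, the class interfaces, the versioned PRI strings and the storage URLs are invented, and the choice of the first available replica is just one possible policy.

```python
class StorageInterface:
    """Lower layer: gives access to the content behind a replica URL."""

    def __init__(self, contents):
        self.contents = contents  # URL -> stored representation

    def get(self, url):
        return self.contents[url]


class ReplicaManagement:
    """Holds the PRI -> URLs associations and picks an available replica."""

    def __init__(self, storage, replicas):
        self.storage, self.replicas = storage, replicas

    def get(self, pri):
        for url in self.replicas[pri]:
            if url in self.storage.contents:
                return self.storage.get(url)
        raise LookupError(f"no replica available for {pri}")


class InformationHistory:
    """Selects a version: the latest one unless a position is given."""

    def __init__(self, replica_management, history):
        self.rm, self.history = replica_management, history

    def get(self, pri, version=None):
        versions = self.history[pri]  # time-ordered list of versioned PRIs
        chosen = versions[-1] if version is None else versions[version]
        return self.rm.get(chosen)


class VirtualRepository:
    """Upper layer: authoritative on names; the PRI never reaches the caller."""

    def __init__(self, information_history, names):
        self.ih, self.names = information_history, names  # LRI -> PRI

    def get(self, lri, version=None):
        return self.ih.get(self.names[lri], version)


si = StorageInterface({
    "http://storage-1.example.com/r/0001-v1": "miller@old.example.com",
    "http://storage-2.example.com/r/0001-v2": "miller@example.com",
})
rm = ReplicaManagement(si, {
    "urn:idn:example:0001:v1": ["http://storage-1.example.com/r/0001-v1"],
    "urn:idn:example:0001:v2": [
        "http://storage-9.example.com/r/0001-v2",  # replica not reachable
        "http://storage-2.example.com/r/0001-v2",
    ],
})
ih = InformationHistory(rm, {
    "urn:idn:example:0001": ["urn:idn:example:0001:v1", "urn:idn:example:0001:v2"],
})
vr = VirtualRepository(ih, {
    "http://idn-nodes.example.com/nodes/miller_mail": "urn:idn:example:0001",
})

latest = vr.get("http://idn-nodes.example.com/nodes/miller_mail")
first = vr.get("http://idn-nodes.example.com/nodes/miller_mail", version=0)
```

The application only ever calls `vr.get` with a logical name; version selection, replica selection and storage access stay inside the stack.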
When a node name has to be added to the naming system, the
architecture proceeds as follows: once the name to be added has been chosen, the
name itself indicates which server (namely the authoritative one) has to be contacted
to add that name. Then, if the requestor has the rights to perform the operations involved in
the process, a new entry is created in the local name server. A PUT operation is
used when the client chooses the new name (or updates the data connected to
an already defined one), whereas a POST operation is used when the client does not choose
the new name but asks the architecture to provide it.
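The difference between the two operations can be sketched with a toy name server. The `NameServer` class, its methods, the `/nodes/` path layout and the use of a random UUID for server-chosen names are all assumptions made for illustration; the paper does not define this interface.

```python
import uuid
from urllib.parse import urlsplit


def authoritative_host(name_uri):
    """The name itself tells which server has to be contacted."""
    return urlsplit(name_uri).hostname


class NameServer:
    """Toy authoritative name server for a single host (hypothetical)."""

    def __init__(self, host, writers):
        self.host, self.writers, self.entries = host, set(writers), {}

    def _check(self, requestor):
        if requestor not in self.writers:
            raise PermissionError(f"{requestor} may not add names on {self.host}")

    def put(self, requestor, name, data):
        """PUT: the client chooses the name, or updates an existing entry."""
        self._check(requestor)
        created = name not in self.entries
        self.entries[name] = data
        return f"http://{self.host}/nodes/{name}", created

    def post(self, requestor, data):
        """POST: the server chooses the name and returns it to the client."""
        self._check(requestor)
        name = uuid.uuid4().hex
        self.entries[name] = data
        return f"http://{self.host}/nodes/{name}"


server = NameServer("idn-nodes.example.com", writers={"alice"})
uri, created = server.put("alice", "miller_mail", {"value": "miller@example.com"})
_, created_again = server.put("alice", "miller_mail", {"value": "miller@example.org"})
assigned = server.post("alice", {"value": "anonymous note"})
```

Repeating the same PUT leaves a single entry, whereas every POST yields a new server-chosen name, which is the usual REST distinction between the two verbs.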
The IDN Service Architecture
The IDN-SA provides an effective and efficient infrastructural solution for IDN-IM
implementation. IDN-SA is a layered service-oriented architecture and it is
composed of four layers (see figure 3, left side, from bottom to top): Storage Interface
Layer; Replica Management Layer; Information History Layer; Virtual Repository
Layer. The IDN-compliant Application is built on top of the Virtual Repository layer,
which exposes the IDN APIs. The functions of the IDN-SA layers are briefly specified
hereafter, starting from the bottom of the stack. For the sake of brevity, in this
section we will not go into detail on two aspects: versioning and replica
management. Their integration in the IDN architecture is fundamental in order to
provide collaboration-enabling functions, but their detailed description goes beyond
the scope of the present paper.
Storage Interface Layer (SI); this layer provides a REST-like uniform view over
distributed data independently of their location and physical storage platform. This
layer is ultimately devoted to providing physical addressability to resources through
URL addresses (see figure 3, bottom-right side).
Replica Management Layer (RM); this layer provides a delocalized view of the
resources to the upper layer, offering PRI (Persistent Resource Identifier, used
here to identify resources) to URL address resolution through a service called LS
(Localization Service). This layer is in charge of handling the set of physical resources
which are “replicas” of the same logical information, providing replica updating and
synchronization.
Information History Layer (IH); this layer manages the history of Primitive Information
Units, providing navigation and traversal of the versioned information. At this
layer, primitive information units are identified through PRIs (URN) plus an optional
version parameter identifying the time-ordered position.
Fig. 3. IDN-SA layers and name spaces
Virtual Repository Layer (VR); it exposes the IDN APIs to the IDN-compliant
Applications, exploiting the services of the lower layers. VR is seen from the application as the
container-repository of all Primitive Information Units. The resolution of
human-friendly resource names (LRI, logical resource identifiers) into unique identifiers
(PRI) is realized in this layer by exploiting the LDNS (Logical Domain Name System)
service, which is logically located inside the VR layer (see figure 3, top-right side).
Exploiting the Information History Service (which manages versioning), VR
implements the UEVM, the Unified Extensional Versioning Model (Asklund, 2002),
to allow change traceability of the IDN-DAG structure as well as of non-structured
information unit contents. A sub-service of the VR layer, namely the Resource
Aggregation Service (RAS), is in charge of collecting the content from different PIUs and
building a document from this content upon a request received from the
IDN-applications.
Exploiting IDN for the Web of Data
IDN provides an infrastructural solution to address URI co-reference issues.
Indeed, IDN offers a way to reduce the uncontrolled and unmanaged proliferation of
URIs used to identify non-informational resources thanks to an approach based on
IDN-alias names. In this paper our aim is to describe how IDN can do so in those
situations where it is not strictly necessary to retrieve a specific representation of the
concept, but the concept itself is what matters. The main consideration to keep in mind is
that there are names pertaining to non-informational resources (i.e. concepts) and
names given to informational resources (i.e. representations of concepts). As an
example, let a researcher give a name (i.e. a URI) to the concept expressed by a
given theorem thesis. Of course this researcher will also give a name to the
representation of the theorem thesis and another name to the representation of the
theorem proof. Note also that a single concept may have multiple
representations. Let another researcher also solve the same problem, being unaware
of the work of the first researcher. This situation will eventually result in different
representations of both the thesis and the proof, but also in different names for the
same concept¹.
As seen in the section on the IDN Naming System, IDN allows, through its alias
functionality, a URI to be related to a new one. With the IDN alias-based approach it is
therefore possible to obtain a hierarchical structure of URIs (a depth-controlled tree
structure) which takes advantage of a controlled and manageable process for the creation
and discovery of identifiers. This is made possible because, when an alias has to be
created, IDN makes it possible to check whether the URI chosen as the alias target is
itself already an alias of another one. If so, the alias relation can be made directly to
this third URI. A depth of three or more can be obtained when it is necessary to scale with the
number of URIs that should be related to each other, distributing the load among two
or more servers, or when it is necessary to create alias links between URIs which already
have many aliases. As an example (see fig. 4), a URI_C has some aliases (URI_A
and URI_B) and a URI_3 has some other aliases (URI_1 and URI_2). Making URI_3
an alias of URI_C makes URI_1 and URI_2 aliases of URI_C without any other change
required on them.
¹ This situation is common in the sciences; for example, the Cook-Levin theorem was
independently proved in the same historical period.
Fig. 4. IDN alias-based approach
Besides, the IDN process of inverse resolution makes it possible to obtain all aliases of
an identifier in a two-step process. Starting from an alias, it is possible to reach the
root of the tree using direct resolution and then, exploiting reverse resolution,
to visit the tree and discover all identifiers.
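The alias tree and the two-step discovery can be sketched as follows. The `AliasTree` class and its methods are hypothetical and keep everything in one in-memory dictionary, whereas IDN distributes these relations across servers. The identifiers URI_A, URI_B, URI_C, URI_1, URI_2 and URI_3 come from the example of fig. 4, while URI_4 is added here only to show an alias linked directly to the root.

```python
class AliasTree:
    """Alias relations stored as child -> parent links; roots are canonical names."""

    def __init__(self):
        self.parent = {}

    def _chain(self, uri):
        """Return the path from uri up to the root of its tree."""
        chain = [uri]
        while chain[-1] in self.parent:
            chain.append(self.parent[chain[-1]])
        return chain

    def resolve(self, uri):
        """Direct resolution: follow alias links up to the root."""
        return self._chain(uri)[-1]

    def add_alias(self, alias, target, flatten=False):
        """Make alias an alias of target.

        With flatten=True a target that is itself an alias is replaced by
        its root, which keeps the tree shallow.
        """
        if flatten:
            target = self.resolve(target)
        if alias in self._chain(target):
            raise ValueError("alias link would create a cycle")
        self.parent[alias] = target

    def identifiers(self, uri):
        """Two-step discovery: reach the root, then visit the whole tree."""
        root = self.resolve(uri)
        found, frontier = {root}, [root]
        while frontier:
            node = frontier.pop()
            children = [a for a, p in self.parent.items() if p == node]
            found.update(children)
            frontier.extend(children)
        return found


tree = AliasTree()
tree.add_alias("URI_A", "URI_C")
tree.add_alias("URI_B", "URI_C")
tree.add_alias("URI_1", "URI_3")
tree.add_alias("URI_2", "URI_3")
tree.add_alias("URI_3", "URI_C")                 # URI_1 and URI_2 are left untouched
tree.add_alias("URI_4", "URI_1", flatten=True)   # linked directly to the root
everything = tree.identifiers("URI_2")
```

After the fifth call, URI_1 still points to URI_3 but resolves to URI_C, which is the behaviour described for fig. 4.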
When instead it is required to use different representations for non-informational
resources identified by different URIs, it would be beneficial to have a shared model
that introduces a common representation for non-informational resources. In these situations
it is possible to use IDN-Nodes to contain the resource representation related to a
URI and then to have all IDN-Nodes associated with the same concept aggregated
in an IDN-IM document. As an example, an IDN-node can have
http://dbpedia.org/resource/Berlin as its URI and, as data associated with the IDN-Node, the
URI http://dbpedia.org/page/Berlin. In this way it is possible to build an IDN-IM
document in which there is an aggregation relationship among different IDN-Nodes
about the same non-informational resource.
Conclusions
InterDataNet is an innovative architecture aiming at solving, at an infrastructural
level, the problems related to the physical and local distributions of structured data
and user identities over the Web, supporting collaboration-oriented features in the
direction of the Web of Data.
If a scalable infrastructure providing global addressability functions as well as
collaboration-oriented services could be defined and implemented, semantic
applications could be implemented more easily and could focus more on the
intelligence on top of it in an integrated, distributed way. This is the ambition of
InterDataNet.
Acknowledgments. We would like to thank Prof. Dino Giuli for his valuable
material and scientific support of this research activity. Moreover, we acknowledge the precious work
of Luca Capannesi, who provided technical support in the implementation stage.
References
Asklund U., (2002), Configuration Management for distributed development in an integrated
environment. Unpublished Doctoral Dissertation, Department of Computer Science, Lund
Institute of Technology, Lund University
Avgeriou P. & Zdun U., (2005), Architectural Patterns Revisited - a Pattern Language,
Proceedings of the 10th European Conference on Pattern Languages of Programs (EuroPlop
2005), Irsee, Germany, July
Bouquet P., Stoermer H., Cordioli, D. & Tummarello G., (2008), An Entity Name System
(ENS) for the Semantic Web, 5th European Semantic Web Conference, pp. 258-272
Jaffri, A., Glaser, H., & Millard, I. (2008). URI Disambiguation in the Context of Linked Data.
ECS EPrints Repository. In LDOW2008, April 22, 2008. Beijing, China. Retrieved July 16,
2009, from http://eprints.ecs.soton.ac.uk/15181/.
Hellman, E. (2009a) Go To Hellman: Semantic Web Asteism. Retrieved June 19, 2009, from
http://go-to-hellman.blogspot.com/2009/06/semantic-web-asteism.html.
Hendler, J., Shadbolt, N., Hall, W., Berners-Lee, T., & Weitzner, D. (2008). Web science: an
interdisciplinary approach to understanding the web. Commun. ACM, 51(7), 60-69. doi:
10.1145/1364782.1364798.
Idehen, K. (2009). Using Linked Data Solves Real Problems. Keynote speech, Semantic Web
Technology Conference 2009 San Jose California. Retrieved June 17, 2009, from
http://www.semantic-conference.com/session/2012/.
Innocenti, S. (2008) InterDataNet: nuove frontiere per l’integrazione e l’elaborazione dei dati:
visione e progettazione di un modello infrastrutturale per l’interdataworking Unpublished
Doctoral dissertation, University of Florence Italy
Melnik S. & Decker S. (2000). A Layered Approach to Information Modeling and
Interoperability on the Web. In Proceedings of the ECDL'00 Workshop on the Semantic
Web, September 18-20, 2000, Lisbon, Portugal.
Mockapetris P., (1987). RFC 1035: Domain names - implementation and specification. The
Internet Engineering Task Force.
OASIS, (2006) Reference Model for Service Oriented Architecture 1.0 OASIS Standard.
Pettenati M.C., Innocenti, S., Chini D, Parlanti D. and Pirri, F. (2008) Interdatanet: A Data Web
Foundation For The Semantic Web Vision, Iadis International Journal On Www/Internet
Vol.6 Issue 2
Prass, A., (2001). RFC 3444: On the Difference between Information Models and Data Models.
The Internet Engineering Task Force.
Richardson L. and Ruby S. (2007), RESTful Web Services. O'Reilly.
Zweben S.H., Edwards, S., Weide, B. Hollingsworth J. (1995). The Effects of Layering and
Encapsulation on Software Development Cost and Quality. IEEE Transactions on Software
Engineering, Vol. 21, No. 3, pp. 200-208.