InterDataNet Naming System: a Scalable Architecture for Managing URIs of Heterogeneous and Distributed Data with Rich Semantics

Davide Chini, Franco Pirri, Maria Chiara Pettenati, Samuele Innocenti, Lucia Ciofi

Electronics and Telecommunications Department
University of Florence
Via Santa Marta, 3
50139 Florence, Italy

Abstract. Establishing equivalence links between (semantic) resources, as is the case in the Linked Data approach, implies permanent search, analysis and alignment of new (semantic) data in a rapidly changing environment. Moreover, the distributed management of data brings non-negligible requirements as regards their authorship, update, versioning and replica management. Instead of providing solutions for the above issues at the application level, our approach relies on the adoption of a common layered infrastructure: InterDataNet (IDN). The core of the IDN architecture is the Naming System, which aims to provide a scalable and open service to support consistent reuse of entities and their identifiers, enabling a global reference and addressing mechanism for convenient retrieval of resources. The IDN architecture also provides basic collaboration-oriented functions for (semantic) data, featuring authorship control, versioning and replica management through its stack layers.

Keywords: interoperability, infrastructure, architecture, scalability, naming system, URI resolution, Web of Data, collaboration

1. Introduction

The main vision of the future Web takes the Semantic Web as its final goal, a “global space for the seamless integration of knowledge bases into a global, open, decentralized and scalable knowledge space” (Hellman, 2009a). However, it has been understood that the realization of the Semantic Web requires a preliminary step: the so-called Web of Data (Hendler et al., 2008).
Within the context of the Web of Data, the creation, access, integration and dissemination of (semantic) data are pivotal. In recent times, Linked Data, “an emerging meme deeply rooted in Web architecture, has emerged as a viable and powerful vehicle for applying the essence of the Web (URIs)” (Idehen, 2009) to the pursuit of the availability of a large amount of semantic data for building Web-wide semantic applications. Linked Data is thus a way of publishing data in the direction of the Web of Data, in which great importance is given to the concept of resource identification.

However, several issues are still open in the realization of a Web of Data/Semantic Web. These issues stem primarily from the well-known problem of co-reference. Co-reference on the Semantic Web can occur in two ways: the first is when a single URI identifies more than one resource, and the second is when multiple URIs identify the same resource. Both situations occur frequently in Linked Data applications (Jaffri, 2008).

URI disambiguation solutions currently adopted within the Linked Data community rely heavily on an "ex-post" approach: links are established between resources that are considered “equivalent”. More specifically, an owl:sameAs statement is created between the different URIs denoting the entities. Indeed, owl:sameAs interlinking leads to the creation of an unconstrained graph of URIs because, when a new link is created, one can have only a partial view of the pre-existing graph of URIs. Such an approach entails two main unwanted consequences: 1) in a highly dynamic and extremely rapidly growing environment, the permanent search, analysis and alignment of new data is an extremely hard task; 2) data management and/or reasoning in a distributed environment that contains owl:sameAs relations is a non-horizontally-scalable task, because of its computational complexity (Bouquet, 2008).
This is one of the open issues that delay the shift from many “local” semantic webs to one “global” Semantic Web.

Starting from these assumptions, the InterDataNet (IDN) architecture presented in this work, which follows an original path of research within the context of the Web of Data, offers some features that can help the development of the future Semantic Web. The IDN infrastructure as a whole satisfies two main functions: 1) providing a scalable and open service to support a consistent reuse of entities and their identifiers, that is, a global reference and addressing mechanism for locating and retrieving resources in a collaborative environment; 2) providing basic collaboration-oriented functions, namely authorship control, versioning and replica management.

TCP/IP and layered internetworking solutions allowed the Web of Documents to come true. Similarly, in the Future Internet vision, where data tend to be active and smart entities supporting applications living in the network and content is generated by end users, a huge graph of interlinked data would be integrated much more easily and quickly if we could count on an "interdataworking" infrastructure. We define "interdataworking" as the ability to create, connect, distribute, integrate and query data across different sources on a web-wide scale.

In this paper we present InterDataNet (Pettenati, Innocenti, Chini, Parlanti and Pirri, 2008), (Innocenti, 2008), an infrastructural solution supporting a decentralized and scalable publication space for the Web of Data. IDN sustains global addressability of concepts and resources as well as basic collaboration-oriented services (authorship control, versioning and replica management) for distributed and heterogeneous (semantic) data management, thus allowing the needed consistent reuse and mapping of entity identifiers. The IDN layered middleware aims to provide an architectural solution in the direction of an interdataworking vision.
The IDN framework

To obtain a scalable linked data system, we first of all have to provide a shared Information Model (Prass, 2001) to enable data interoperability. We observe that an Information Model is effective when it is supported by a reference Service Architecture that handles it with global data addressability. We have designed this approach as a service-oriented middleware named IDN (InterDataNet).

The adopted approach distributes information properties and characteristics across layers that address their representation at different levels of abstraction. Each layer is assigned a basic service task for data processing and linking. Layering is the architectural pattern used to pursue scalability and legacy data integration (Avgeriou, 2005) at the infrastructural level, designing an open integrated environment to distribute and enrich knowledge around data (Melnik, 2000). Analogously to the Web-style approach, we pursue a "good-enough" solution to this problem because it is at present the only way to obtain scalability in a Web-wide scenario. IDN exposes an API set to transparently facilitate data handling at a higher level. We represent information in layers, from a physical view (at the IDN bottom layer) to a logical-abstract one (at the IDN top layer).

We hence use this set of conceptual and technological design paradigms:
- the design of a layered (Zweben, 1995) middleware, following the service oriented architecture (SOA) approach (OASIS, 2006); this will allow us to develop loosely coupled and interoperable services which can be combined into more complex systems;
- the use of REST style (Representational State Transfer) services, to make InterDataNet an explicitly resource-centric infrastructure.

As a consequence, IDN aims to be fully compliant with the following architectural requirements (Richardson, 2007):
- communication should be stateless: each request must contain all the information required to be completely understood;
- resources have to be cacheable;
- the system has to expose a uniform interface. In other terms, each resource has to be globally addressable through URIs and:
  - the system handles resources through their representations (resources are logical entities, whereas representations are physical descriptions of them; each resource can have one or more representations and is decoupled from them);
  - messages handled by the system are self-descriptive because they contain metadata (metadata can be about the connection, such as authentication data, about the resource representations, such as their content type, and so on);
  - resource representations can contain links to browse through the application states (for example, a request which creates a resource should return a link to a representation of that resource);
- finally, the system has to be layered.

The IDN framework is described through the ensemble of concepts, models and technologies pertaining to the following two views.

IDN-IM (InterDataNet Information Model). It is the shared information model representing a generic document model which is independent from specific contexts and technologies. It defines the requirements, desirable properties, principles and structure of the documents to be managed by IDN.

IDN-SA (InterDataNet Service Architecture). It is the layered architectural model handling IDN-IM documents (it manages the concrete IDN-IM instances, allowing users to “act” on pieces of information and documents). The IDN-SA implements the reference functionalities, defining subsystems, protocols and interfaces for the collaborative management of IDN documents. The IDN-SA exposes an IDN-API (Application Programming Interface) on top of which IDN-compliant Applications can be developed.
The IDN reference Information Model

An Information Model can be defined as a universal representation of the entities in a managed environment, that is, of their properties, operations and relationships. It is independent from any specific repository, application, protocol or platform (Prass, 2001). The adoption of an Information Model thus implies the capability to support a number of concrete Data Models. This capability enables scalability and adaptability of the model in different contexts.

Generic information modeled in IDN is formalized as an aggregation of elementary data units, named Primitive Information Units (PIUs). Each Primitive Information Unit contains generic data and metadata (see figure 1a); at a formal level, a Primitive Information Unit is a node in a directed acyclic graph (DAG) (see figure 1b). It is worth recalling that a (rooted) tree structure is a specific case of DAG in which each node has at most one parent. All data and metadata are handled, or simply stored, by the Service Architecture.

An IDN-document structures information units and is composed of nodes related to each other through directed “links”. Moreover, IDN-documents can be inter-linked, so two main link types are defined in the Information Model:
- aggregation links, to express relations among nodes inside an IDN-document;
- reference links, to express relations between distinct IDN-documents.

Fig. 1. Example of IDN-IM primitive information units and documents

Each PIU belonging to the document can also be addressed as a document root node, increasing information granularity and reuse. IDN-IM documents express data contents and relations between contents. These information elements are structured and specialized inside each node, complying with a formal XML schema description. Data and metadata are structured following the name-value representation and embedded inside the node.
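The information model described above can be illustrated with a minimal Python sketch. It is not the IDN implementation: the class name `PIU`, its fields and the helper functions are hypothetical names chosen for illustration, and the XML schema mentioned in the text is replaced here by plain name-value dictionaries. The sketch only shows how aggregation links can be kept acyclic and how any PIU can serve as the root of a document.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # nodes are compared by identity, not by content
class PIU:
    """Hypothetical Primitive Information Unit: name-value data and metadata."""
    name: str
    data: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    aggregation: list = field(default_factory=list)  # links to PIUs of the same document
    references: list = field(default_factory=list)   # links to roots of other documents


def reaches(start, target):
    """Return True if target can be reached from start through aggregation links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.aggregation)
    return False


def add_aggregation(parent, child):
    """Add an aggregation link parent -> child, refusing links that would create a cycle."""
    if reaches(child, parent):
        raise ValueError("aggregation link would create a cycle")
    parent.aggregation.append(child)


def document_nodes(root):
    """Names of the PIUs in the document rooted at root, each listed once."""
    order, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        order.append(node.name)
        stack.extend(reversed(node.aggregation))
    return order


# A small document in which "mail" has two parents: a DAG, not a tree.
person = PIU("person", metadata={"author": "example"})
address = PIU("address")
mail = PIU("mail", data={"value": "example"})
add_aggregation(person, address)
add_aggregation(person, mail)
add_aggregation(address, mail)

print(document_nodes(person))   # whole document
print(document_nodes(address))  # a PIU addressed as a document root
```

Comparing nodes by identity rather than by content is a design choice of this sketch: two PIUs with equal data remain distinct nodes, which matches the idea that each node is individually addressable.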
The IDN architecture can hand IDN-IM documents to higher level applications not only in the specific IDN format but also in RDF format, to offer full compatibility with Semantic Web applications.

The three-layer IDN Naming System

In accordance with the Linked Data approach, the IDN naming system adopts a URI-based naming convention to address IDN-nodes (Pettenati et al., 2008). The IDN architecture envisages a three-layer naming system (see figure 2):
- in the upper layer, Logical Resource Identifiers (LRIs) are used to allow IDN-applications to identify IDN-nodes. Each IDN-node can be referred to through a globally unique canonical name and one or more "aliases";
- in the second layer, Persistent Resource Identifiers (PRIs) are used to unambiguously, univocally and persistently identify resources within the IDN-middleware environment, independently of their physical locations;
- in the lower layer, Uniform Resource Locators (URLs) are used to identify resource replicas as well as to access them. Each resource can be replicated many times, and therefore many URLs will correspond to one PRI.

Resolution processes are required to access a resource starting from its canonical name or from an alias. As LRIs, PRIs and URLs are sub-classes of URIs, they are hierarchical and their direct and inverse resolution is possible using the DNS (Domain Name System) (Mockapetris, 1987) and a REST-based approach.

Fig. 2. Three-layer IDN naming system

The sequence of events involved in the resolution process is detailed as follows:
- a generic application needs to fetch a resource and sends a GET request to its URI, for example: http://idn-nodes.example.com/nodes/miller_mail
- at a lower level, the operating system running the application is in charge of resolving “idn-nodes.example.com” through DNS (the application ignores this step and, theoretically, the whole IDN system can ignore it as well) and provides the application with a TCP connection to the resolved host;
- the application, as soon as the connection is available, sends the GET operation to the upper layer of the IDN stack (VR, Virtual Repository, see the IDN Service Architecture section), which is authoritative on the whole name. As the host is authoritative over the name, it can access the whole metadata set related to this name. This mechanism is highly scalable: it is possible to replicate the hostname at DNS level and split the computational load among different servers, and/or it is possible to use reverse proxies to spread it over more servers in a hierarchical way;
- the IDN system (specifically the VR instance to which the application is connected) hides the PRI name from the application and continues the process (next steps) autonomously; hence, the VR instance performs a GET operation using the PRI on the authoritative host of the PRI itself (an instance of IH/RM/LS, also described in the IDN Service Architecture section), in which the associations “PRI → URLs” are stored;
- the central layers of the IDN stack (described in the IDN Service Architecture section) handle, on a need basis, node versioning (Information History layer) and replication (Replica Management layer) and access the lower layer of the IDN stack, the Storage Interface (described in the IDN Service Architecture section), to bring back the requested information;
- the instance of the IDN stack central layers provides the response to the VR layer;
- the VR instance provides the response to the application.

When a node name has to be added to the naming system, the architecture proceeds as follows: once the name to be added has been chosen, the name itself indicates which server (that is, the authoritative one) has to be contacted to add it. Then, if the requestor has the rights to perform the operations involved in the process, a new entry is created in the local name server. A PUT operation is used when the client chooses the new name (or updates the data connected to an already defined one), whereas a POST operation is used when the client does not choose the new name but requests it from the architecture.

The IDN Service Architecture

The IDN-SA provides an effective and efficient infrastructural solution for IDN-IM implementation. IDN-SA is a layered service-oriented architecture composed of four layers (see figure 3, left side, from bottom to top):
- Storage Interface Layer;
- Replica Management Layer;
- Information History Layer;
- Virtual Repository Layer.

The IDN-compliant Application is built on top of the Virtual Repository layer, which exposes the IDN APIs. The functions of the IDN-SA layers are briefly specified hereafter, starting from the bottom of the stack. For the sake of brevity, in this section we will not go into detail on two aspects, versioning and replica management. Their integration in the IDN architecture is fundamental in order to provide collaboration-enabling functions, but their detailed description goes beyond the scope of the present paper.

Storage Interface Layer (SI): this layer provides a REST-like uniform view over distributed data, independently of their location and physical storage platform. This layer is ultimately devoted to providing physical addressability to resources through URL addresses (see figure 3, bottom-right side).
Replica Management Layer (RM): this layer provides a delocalized view of the resources to the upper layer, offering PRI (Persistent Resource Identifiers, which are used here to identify resources) to URL address resolution through a service called LS (Localization Service). This layer is in charge of handling the set of physical resources which are “replicas” of the same logical information, providing replica updating and synchronization.

Information History Layer (IH): this layer manages the history of Primitive Information Units, providing navigation and traversal of the versioned information. At this layer, primitive information units are identified through PRIs (URN) plus an optional version parameter identifying the time-ordered position.

Fig. 3. IDN-SA layers and name spaces

Virtual Repository Layer (VR): it exposes the IDN APIs to the IDN-compliant Applications, exploiting the services of the lower layers. The VR is seen by the application as the container-repository of all Primitive Information Units. The resolution of human-friendly resource names (LRIs, Logical Resource Identifiers) into unique identifiers (PRIs) is realized in this layer by exploiting the LDNS (Logical Domain Name System) service, which is logically located inside the VR layer (see figure 3, top-right side). Exploiting the Information History Service (which manages versioning), the VR implements the UEVM, the Unified Extensional Versioning Model (Asklund, 2002), to allow traceability of changes to the IDN-DAG structure as well as to non-structured information unit contents. A sub-service of the VR layer, namely the Resource Aggregation Service (RAS), is in charge of collecting the content from different PIUs and building a document from this content upon a request received from IDN-applications.

Exploiting IDN for the Web of Data

IDN provides an infrastructural solution to address URI co-reference issues.
Indeed, IDN offers a way to reduce the uncontrolled and unmanaged proliferation of URIs used to identify non-informational resources, thanks to an approach based on IDN alias names. In this paper our aim is to describe how IDN can do so in those situations where it is not strictly necessary to retrieve a specific representation of the concept, but the concept itself is what matters.

The main consideration to remember is that there are names pertaining to non-informational resources (i.e. concepts) and names given to informational resources (i.e. representations of concepts). As an example, let a researcher give a name (i.e. a URI) to the concept expressed by a given theorem thesis. Of course this researcher will also give a name to the representation of the theorem thesis and another name to the representation of the theorem proof. Note also that a single concept may have multiple representations. Let another researcher solve the same problem, unaware of the work of the first researcher. This situation will eventually result in different representations of both the thesis and the proof, but also in different names for the same concept (1).

(1) This situation is common in the sciences; for example, the Cook-Levin theorem was independently proved in the same historical period.

As seen in the section on the IDN Naming System, IDN allows, through the alias functionality, a URI to be related to a new one. With the IDN alias-based approach it is then possible to obtain a hierarchical structure of URIs (a depth-controlled tree structure) which takes advantage of a controlled and manageable process for the creation and discovery of identifiers. This is made possible because, when an alias has to be created, IDN makes it possible to see whether the URI chosen as the target of the alias is itself already an alias of another one. It is therefore possible to establish the alias relation directly with this third URI. A depth of three or more can be obtained when it is necessary to scale with the number of URIs that should be related to each other, distributing the load among two or more servers, or when it is necessary to create alias links between URIs which already have many aliases. As an example (see fig. 4), a URI_C has some aliases (URI_A and URI_B) and a URI_3 has some other aliases (URI_1 and URI_2). Making URI_3 an alias of URI_C makes URI_1 and URI_2 aliases of URI_C without any other change required on them.

Fig. 4. IDN alias-based approach

Besides, the IDN inverse resolution process makes it possible to obtain all the aliases of an identifier in a two-step process. Starting from an alias, it is possible to reach the root of the tree using direct resolution and then, exploiting reverse resolution, to visit the tree and discover all identifiers.

When it is instead required to use different representations for non-informational resources identified by different URIs, it would be profitable to have a shared model to introduce a common representation of the non-informational resource. In these situations it is possible to use IDN-Nodes to contain the resource representation related to a URI and then to have all the IDN-Nodes associated with the same concept aggregated in an IDN-IM document. As an example, an IDN-Node can have http://dbpedia.org/resource/Berlin as its URI and, as data associated with the IDN-Node, the URI http://dbpedia.org/page/Berlin. In this way it is possible to build an IDN-IM document in which there is an aggregation relationship among different IDN-Nodes about the same non-informational resource.
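The alias-based approach and the two-step inverse resolution discussed in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `AliasTree` and its method names are hypothetical and do not come from the IDN implementation, the whole tree is kept in memory on a single host, and distribution of the tree over several servers is ignored. The example reproduces the URI_A ... URI_3 scenario described above.

```python
class AliasTree:
    """Hypothetical in-memory model of alias links between URIs."""

    def __init__(self):
        self.parent = {}    # alias URI -> URI it is an alias of
        self.children = {}  # URI -> list of its direct aliases

    def resolve(self, uri):
        """Direct resolution: follow alias links up to the canonical (root) URI."""
        while uri in self.parent:
            uri = self.parent[uri]
        return uri

    def add_alias(self, alias, target):
        """Make alias an alias of target.

        If target is itself an alias, the link is made directly to its root,
        as described in the text. Existing aliases of `alias` are left untouched.
        """
        if alias in self.parent:
            raise ValueError(f"{alias} is already an alias")
        root = self.resolve(target)
        if root == alias:
            raise ValueError("alias link would create a cycle")
        self.parent[alias] = root
        self.children.setdefault(root, []).append(alias)

    def all_names(self, uri):
        """Inverse resolution in two steps: reach the root, then visit the tree."""
        root = self.resolve(uri)
        names, stack = [], [root]
        while stack:
            node = stack.pop()
            names.append(node)
            stack.extend(self.children.get(node, []))
        return sorted(names)


tree = AliasTree()
tree.add_alias("URI_A", "URI_C")
tree.add_alias("URI_B", "URI_C")
tree.add_alias("URI_1", "URI_3")
tree.add_alias("URI_2", "URI_3")

# Making URI_3 an alias of URI_C requires no change on URI_1 and URI_2.
tree.add_alias("URI_3", "URI_C")

# A new alias pointed at URI_1 is linked directly to the root URI_C.
tree.add_alias("URI_X", "URI_1")

print(tree.resolve("URI_1"))
print(tree.all_names("URI_2"))
```

Linking every new alias to the root of its target keeps the tree shallow; in this sketch, deeper levels only arise when a URI that already has aliases of its own is itself made an alias, as with URI_3.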
Conclusions

InterDataNet is an innovative architecture aiming to solve, at an infrastructural level, the problems related to the physical and local distribution of structured data and user identities over the Web, supporting collaboration-oriented features in the direction of the Web of Data. If a scalable infrastructure providing global addressability functions as well as collaboration-oriented services could be defined and implemented, semantic applications could be implemented more easily and could focus more on the intelligence built on top of it, in an integrated and distributed way. This is the ambition of InterDataNet.

Acknowledgments

We would like to thank Prof. Dino Giuli for his valuable material and scientific support to this research activity. Moreover, we acknowledge the precious work of Luca Capannesi for the technical support in the implementation stage.

References

Asklund, U. (2002). Configuration Management for distributed development in an integrated environment. Unpublished doctoral dissertation, Department of Computer Science, Lund Institute of Technology, Lund University.

Avgeriou, P. & Zdun, U. (2005). Architectural Patterns Revisited - a Pattern Language. Proceedings of the 10th European Conference on Pattern Languages of Programs (EuroPlop 2005), Irsee, Germany, July.

Bouquet, P., Stoermer, H., Cordioli, D. & Tummarello, G. (2008). An Entity Name System (ENS) for the Semantic Web. 5th European Semantic Web Conference, pp. 258-272.

Jaffri, A., Glaser, H. & Millard, I. (2008). URI Disambiguation in the Context of Linked Data. ECS EPrints Repository. In LDOW2008, April 22, 2008, Beijing, China. Retrieved July 16, 2009, from http://eprints.ecs.soton.ac.uk/15181/.

Hellman, E. (2009a). Go To Hellman: Semantic Web Asteism. Retrieved June 19, 2009, from http://go-to-hellman.blogspot.com/2009/06/semantic-web-asteism.html.

Hendler, J., Shadbolt, N., Hall, W., Berners-Lee, T. & Weitzner, D. (2008). Web science: an interdisciplinary approach to understanding the web. Commun. ACM, 51(7), 60-69. doi: 10.1145/1364782.1364798.

Idehen, K. (2009). Using Linked Data Solves Real Problems. Keynote speech, Semantic Web Technology Conference 2009, San Jose, California. Retrieved June 17, 2009, from http://www.semantic-conference.com/session/2012/.

Innocenti, S. (2008). InterDataNet: nuove frontiere per l’integrazione e l’elaborazione dei dati: visione e progettazione di un modello infrastrutturale per l’interdataworking. Unpublished doctoral dissertation, University of Florence, Italy.

Melnik, S. & Decker, S. (2000). A Layered Approach to Information Modeling and Interoperability on the Web. In Proceedings of the ECDL'00 Workshop on the Semantic Web, September 18-20, 2000, Lisbon, Portugal.

Mockapetris, P. (1987). RFC 1035: Domain names - implementation and specification. The Internet Engineering Task Force.

OASIS (2006). Reference Model for Service Oriented Architecture 1.0. OASIS Standard.

Pettenati, M.C., Innocenti, S., Chini, D., Parlanti, D. & Pirri, F. (2008). Interdatanet: A Data Web Foundation For The Semantic Web Vision. IADIS International Journal on WWW/Internet, Vol. 6, Issue 2.

Prass, A. (2001). RFC 3444: On the Difference between Information Models and Data Models. The Internet Engineering Task Force.

Richardson, L. & Ruby, S. (2007). RESTful Web Services. O’Reilly.

Zweben, S.H., Edwards, S., Weide, B. & Hollingsworth, J. (1995). The Effects of Layering and Encapsulation on Software Development Cost and Quality. IEEE Transactions on Software Engineering, Vol. 21, No. 3, pp. 200-208.