Freelib: Peer-to-peer-based Digital Libraries*

A. Amrou, K. Maly, M. Zubair
Computer Science Department, Old Dominion University, Norfolk, VA, USA

* This work is supported in part by NSF grant 0333547.

Abstract

In this paper, we propose a P2P-based digital library that combines the advantages of P2P networks and digital libraries. The key problems with P2P network searches are low recall, long time to completion, and high use of network bandwidth. We introduce Freelib, a universal client that, once installed on a user’s machine, connects itself to a P2P network and after a few searches becomes aware of the community the user belongs to. The architecture of Freelib is such that a web of connections is created automatically for people who share common interests, i.e., who issue similar searches. We report on the first prototype client, the design changes made in the architecture as we actually built the client, and the development of an emulator that can validate the working of the client in a real P2P network.

1. Introduction

Digital libraries, the Web, and P2P networks all serve to disseminate information. Digital libraries have the advantage that the content is usually quality controlled and searching can be done through structured metadata. P2P networks have the advantage of sustainability, as they do not rely on any central organization to provide resources such as servers and maintenance staff. We combine the advantages of digital libraries with those of P2P networks to propose Freelib, a P2P-based digital library that addresses the sustainability problem that traditional digital libraries face because of the centralized framework used for their deployment. The two key concepts in Freelib that we introduced to address these issues are an overlay (logical) network whose topology has the ‘small world’ property and a virtual grouping of nodes into communities of common interests.

The small world property [1, 2] refers to networks in which any node can reach any other node in a ‘small’ (typically 7) number of steps. We define a community of common interests as one whose nodes mostly access objects held by other members of the community. Clearly, membership in a community is not static; it changes with the evolving interest of the user. The Freelib architecture brings users sharing similar interests closer, in the sense that searches will reach all nodes containing ‘interesting’ objects in a small number of hops. In other words, shared items will be close to their points of need. This significantly enhances the user experience, as users get relevant results faster. Community evolution in Freelib is thus based on access pattern analysis, where access means downloading an object that was found through a search. In addition, it is done in a distributed way to avoid introducing centralized components into the architecture.

In order to evolve the users into communities of common interest, we need to: 1) capture the individual user’s interest, 2) identify those peers that share the same interest, and 3) build an overlay network such that peers sharing common interests are connected to each other or at least close to each other on the network. To identify peers that share common interests with a user, every node maintains a ranked list of known peers (we will also call them friends). This ranking is done according to the interest of the node’s local user, i.e., the top-ranked peers are those the node accessed most or those that accessed the node most. We call the resulting overlay network the access network or the friend network.

Researchers have studied small world networks in great depth, and a number of protocols exist that maintain networks such that they keep the small world property, with paths bounded by a small number of hops (call it k).
Symphony [3] is one such protocol that is particularly well suited to our domain, and we adapt it to Freelib such that we can bound any search in Freelib by k hops.

The original concept of Freelib was described in [4], which also gave a brief description of the architecture and a potential implementation. In this paper, we report on the changes in the architecture, the client implementation, the discovery protocol, and the preliminary experimentation using an emulator we built for this purpose.

The remainder of this paper is organized as follows. Section 2 reviews the background and related work. Section 3 presents the Freelib architecture as it has evolved from the original one [4]. Section 4 reports on the experiments. Section 5 gives a brief description of our prototype implementation. Section 6 concludes and describes our future work.

2. Related work

Several peer-to-peer search algorithms and techniques have been devised by researchers and deployed in real networks. Napster [5] is a peer-to-peer system that uses a centralized index for searching, while file downloads are done directly between peers. Freenet [6], for example, uses Depth-First Search (DFS), which helps achieve anonymity. Other systems like Gnutella use Breadth-First Search (BFS). These search algorithms suffer from several drawbacks and challenges [7]. DFS is suitable for retrieving items given their identifier; if used for general keyword search, it performs poorly. BFS is more suitable for general keyword search. However, it is very bandwidth inefficient, as the number of messages grows exponentially with the TTL and the average node degree (# messages = $d^{TTL}$, where d is the average degree of the nodes and TTL is the hop limit).

There has been considerable effort by researchers to address those peer-to-peer search issues. Several techniques for enhancing the search performance were presented in [8].
These techniques include: 1) Iterative Deepening: sending searches with successively increasing TTL until the query is satisfied, 2) Directed DFS (Depth-First Search): sending only to peers that provided good results recently, and 3) Local indices: building local indices at each node that index the content on the peers within r hops.

Some peer-to-peer systems utilize caching, replication, and the concept of super-peers to enhance the system performance. Kazaa [9] is an example of a peer-to-peer network that uses nodes with high network bandwidth as super-nodes. Those super-nodes cache content from other nodes and handle most of the request forwarding and traffic. Other peer-to-peer protocols try to enhance the selectivity of the search, that is, to target peers that have relevant content. SSW [10], for example, clusters member nodes based on the similarity of their content such that nodes containing similar material are connected together. To measure the similarity of documents, the documents are represented in some data structure such as a keyword vector, and a clustering algorithm is run on them. This approach is, however, complicated and does not evolve quickly when the user’s interest changes between topics. These approaches are data-centered approaches.

3. Freelib architecture

The main objectives of the Freelib architecture are: (a) utilize access patterns to improve searches, (b) ensure that the network is connected and that any node in the network can reach any other node in a small number of hops, and (c) keep the network easy to maintain in the presence of frequent leaves and joins. The Freelib architecture currently has two overlay networks, the access network and the support network, to achieve objectives (a) and (b). In our initial design we had a third overlay network, the migration network, to make the joining of new nodes efficient. However, we found that the overhead of maintaining this network outweighs the benefits.
The support network is based on the Symphony protocol [3]. It maintains the connectivity of the network and enables new nodes to perform searches and discover their friends (nodes with common interests). The access network is based on the user’s access pattern, and it brings nodes sharing similar interests close to each other on the overlay network.

Figure 1: Freelib network architecture (support network with short and long contacts; access network with friend (access) links between nodes of evolving communities)

To address objective (c), we need a mechanism for nodes, when rejoining, to discover the latest information (like the IP address) of their friends. When a node rejoins the network, the information it is keeping about its friends might already be outdated (e.g., those nodes might have changed their IP address if using a dialup connection; also, nodes might change their locations on the Symphony ring when they rejoin). To be able to contact its friends, the node needs some way of discovering the most recent information about its friend nodes. We explored three approaches to address this issue: flooding the network, a distributed hash table, and link discovery.

We now discuss in some detail the two overlay networks and the discovery protocol. Figure 1 shows the Freelib network architecture, which is similar to the one reported in [4]. In the next section we briefly review the original design, and in the sections thereafter we report on the new aspects that we developed as we built the first client.

3.1. Summary of previous work

3.1.1. Support Network. As mentioned earlier, the support network enables new nodes to search and discover their friends. In addition, it provides the means for building efficient discovery protocols for locating and discovering friends upon rejoining, as will be discussed below. Furthermore, the support network maintains a small-world property [3] and also ensures that the network is always connected.
Small-world networks have the desirable property that the expected path length between two arbitrary nodes is very small compared to the number of participating nodes [1, 2]. To obtain and maintain this property we make use of the Symphony [3] protocol, originally developed to implement distributed hash tables efficiently. The Symphony protocol arranges nodes uniformly on a unit ring. Each node maintains two types of contacts: short contacts and long contacts. Short contacts are neighbors on the ring. Long contacts are nodes far away on the ring. The selection of the locations of long contacts is the key factor that affects the average path length within the network. Symphony selects these locations by sampling from a family of harmonic probability distributions, which in turn gives an expected path length on the order of $(\log^2 n)/k$, where n is the number of nodes and k is the number of long contacts per node.

3.1.2. Access Network. The Freelib protocol captures the user’s interest transparently as the user searches and accesses documents from other nodes. The access network evolves based on the user’s interest such that users sharing similar interests are connected together or at least close to each other on the access network. The access pattern analysis is done locally by every node in a distributed way to avoid introducing centralized components into the Freelib architecture. Every node ranks its peers according to the local user’s interest in those peers. The topmost peers on the ranked list are those in which the local user shows the most interest, that is, from which the user downloads the most documents (and vice versa). Every node establishes friend relationships with the first f peers on its ranked list, where f is a parameter of the system. These relationships form the access network. The access network evolves as the user’s interest changes. Once nodes have enough friends, search requests are forwarded using the friend links only.
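The ranking and friend selection described above can be sketched as follows. This is a minimal illustration rather than the Freelib client's code: the class and method names are hypothetical, and the sketch assumes that downloads in both directions carry equal weight and that ties are broken by peer identifier, neither of which is specified here.

```python
from collections import Counter


class PeerRanker:
    """Ranks known peers by how many items were exchanged with the local node."""

    def __init__(self, max_friends):
        self.max_friends = max_friends  # the system parameter f
        self.outgoing = Counter()       # downloads the local user made from each peer
        self.incoming = Counter()       # downloads each peer made from the local collection

    def record_download_from(self, peer_id):
        """The local user downloaded an item from peer_id."""
        self.outgoing[peer_id] += 1

    def record_download_by(self, peer_id):
        """peer_id downloaded an item from the local collection."""
        self.incoming[peer_id] += 1

    def ranked_peers(self):
        """Peers ordered by total accesses, most accessed first."""
        # Assumption: both directions count equally; ties break on peer id.
        totals = self.outgoing + self.incoming
        return sorted(totals, key=lambda peer: (-totals[peer], peer))

    def friends(self):
        """The first f peers on the ranked list."""
        return self.ranked_peers()[: self.max_friends]
```

Because the ranking is recomputed from purely local counters, each node can update its friend list after every download without contacting any central component.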
Most of the relevant results are expected to be returned within very few hops on the access network. These few hops cover the nodes whose interests are closest to those of the requesting node.

3.1.3. Ranking. Capturing the user’s interest can be done in many ways. User feedback is one such way. However, it requires explicit interaction with the user. The implicit and transparent alternative we have selected is for every node to monitor user accesses to identify those peers in which the user is most interested and those which are interested in the user’s collection. We use the terms user and node interchangeably here, and it should be remembered that not every node contains a collection; that is, some users install the Freelib client only to search and have no intention of publishing documents. The main principle is that two nodes that access each other quite often most likely have similar interests and will most likely access each other in the future. This principle is used in other fields such as computer architecture, e.g., RISC instruction sets and cache and memory hierarchy design. It is important to note that by accesses, we mean downloads of items rather than looking at hits returned to a user’s query. Downloading items after viewing their descriptions and metadata is more indicative of user interest in those items than just looking at metadata.

3.2. An improved architecture

3.2.1. Friends and Communities. In the original design we had a migration protocol to have friends literally move towards each other on the ring network. The purpose of the migration layer was to facilitate the discovery of the node’s community. As we tried to implement the protocol, it turned out to be quite complex. We needed specific steps to discover when a node should migrate to a different location on the ring. The ring structure also induces a partial ordering of the communities that is not necessarily linked to the closeness of the communities in terms of content.
The ring structure would have had a definite advantage in the ease of describing communities (how many members, who the members are) and of implementing exhaustive searches within a community efficiently. We believe these advantages do not outweigh the complexity of the protocol and the overhead that every operation incurs to maintain the communities.

3.2.2. Search modes. When a user joins for the first time (a new node), a global search mode is used. In this search mode, search requests are forwarded using the support network contacts. As the user starts to access other peers, the local node starts establishing friend links to the peers at the top of the ranked list. Once the node detects that enough friend links have been established, it switches to community search mode. In this mode, search requests are forwarded using only the friend links. If the user is not satisfied with the results, she can change the client configuration to force a global search or a limited global search. This allows the search request to reach nodes outside the community. This feature is especially helpful when the user submits a search for topics that are not relevant to her community. The limited global search is a combination of the two modes. The search uses both the friend links and the contact links (short and long); however, the number of hops along each forwarding path is limited not by the diameter of the network (the small-world diameter) but by a pre-set parameter.

3.2.3. Discovery protocols. Peer-to-peer networks are usually characterized by frequent joins and leaves. In addition, peers are typically short-lived. Thus, when a node rejoins, its information about friend nodes (most importantly the host IP address and port number) might already have become obsolete. A discovery protocol that enables nodes to locate their old friends is needed. The main operation the discovery protocol performs is to return a node’s current information given the node’s ID (assuming the node is alive). The discovery protocol highlights the need for unique IDs for nodes. We use the Universally Unique Identifiers (UUIDs) defined in [11] for that purpose. In addition to their use in the discovery protocols, unique identifiers are employed locally by every node as hash keys for ranking and for keeping peers’ information.

We propose three different discovery protocols: flooding, DHT discovery, and link discovery. We implemented both the flooding and the DHT discovery protocols in our prototype. However, we chose to enable DHT discovery, as it consumes much less network bandwidth than flooding and has a lower worst-case complexity. We now give some details on the DHT discovery implementation.

The DHT discovery protocol is implemented using a Distributed Hash Table (DHT) on top of the support layer. Specifically, we build a distributed hash table of all UUIDs and store them at the nodes of the Symphony ring. Each node stores the (IP, port) information of all nodes whose UUIDs hash into the interval between the node and its successor. Remember that in Symphony, nodes are assigned a random position on the unit ring and each node is said to manage the interval to the next node. If the hash function chosen is good, then each node should manage the ID of just one other node on average.

The proposed discovery DHT maps from node UUIDs to node information. Whenever a node joins the network, it inserts an entry for itself in the DHT. The DHT entry for a Freelib node is maintained by the peer that owns the ring location generated by applying a universal hash function to the node’s UUID. A node owns (manages) all the locations between its own location and the location of the next node on the Symphony ring. Whenever a node wants to discover the information of a peer, it just performs a DHT lookup: it directs its discovery request to the location generated by applying the universal hash function to the peer’s UUID. The DHT discovery protocol is more efficient in terms of bandwidth usage. Every discovery request needs $(\log^2 n)/k$ hops, and at every hop the request is forwarded to only one peer. Thus, the number of messages per discovery request is $(\log^2 n)/k$ in the worst case.

4. Experiments

We have completed our initial experimentation with the prototype. We utilized a cluster of 32 machines to emulate a network of 200 Freelib nodes; that is, we deployed 200 actual clients on these 32 machines. We are working on a simulation of the Freelib protocol that will enable experiments with large network sizes. The results of those experiments will be published soon. Here, we report the results of our initial experimentation.

We model the user behavior by using random variables for items such as the search rate (the average number of search requests per unit time) and the download rate (the average number of downloads per search request). Currently, we use the uniform probability distribution for those random variables. The publications as well as the search queries for each node were generated according to the randomly assigned community of the node.

The results in Figure 2 show the recall as a function of the number of friends for different values of TTL. For this experiment, the number of nodes is 200, the number of Symphony long contacts is 4, the number of communities is 4, and the number of nodes in each community is almost (probabilistically) the same (50 each). We calculated recall as the percentage of the relevant items returned. TTL was varied from 1 to 5, and for each TTL value, the number of friends was varied from 0 to 5. Every point is the average recall over 200 search requests. The results show that the recall increases as the number of friends per node increases. For example, for a TTL of 4, using 4 friends per node gives an 87% increase in recall over the same network with no friends. The latter case is essentially a simple P2P network and the former a typical Freelib network.
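The TTL and number-of-friends parameters can also be read in terms of bandwidth. As a rough sketch, assume the flooding model from Section 2: every node forwards a request over all d of its forwarding links (Symphony contacts in global search, friend links in community search) and duplicates are not suppressed, so a search with hop limit TTL costs d + d^2 + ... + d^TTL messages. The function names below are illustrative only and are not part of the Freelib client.

```python
def flood_messages(degree, ttl):
    """Messages sent by one flooded search: degree + degree**2 + ... + degree**ttl."""
    return sum(degree ** hop for hop in range(1, ttl + 1))


def bandwidth_saving(candidate_messages, baseline_messages):
    """Fraction of messages saved by the candidate setting relative to the baseline."""
    return 1.0 - candidate_messages / baseline_messages
```

For instance, `flood_messages(6, 5)` evaluates to 9330 and `flood_messages(7, 2)` to 56, so under this model a small TTL over a slightly larger set of links is far cheaper than a large TTL over fewer links.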
We observed that a given level of recall can be achieved by different combinations of the parameters TTL and number of friends. This is interesting, since a good choice of values for those parameters can result in significant savings in bandwidth as well as a significant improvement in response time. Consider for example the following two settings: 1) TTL = 2, friends = 7 (not shown in Figure 2 but run as an experiment); and 2) TTL = 5, friends = 0. These two settings give roughly the same recall level of around 85%. However, a simple calculation shows that the first choice is much more efficient in terms of bandwidth usage. The number of messages per search request for the second setting is $6 + 6^2 + 6^3 + 6^4 + 6^5 = 9330$. For the first setting, on the other hand, it is $7 + 7^2 = 56$. Here the base 6 is consistent with each node forwarding over its Symphony contacts (4 long and 2 short), and the base 7 with forwarding over 7 friend links. The first setting thus gives more than 99% bandwidth savings over the second. These savings are due to the smaller TTL and to targeting the relevant nodes by using the friend network for forwarding the search requests. However, the Freelib client at each node needs to store more information (about the friend peers).

Figure 2: Recall vs. # Friends per node for different TTL values (network of 200 nodes). The recall values plotted in the figure are:

        # Friends
TTL     0         1         2         3         4         5
1       0.0311    0.05697   0.07298   0.08846   0.11702   0.13341
2       0.07368   0.17778   0.2566    0.38975   0.49269   0.60107
3       0.28968   0.51325   0.68787   0.63749   0.90036   0.77748
4       0.52952   0.87017   0.85521   0.87789   0.9888    0.97164
5       0.82262   0.90077   0.94385   0.95781   0.97897   0.98907

In addition to the savings in bandwidth, the first option has a smaller TTL. With TTL = 2, the first option takes 2x time units to complete the search, where x is the average time to complete a search request one hop away. In comparison, the second option takes 5x time units. Thus, the first setting gives a 60% savings in response time over the second.

5. Implementation

The initial Freelib client design was presented in [4]. We implemented a prototype of the Freelib client in Java. Figure 3 shows the Freelib main user interface. Figure 4 shows the publishing tool and Figure 5 shows the configuration tool. The main user interface gives the user access to the main services, e.g., search, publish, and access. In addition, it provides access to the configuration tool, which enables the user to adjust the various configuration parameters of the Freelib client.

The main user interface uses tabs to display the search results as well as the local collection. The first tab on the main interface displays the items in the local user’s collection. For every active search request, a tab is created to display the search results. Every row displays the main metadata elements for one item. For search result tabs, there is one extra column that displays information about the peer that has the item. The user can click the Download button to download the selected item on a search results tab. Upon downloading an item, an entry is appended to the outgoing access log. Upon serving an item from the local collection to some peer, an entry is appended to the incoming access log.

Figure 3. Freelib main user interface

Figure 4. Freelib publishing tool

Figure 5. Freelib configuration tool

6. Conclusion and future work

In this paper, we introduced Freelib, a novel peer-to-peer based digital library that has an overlay network which evolves with user interest and is based on simple access pattern analyses. The proposed P2P-based digital library is efficient in terms of searches and does not flood the network. The Freelib architecture is based on a purely distributed protocol, as opposed to broker-based P2P systems. The preliminary evaluation of the protocol and architecture, based on emulating a network of 200 real nodes, shows promising results. Freelib gives significant savings in network bandwidth. In addition, it gives significant improvements in response time as well as recall. We are currently working on simulating the Freelib protocol to verify the bandwidth savings and the performance gain for larger network sizes.

As part of future work, we plan to work on enhancing the community evolution techniques. We plan to investigate peer ranking techniques that enable faster discovery of the user’s community, especially when the user’s interest shifts between topics or covers more than one topic. This could be achieved by using ranking algorithms that use weights. Assigning more weight to more recent accesses ensures that the access topology reflects the most recent user interest. An interesting question is how to get ‘better’ friends, where better is defined to mean that the community diameter is minimized. In addition, we plan to enhance the Java implementation in line with feedback from real researchers who are using Freelib. Furthermore, we will validate (and/or improve) through simulation the choices we have made for the discovery protocol, the overlay structure, and the parameters of the network.

7. References

[1] J. Kleinberg, “The Small-World Phenomenon: An Algorithmic Perspective”, Proceedings of the 32nd ACM Symposium on Theory of Computing, Portland, OR, USA, May 21-23, 2000.
[2] D. J. Watts and S. H. Strogatz, “Collective Dynamics of ‘Small-World’ Networks”, Nature 393, 1998, pp. 440-442.
[3] G. S. Manku, M. Bawa, and P. Raghavan, “Symphony: Distributed Hashing in a Small World”, Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, 2003.
[4] A. Amrou, K. Maly, and M. Zubair, “Freelib: A Self-sustainable Digital Library for Education Community”, Proceedings of World Conference on Educational Multimedia, Hypermedia and Telecommunications (ED-MEDIA’04), 2004(1), pp. 15-20.
[5] The Napster home page: http://www.napster.com.
[6] I. Clarke, O. Sandberg, B. Wiley, and T. W.
Hong, “Freenet: A Distributed Anonymous Information Storage and Retrieval System”, in Designing Privacy Enhancing Technologies: Proceedings of International Workshop on Design Issues in Anonymity and Unobservability, Berkeley, CA, USA, July 2000, pp. 46-66.
[7] N. Daswani, H. Garcia-Molina, and B. Yang, “Open Problems in Data-sharing Peer-to-peer Systems”, Proceedings of the 9th International Conference on Database Theory (ICDT 2003), Siena, Italy, January 8-10, 2003.
[8] B. Yang and H. Garcia-Molina, “Improving Search in Peer-to-Peer Networks”, Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS’02), Vienna, Austria, July 2-5, 2002.
[9] The home page for Kazaa: http://www.kazaa.com/us/help/new_p2p.htm.
[10] M. Li, W. Lee, A. Sivasubramaniam, and D. Lee, “A Small World Overlay Network for Semantic Based Search in P2P Systems”, The Second WWW Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGRID’04), New York City, NY, May 2004, pp. 71-90.
[11] P. J. Leach and R. Salz, “UUIDs and GUIDs”, Internet draft, draft-leach-uuids-guids-01.txt, Aug 1998.