The Freelib Framework: Peer-to-Peer Search Enhanced by Evolving

Freelib: Peer-to-peer-based Digital Libraries*
A. Amrou
Computer Science Department
Old Dominion University
Norfolk, VA, USA
[email protected]
K. Maly
Computer Science Department
Old Dominion University
Norfolk, VA, USA
[email protected]
Abstract
In this paper, we propose a P2P-based digital
library that combines the advantages of P2P networks
and digital libraries. The key problems with P2P network
searches are low recall, long time to completion,
and high use of network bandwidth. We introduce
Freelib, a universal client that, once installed on a
user’s machine, connects itself to a P2P network and,
after a few searches, becomes aware of the community
the user belongs to. The architecture of Freelib is such
that a web of connections is created automatically among
people who share common interests, i.e., who perform
similar searches. We report in this paper on the first
prototype client, the design changes made in the
architecture as we actually built the client, and the
development of an emulator that can validate the working
of the client in a real P2P network.
1. Introduction
Digital libraries, the Web, and P2P networks all serve
to disseminate information. Digital libraries have the
advantage that the content is usually quality controlled
and searching can be done through structured
metadata. P2P networks have the advantage of
sustainability as they do not rely on any central
organization to provide resources such as servers and
maintenance staff. We combine the advantages of
digital libraries with P2P networks to propose Freelib,
a P2P-based digital library that addresses the
sustainability problem of traditional digital libraries,
which arises from the centralized framework used for
their deployment.
The two key concepts that we introduce in Freelib
to address these issues are an overlay (logical)
network whose topology has the ‘small world’
property and a virtual grouping of nodes into
communities of common interest. The small world
property [1, 2] refers to networks in which, from each
node, any other node can be reached in a ‘small’
(typically 7) number of steps. We define a community
of common interest as one whose nodes mostly access
objects held by other members of the community. Clearly,
membership in a community is not static; it changes
with the evolving interest of the user. The Freelib
architecture brings users who share similar interests
closer together, in the sense that searches will reach
all nodes containing ‘interesting’ objects in a small
number of hops. In other words, shared items will be
close to their points of need. This significantly
enhances the user experience, as users get relevant
results faster. Community evolution in Freelib is thus
based on access pattern analysis, where access means
downloading an object that was found through a search.
In addition, it is done in a distributed way to avoid
introducing centralized components into the architecture.
* This work is supported in part by NSF grant 0333547
M. Zubair
Computer Science Department
Old Dominion University
Norfolk, VA, USA
[email protected]
In order to evolve the users into communities of
common interest, we need to: 1) capture the individual
user’s interest, 2) identify those peers that share the
same interest, and 3) build an overlay network such that
peers sharing common interest are connected to each
other or at least close to each other on the network. To
identify peers that share a common interest with a user,
every node maintains a ranked list of known peers (we
will also call them friends). Peers are ranked
according to the local user’s interest, i.e., by how
often the node accessed a peer or was accessed by it.
We call the resulting overlay network the access
network or the friend network. Small world networks
have been studied in great depth, and a number of
protocols exist that maintain a network such that it has
the small (call it k) world property. Symphony [3] is
one such protocol that is particularly well suited to
our domain, and we will adapt it to Freelib so that we
can bound any search in Freelib by k hops.
The original concept of Freelib was described in [4]
which also gave a brief description of the architecture
and a potential implementation. In this paper, we report
on the changes in the architecture, the client
implementation, the discovery protocol, and the
preliminary experimentation using an emulator we
built for this purpose.
The remainder of this paper is organized as
follows: in section 2 we overview the background and
related work. In section 3, we present the Freelib
architecture as it has evolved from the original one [4].
We report on the experiments in section 4. In section 5
we present a brief description of our prototype
implementation. We conclude and describe our future
work in section 6.
2. Related work
Several peer-to-peer search algorithms and
techniques have been devised by researchers and
deployed in real networks. Napster [5] is a peer-to-peer
system that uses a centralized index for searching,
while file downloads are done directly between peers.
Freenet [6] uses Depth-First Search (DFS), which
helps achieve anonymity. Other systems, like Gnutella,
use Breadth-First Search (BFS). These search
algorithms suffer from several drawbacks and challenges
[7]. DFS is suitable for retrieving items given
their identifier; if used for general keyword search, it
performs poorly. BFS is more suitable for
general keyword search. However, it is very
bandwidth inefficient, as the number of messages
grows exponentially with the TTL and the average node
degree (# messages = $d^{TTL}$, where d is the average
degree of the nodes and TTL is the hop limit). There
has been considerable effort by researchers to address
these peer-to-peer search issues. Several techniques
for enhancing search performance were presented
in [8]. These techniques include: 1) Iterative
Deepening: sending searches with successively
increasing TTL until the query is satisfied, 2) Directed
BFS (Breadth-First Search): sending queries only to peers
that provided good results recently, and 3) Local indices:
building local indices at each node that index the
content on the peers within r hops.
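As a rough illustration of iterative deepening, the sketch below floods a toy overlay with successively larger TTL values and stops as soon as enough matching peers are found. It is not the algorithm of [8] in detail: the graph, the `matches` predicate, the TTL schedule, and the message accounting (one message per forwarded copy, including duplicates) are all assumptions made for the example.

```python
from collections import deque

def flood(graph, start, ttl, matches):
    """Breadth-first flood limited to `ttl` hops; returns (hits, messages)."""
    seen = {start}
    frontier = deque([(start, 0)])
    hits, messages = set(), 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == ttl:
            continue                      # hop limit reached, do not forward
        for peer in graph[node]:
            messages += 1                 # every forwarded copy costs one message
            if peer in seen:
                continue                  # duplicate copy, dropped by the receiver
            seen.add(peer)
            if matches(peer):
                hits.add(peer)
            frontier.append((peer, depth + 1))
    return hits, messages

def iterative_deepening(graph, start, matches, ttl_schedule=(1, 2, 3), wanted=1):
    """Re-issue the flood with growing TTL until `wanted` hits are found."""
    total_messages = 0
    hits = set()
    for ttl in ttl_schedule:
        hits, messages = flood(graph, start, ttl, matches)
        total_messages += messages
        if len(hits) >= wanted:
            return hits, ttl, total_messages
    return hits, ttl_schedule[-1], total_messages

# Hypothetical five-node overlay; node "d" holds the item being searched for.
EXAMPLE_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

if __name__ == "__main__":
    print(iterative_deepening(EXAMPLE_GRAPH, "a", lambda peer: peer == "d"))
```

The search stops at the smallest TTL that satisfies the query, so nearby items are found cheaply, at the price of repeating the shallow floods when the item is far away.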
Some peer-to-peer systems utilize caching,
replication, and the concept of super-peers to enhance
system performance. Kazaa [9] is an example of a
peer-to-peer network that uses nodes with high
network bandwidth as super-nodes. These super-nodes
cache content from other nodes and handle most
of the request forwarding and traffic. Other
peer-to-peer protocols try to enhance the selectivity of
the search, that is, to target peers that have relevant
content. SSW [10], for example, clusters member
nodes based on the similarity of their content such that
nodes containing similar material are connected
together. To measure the similarity of documents,
the documents are represented in some data structure
such as a keyword vector, and a clustering algorithm
is run on them. This approach is, however, complicated
and does not evolve quickly when a user’s interest
changes between topics. Such approaches are
data-centered.
3. Freelib architecture
The main objectives of the Freelib architecture are to:
(a) utilize access patterns to improve searches, (b)
ensure that the network is connected and that any node
can reach any other node in a small number of hops,
and (c) keep the network easy to maintain in the
presence of frequent leaves and joins. The Freelib
architecture currently has two overlay networks, the
access network and the support network, to achieve
objectives (a) and (b). In our initial design we had a
third overlay network, the migration network, to make
the process of joining efficient. However, we found that
the overhead of maintaining this network outweighs its
benefits. The support network is based on the
Symphony protocol [3]. It maintains the connectivity of
the network and enables new nodes to perform searches
and discover their friends (nodes with common
interests). The access network is based on the user’s
access pattern, and it brings nodes sharing similar
interests close to each other on the overlay network.
Figure 1: Freelib network architecture (support network with short contacts and long contacts; access network with access (friend) links between nodes of evolving communities)
To address objective (c), we need a mechanism
for nodes, when rejoining, to discover the latest
information (such as the IP address) of their friends.
When a node rejoins the network, the information it
keeps about its friends might already be outdated: those
nodes might have changed their IP addresses (e.g., if
using a dialup connection), and nodes might change their
locations on the Symphony ring when they rejoin. To
be able to contact its friends, the node needs a
discovery mechanism to find the most recent information
about its friend nodes. We explored three approaches
to address this issue: flooding the network, a
distributed hash table, and link discovery.
We now discuss in some detail the two overlay
networks and the discovery protocol. Figure 1 shows
the Freelib network architecture, which is similar to
the one reported in [4]. In the next section we briefly
review the original design, and in the sections
thereafter we report on the new aspects that we
developed as we built the first client.
3.1. Summary of previous work
3.1.1. Support Network. As mentioned earlier, the
support network enables new nodes to search and
discover their friends. In addition, it provides means
for building efficient discovery protocols for locating
and discovering friends upon rejoining as will be
discussed below. Furthermore, the support network
maintains a small-world property [3] and also ensures
the network is always connected. Small-world
networks have the desirable property that the expected
path length between two arbitrary nodes is very small
compared to the number of participating nodes [1, 2].
To obtain and maintain this property we make use of
the Symphony [3] protocol originally developed to
implement distributed hash tables efficiently. The
Symphony protocol arranges nodes uniformly on a
unit ring. Each node maintains two types of contacts:
short contacts and long contacts. Short contacts are
neighbors on the ring. Long contacts are nodes far
away on the ring. The selection of the locations of
long contacts is the key factor that affects the average
path length within the network. Symphony selects
these locations by sampling from a family of harmonic
probability distributions, which in turn gives an expected
path length on the order of $(\log^2 n)/k$, where n is
the number of nodes and k is the number of long
contacts per node.
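To make the long-contact selection concrete, the following is a minimal sketch of drawing ring distances from a harmonic distribution. It assumes, following our reading of [3], the density p(x) = 1/(x ln n) on [1/n, 1], which can be sampled by inverse transform as x = n^(u-1) for uniform u. The function names and the way target positions are derived from the node's own position are illustrative assumptions, not part of the Freelib client.

```python
import math
import random

def sample_long_distance(n, rng=random):
    """Draw a ring distance x in [1/n, 1) with density 1 / (x ln n).

    Inverse-transform sampling: if u is uniform on [0, 1), then
    x = exp(ln(n) * (u - 1)) = n ** (u - 1) follows the harmonic density.
    """
    return math.exp(math.log(n) * (rng.random() - 1.0))

def long_contact_positions(own_position, n, k, rng=random):
    """Ring positions (on the unit ring) at which to look for k long contacts."""
    return [(own_position + sample_long_distance(n, rng)) % 1.0 for _ in range(k)]

if __name__ == "__main__":
    rng = random.Random(42)               # fixed seed only to make runs repeatable
    print(long_contact_positions(own_position=0.25, n=200, k=4, rng=rng))
```

Because the density falls off as 1/x, most long contacts land relatively close on the ring while a few reach far away, which is what keeps expected routing paths short.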
3.1.2. Access Network. The Freelib protocol captures
the user’s interest transparently as the user searches
and accesses documents from other nodes. The access
network evolves based on the user interest such that
users sharing similar interest are connected together or
at least close to each other on the access network. The
access pattern analysis is done locally by every node
in a distributed way to avoid introducing centralized
components into the Freelib architecture. Peers are
ranked by every node according to the local user’s
interest in those peers. The topmost peers on the
ranked list are those in which the local user shows the
most interest, that is, from which the user downloads
the most documents (and vice versa). Every node
establishes friend relationships with the first f peers on
its ranked list, where f is a parameter of the system.
These friend links form the access network. The access
network evolves as the user’s interest changes. Once
nodes have enough friends, search requests are forwarded
using the friend links only. Most of the relevant results
are expected to be returned within very few hops on the
access network, since those few hops cover the nodes
whose interests are closest to those of the requesting node.
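The ranking step described above can be sketched as follows. This is an illustrative assumption about how the counts might be combined, not the prototype's code: outgoing accesses (downloads from a peer) and incoming accesses (downloads by a peer) are weighted equally, ties are broken by peer identifier, and the names `rank_peers` and `select_friends` are hypothetical.

```python
from collections import Counter

def rank_peers(outgoing, incoming):
    """Rank peers by total number of accesses in either direction.

    `outgoing` lists the peer IDs the local user downloaded from;
    `incoming` lists the peer IDs that downloaded from the local collection.
    Equal weighting of both directions is an assumption of this sketch.
    """
    score = Counter(outgoing)
    score.update(incoming)
    # Highest score first; ties broken by peer ID for a deterministic order.
    return sorted(score, key=lambda peer: (-score[peer], peer))

def select_friends(outgoing, incoming, f):
    """Return the first f peers on the ranked list (f is a system parameter)."""
    return rank_peers(outgoing, incoming)[:f]

if __name__ == "__main__":
    outgoing_log = ["p1", "p2", "p1", "p3"]
    incoming_log = ["p2", "p2", "p4"]
    print(select_friends(outgoing_log, incoming_log, f=2))
```

Recomputing the list from the access logs whenever they change lets the friend set follow the user's interest without any central component.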
3.1.3. Ranking. Capturing a user’s interest can be done
in many ways. User feedback is one such way;
however, it requires explicit interaction with the user.
The implicit and transparent alternative we have selected
is for every node to monitor user accesses to identify
those peers in which the user is most interested and
those which are interested in the user’s collection. We
use the terms user and node interchangeably here,
and it should be remembered that not every node
contains a collection; that is, some users install the
Freelib client only to search and have no intention of
publishing documents. The main principle is that
two nodes that access each other quite often
most likely have similar interests and will most likely
access each other in the future. This principle is used in
other fields such as computer architecture, e.g., in RISC
instruction sets and in cache and memory hierarchy
design. It is important to note that by accesses we mean
downloads of items rather than views of the hits returned
for a user’s query. Downloading items after viewing
their descriptions and metadata is more indicative of
user interest in those items than just looking at
metadata.
3.2. An improved architecture
3.2.1. Friends and Communities. In the original design
we had a migration protocol to have friends literally
move towards each other on the ring network. The purpose
of the migration layer was to facilitate the discovery of
the node’s community. As we tried to implement the
protocol, it turned out to be quite complex. We needed
specific steps to discover when a node should migrate
to a different location on the ring. The ring structure
also induces a partial ordering of the communities that
is not necessarily linked to closeness of the
communities in terms of content. The ring structure
would have had a definite advantage in its ease of
describing communities (how many members, who are
the members) and implementing exhaustive searches
within a community efficiently. We believe these
advantages do not outweigh the complexity of the
protocol and the overhead of every operation to
maintain the communities.
3.2.2. Search modes. When a user joins for the first
time (a new node), a global search mode is used. In
this search mode, search requests are forwarded using
the support network contacts. As the user starts to
access other peers, the local node will start
establishing friend links to peers on the top of the
ranked list. Once the node detects that enough friend
links have been established, it switches to community
search mode. In this mode, search requests are
forwarded using only the friend links. If the user is not
satisfied with the results, she can change the client
configuration to force a global search or a limited
global search. This allows the search request to reach
nodes outside the community. This feature is
especially helpful when the user submits a search for
some topics that are not relevant to her community.
The limited global search is a combination of these
two modes. The search uses both the friend links and the
contact links (short and long); however, we limit the
number of hops along each path not to the diameter
of the network (the small world diameter) but to a
pre-set parameter.
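The three search modes can be summarized in a small sketch. The threshold for "enough friend links" (`min_friends`), the hop limits, and the function names are hypothetical; in particular, the hop limit applied in community mode is not stated in the text and is simply assumed here to be the network-wide limit.

```python
from enum import Enum

class SearchMode(Enum):
    GLOBAL = "global"                  # forward on support-network contacts
    COMMUNITY = "community"            # forward on friend links only
    LIMITED_GLOBAL = "limited_global"  # both link types, pre-set hop limit

def choose_mode(num_friends, min_friends, forced=None):
    """Pick the search mode; `forced` models the user's configuration override."""
    if forced is not None:
        return forced
    if num_friends >= min_friends:
        return SearchMode.COMMUNITY
    return SearchMode.GLOBAL

def forwarding_plan(mode, friends, contacts, network_ttl, preset_ttl):
    """Return (links to forward the request on, hop limit) for a mode."""
    if mode is SearchMode.GLOBAL:
        return list(contacts), network_ttl
    if mode is SearchMode.COMMUNITY:
        return list(friends), network_ttl   # assumed hop limit, see lead-in
    # Limited global: union of friend and contact links, without duplicates.
    merged = list(dict.fromkeys(list(friends) + list(contacts)))
    return merged, preset_ttl

if __name__ == "__main__":
    mode = choose_mode(num_friends=1, min_friends=3)
    print(mode, forwarding_plan(mode, ["f1"], ["c1", "c2"], network_ttl=7, preset_ttl=2))
```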
3.2.3. Discovery protocols. Peer-to-peer networks are
usually characterized by frequent joins and leaves. In
addition, peers are typically short-lived. Thus, when a
node rejoins, its information about friend nodes (most
importantly the host IP address and port number)
might already have become obsolete. A discovery
protocol that enables nodes to locate their old friends
is needed. The main operation the discovery protocol
performs is to return a node’s current information
given the node’s ID (assuming the node is alive). The
discovery protocol therefore requires unique IDs
for nodes. We use the Universally Unique Identifiers
(UUIDs) defined in [11] for that purpose. In addition
to their use in the discovery protocols, unique
identifiers are employed locally by every node as hash
keys for ranking and keeping peers’ information. We
propose three different discovery protocols: flooding,
DHT discovery, and link discovery. We implemented
both the flooding and the DHT discovery protocols in
our prototype. However, we chose to enable DHT
discovery, as it consumes much less network bandwidth
than flooding and has a lower worst case complexity.
We now give some details on the DHT discovery
implementation.
Recall that, in Symphony, nodes are assigned a random
position on the unit ring and each node is said to
manage the interval up to the next node. If the hash
function chosen is good, then each node should manage
the entry of just one other node on average. The proposed
discovery DHT maps node UUIDs to node
information. Whenever a node joins the network, it
inserts an entry for itself in the DHT. The DHT entry
for a Freelib node is maintained by the peer that owns
the ring location generated by applying a universal hash
function to the node’s UUID. A node owns (manages)
all the locations between its own location and the
location of the next node on the Symphony ring.
Whenever a node wants to discover the information of a
peer, it just performs a DHT lookup: it directs its
discovery request to the location generated by applying
the universal hash function to the peer’s UUID. The
DHT discovery protocol is more efficient than flooding
in terms of bandwidth usage. Every discovery request
needs $(\log^2 n)/k$ hops, and at every hop the request
is forwarded to only one peer. Thus, the number of
messages per discovery request is $(\log^2 n)/k$ in the
worst case.
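As a rough illustration of DHT-based discovery on a Symphony-style ring, the sketch below models only the ownership rule (a node manages the ring locations from its own position up to the next node) and the publish/lookup operations. It is a single-process model under stated assumptions: SHA-1 stands in for the universal hash function, routing hops and the handoff of entries when nodes join or leave are omitted, and all class and method names are hypothetical.

```python
import bisect
import hashlib
import uuid

def ring_location(node_uuid):
    """Map a UUID to a location in [0, 1) on the unit ring (SHA-1 as a stand-in)."""
    digest = hashlib.sha1(node_uuid.bytes).digest()
    return int.from_bytes(digest[:6], "big") / 2**48

class DiscoveryRing:
    """Toy model of which ring node stores which (UUID -> address) entry."""

    def __init__(self):
        self.positions = []   # sorted ring positions of the participating nodes
        self.tables = {}      # node position -> {uuid: (ip, port)}

    def add_node(self, position):
        bisect.insort(self.positions, position)
        self.tables.setdefault(position, {})

    def _owner(self, location):
        # The owner is the node at or immediately before the location; an index
        # of -1 wraps around to the last node, which owns the interval that
        # crosses the 1.0 -> 0.0 boundary.
        index = bisect.bisect_right(self.positions, location) - 1
        return self.positions[index]

    def publish(self, node_uuid, address):
        """Called on (re)join: store the node's current (ip, port) at its owner."""
        self.tables[self._owner(ring_location(node_uuid))][node_uuid] = address

    def discover(self, node_uuid):
        """Look up a friend's current (ip, port); None if no entry exists."""
        return self.tables[self._owner(ring_location(node_uuid))].get(node_uuid)

if __name__ == "__main__":
    ring = DiscoveryRing()
    for position in (0.1, 0.4, 0.7):
        ring.add_node(position)
    friend = uuid.uuid4()
    ring.publish(friend, ("192.0.2.10", 4000))
    print(ring.discover(friend))
```

Because the storing node is determined by the friend's UUID alone, a rejoining node needs only the UUIDs in its ranked list to find its friends' current addresses.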
The DHT discovery protocol is implemented using a
Distributed Hash Table (DHT) on top of the support
layer. Specifically, we build a distributed hash table of
all UUIDs and store its entries at the nodes of the
Symphony ring. Each node stores the (IP, port)
information of all nodes whose UUIDs hash into the
interval between the node and its successor.
4. Experiments
We have completed our initial experimentation with the
prototype. We utilized a cluster of 32 machines to
emulate a network of 200 Freelib nodes; that is, we
deployed 200 actual clients on these 32 machines. We
are working on a simulation of the Freelib protocol that
will enable experiments with larger network sizes. The
results of those experiments will be published soon.
Here, we report results of our initial experimentation.
We model the user behavior by using random variables
for quantities such as the search rate (the average number
of search requests per unit time) and the download rate
(the average number of downloads per search request).
Currently, we are using the uniform probability
distribution for those random variables. The
publications as well as the search queries for each node
were generated according to the randomly assigned
community of the node.
The results in Figure 2 show the recall as a function
of the number of friends for different values of TTL. For
this experiment, the number of nodes is 200, the
number of Symphony long contacts is 4, the number of
communities is 4, and the number of nodes in each
community is (probabilistically) almost the same (about
50 each). We calculated recall as the percentage of the
relevant items returned. TTL was varied from 1 to 5,
and for each TTL value the number of friends was
varied from 0 to 5. Every point is the average recall
over 200 search requests. The results show that the recall
generally increases as the number of friends per node
increases. For example, for a TTL of 4, using 4 friends
per node gives an 87% increase in recall over the same
network with no friends. The latter case is essentially a
simple P2P network and the former a typical Freelib
network. We observed that a given level of recall
can be achieved by different combinations of the
parameters TTL and number of friends. This is
interesting, since a good choice of values for those
parameters can result in significant savings in
bandwidth as well as a significant improvement in
response time. Consider, for example, the following
two settings: 1) TTL = 2, friends = 7 (not shown in
Figure 2 but run as an experiment); and 2) TTL = 5,
friends = 0. These two settings give roughly the same
recall level of around 85%. However, with simple
calculations, we can show that the first choice is much
more efficient in terms of bandwidth usage. The
number of messages per search request for the second
setting is $6 + 6^2 + 6^3 + 6^4 + 6^5 = 9330$. On the other
hand, for the first setting it is $7 + 7^2 = 56$. The first
setting gives more than 99% bandwidth savings
over the second. This bandwidth saving is due to the
smaller TTL and to targeting the relevant nodes by using
the friend network for forwarding the search requests.
However, the Freelib client at each node needs to store
more information (about the friend peers).
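The message counts above follow from summing the fan-out over every hop. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the fan-out values 6 and 7 are simply the figures used in the text (the value 6 presumably corresponds to 4 long contacts plus 2 short contacts, which is our assumption).

```python
def flood_messages(fan_out, ttl):
    """Messages for one search if every node forwards to `fan_out` links per hop."""
    return sum(fan_out ** hop for hop in range(1, ttl + 1))

# Setting 2 in the text: TTL = 5, no friends, fan-out 6.
baseline = flood_messages(fan_out=6, ttl=5)
# Setting 1 in the text: TTL = 2, 7 friends, fan-out 7.
with_friends = flood_messages(fan_out=7, ttl=2)
savings = 1 - with_friends / baseline

if __name__ == "__main__":
    print(baseline, with_friends, f"{savings:.1%}")
```

This counts an upper bound in which no duplicate suppression takes place, which matches the simple calculation in the text.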
Figure 2: Recall vs. # Friends per node for different
TTL values (network of 200 nodes). Plotted values:
         # Friends
TTL      0         1         2         3         4         5
1        0.0311    0.05697   0.07298   0.08846   0.11702   0.13341
2        0.07368   0.17778   0.2566    0.38975   0.49269   0.60107
3        0.28968   0.51325   0.68787   0.63749   0.90036   0.77748
4        0.52952   0.87017   0.85521   0.87789   0.9888    0.97164
5        0.82262   0.90077   0.94385   0.95781   0.97897   0.98907
In addition to the savings in bandwidth, the first
option has a smaller TTL. The first option, with TTL = 2,
takes 2 * x units to complete the search, where x units
is the average time to complete a search request one
hop away. In comparison, it takes 5 * x units of time to
complete the search for the second option. Thus, the
first setting gives a 60% savings in the response time
over the second.
5. Implementation
Figure 3. Freelib main user interface
Figure 4. Freelib publishing tool
The initial Freelib client design was presented in [4].
We implemented a prototype of the Freelib client in
Java. Figure 3 shows the Freelib main user interface.
Figure 4 shows the Publishing tool and Figure 5 shows
the Configuration tool. The main user interface
provides the user access to the main services, e.g.,
search, publish, and access. In addition, it provides
access to the configuration tool, which enables the user
to adjust the various configuration parameters of the
Freelib client. The main user interface uses tabs to
display the search results as well as the local collection.
The first tab on the main interface displays the items in
the local user’s collection. For every active search
request, a tab is created to display the search results.
Every row displays the main metadata elements for
one item. For search result tabs, there is one extra
column that displays information about the peer which
has the item. The user can click the Download button
to download the selected item on a search results tab.
Upon downloading an item, an entry is appended to
the outgoing access log. Upon serving an item from
the local collection to some peer, an entry is appended
to the incoming access log.
Figure 5. Freelib configuration tool
6. Conclusion and future work
In this paper, we introduced Freelib, a novel
peer-to-peer-based digital library that has an overlay
network which evolves with user interest and is based
on simple access pattern analysis. The proposed
P2P-based digital library is efficient in terms of searches
and does not flood the network. The Freelib
architecture is based on a purely distributed protocol, as
opposed to broker-based P2P systems. The preliminary
evaluation of the protocol and architecture, based on
emulating a network of 200 real nodes, shows
promising results. Freelib gives significant savings in
network bandwidth. In addition, it gives a significant
improvement in response times as well as recall. We
are currently working on simulating the Freelib
protocol to verify the bandwidth savings and the
performance gain for larger network sizes. As part of
future work, we plan to work on enhancing the
community evolution techniques. We plan to
investigate peer ranking techniques that enable faster
discovery of the user’s community, especially when the
user’s interest shifts between topics or covers more than
one topic. This could be achieved by using ranking
algorithms that use weights. Assigning more weight to
more recent accesses ensures that the access topology
reflects the most recent user interest. An interesting
question is how to get ‘better’ friends, where better is
defined to mean that the community diameter is
minimized. In addition, we plan to work on enhancing
the Java implementation in line with feedback
from real researchers who are using Freelib.
Furthermore, we will validate (and/or improve) through
simulation the choices we have made for the discovery
protocol, overlay structure, and parameters of the
network.
7. References
[1] J. Kleinberg, “The Small-World Phenomenon: An
Algorithmic Perspective”, Proceedings of the 32nd ACM
Symposium on Theory of Computing, Portland, OR, USA, May
21-23, 2000.
[2] D. J. Watts, and S. H. Strogatz, “Collective Dynamics of
'Small-World' Networks”, Nature 393, 1998, pp. 440-442.
[3] G. S. Manku, M. Bawa, and P. Raghavan, “Symphony:
Distributed Hashing in a Small World”, Proceedings of the
4th USENIX Symposium on Internet Technologies and
Systems, 2003.
[4] A. Amrou, K. Maly, M. Zubair, “Freelib: A
Self-sustainable Digital Library for Education Community”,
Proceedings of World Conference on Educational
Multimedia, Hypermedia and Telecommunications
(ED-MEDIA’04), 2004(1), pp. 15-20.
[5] The Napster home page: http://www.napster.com.
[6] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong,
“Freenet: A Distributed Anonymous Information Storage and
Retrieval System”, In Designing Privacy Enhancing
Technologies: Proceedings of International Workshop on
Design Issues in Anonymity and Unobservability, Berkeley,
CA, USA, July 2000, pp. 46-66.
[7] N. Daswani, H. Garcia-Molina, B. Yang, “Open Problems
in Data-sharing Peer-to-peer Systems”, Proceedings of the 9th
International Conference on Database Theory (ICDT 2003),
Siena, Italy, 8-10 January 2003.
[8] B. Yang and H. Garcia-Molina, “Improving Search in
Peer-to-Peer Networks”, Proceedings of the 22nd International
Conference on Distributed Computing Systems (ICDCS’02),
Vienna, Austria, July 2-5, 2002.
[9] The home page for Kazaa:
http://www.kazaa.com/us/help/new_p2p.htm.
[10] M. Li, W. Lee, A. Sivasubramaniam, and D. Lee, “A
Small World Overlay Network for Semantic Based Search in
P2P Systems”, The Second WWW Workshop on Semantics in
Peer-to-Peer and Grid Computing (SemPGRID'04), New
York City, NY, May 2004, pp. 71-90.
[11] P. J. Leach, and R. Salz, “UUIDs and GUIDs”, Internet
draft, draft-leach-uuids-guids-01.txt, Aug 1998.