Index Design for Dynamic Personalized PageRank
Amit Pathak, Soumen Chakrabarti, Manish Gupta
IIT Bombay
soumen@cse.iitb.ac.in
Abstract- Personalized PageRank, related to random walks
with restarts and conductance in resistive networks, is a frequent search paradigm for graph-structured databases. While
efficient batch algorithms exist for static whole-graph PageRank,
interactive query-time personalized PageRank has proved more
challenging. Here we describe how to select and build indices
for a popular class of PageRank algorithms, so as to provide
real-time personalized PageRank and smoothly trade off between
index size, preprocessing time, and query speed. We achieve this
by developing a precise, yet efficiently estimated performance
model for personalized PageRank query execution. We use this
model in conjunction with a query workload in a cost-benefit type
index optimizer. On millions of queries from CITESEER and its
data graphs with 74-320 thousand nodes, our algorithm runs
50-400x faster than whole-graph PageRank, the gap growing
with graph size. Index size is 10-20% of a text index. Ranking
accuracy is above 94%.
I. INTRODUCTION

Personalized PageRank [1] (PPR), closely related to random walks with restarts and conductance in resistive electrical networks, is a dominant search and ranking paradigm for graph-structured data. Let the data graph G = (V, E) have edges (u, v) ∈ E associated with a conductance C(v, u), which is the probability of a "random surfer" walking from u to v, with Σ_v C(v, u) = 1. In each step the surfer walks with probability α (typically 0.75-0.8), and jumps with probability 1 − α, landing at node v with probability r(v). Here Σ_v r(v) = 1 and r is called the teleport vector. The personalized PageRank vector (PPV) for teleport r is

  p_r = α C p_r + (1 − α) r = (1 − α)(I − α C)⁻¹ r.   (1)

(A special case is where r(v) = 1 for a single node v and 0 for all others. We call this teleport δ_v, and p_{δ_v} is called the personalized PageRank vector for v, also denoted PPV_v.) In applications, r is specified at query time. Computing p_r over the whole graph in real time (using Power Iterations) with interactive response has been impractical thus far. In this paper, we present a new workload-driven index optimization algorithm that makes personalized PageRank real-time, independent of G.

Previous work and limitations: OBJECTRANK [2] supports keyword queries on graphs by conceptually making each word a node w in G, connecting w to all entities where it is mentioned. OBJECTRANK precomputes and stores PPV_w for all words w in the vocabulary. Preprocessing costs increase rapidly with increasing graph and vocabulary size (we estimated 22,000 CPU hours for 562,000 words). Berkhin [3] proposed the Bookmark Coloring Algorithm (BCA) to compute PPVs, and also showed how to use precomputed PPVs at selected hub nodes. Following Jeh and Widom [1], he proposed a generic hub selection strategy LPR: select "large-PageRank" nodes. Compared to our algorithm, LPR picks markedly inferior hub sets (§IV).

Contribution and organization: Our system [4] is shown in Figure 1. §II summarizes the query processing algorithm. §III models its running time as a function of the hub set. §IV presents how to use the model for hub set selection. §V deals with hub PPV precomputation.
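Eqn. (1) defines p_r as a fixed point, which the Power Iteration baseline computes directly. A minimal pure-Python sketch, assuming a hypothetical four-node graph with uniform conductance C(v, u) = 1/outdeg(u); the graph and α = 0.8 are illustrative, not from the paper:

```python
# Power iteration for eqn. (1): p = alpha*C*p + (1-alpha)*r, where the
# column-stochastic C splits a node's mass evenly over its out-edges.
ALPHA = 0.8
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # hypothetical graph
nodes = sorted(out_edges)

def personalized_pagerank(r, alpha=ALPHA, tol=1e-9):
    """Iterate p <- alpha*C*p + (1-alpha)*r until the L1 change < tol."""
    p = {v: r.get(v, 0.0) for v in nodes}
    while True:
        nxt = {v: (1 - alpha) * r.get(v, 0.0) for v in nodes}
        for u in nodes:
            share = alpha * p[u] / len(out_edges[u])
            for v in out_edges[u]:
                nxt[v] += share                      # mass flows u -> v
        if sum(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt

# PPV_3: teleport concentrated on node 3 (r = delta_3).
# Node 3 has no in-links, so its own score is exactly (1 - ALPHA) = 0.2.
ppv = personalized_pagerank({3: 1.0})
```

Because the iteration is a contraction with factor α, each step shrinks the error geometrically, which is also why whole-graph Power Iteration cost grows with |V| and |E|.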
978-1-4244-1837-4/08/$25.00 (© 2008 IEEE
[Figure: query logs feed training and test samples; word probability estimates train the query time model (§III), which drives hub selection (§IV) and hub PPV preparation (§V); the resulting hub set H and its PPV index are used by the query processor (§II) to produce ranked responses.]
Fig. 1. HUBRANK system overview with section numbers.
II. ASYNCHRONOUS WEIGHT-PUSHING ALGORITHM

Eqn. (1) can be expressed as the infinite series p_r = (1 − α) (Σ_{k≥0} α^k C^k) r. p_r can be evaluated using asynchronous propagation of weights on the edges of G (also called "weight pushing" or BCA, the "bookmark coloring algorithm"), shown in Figure 2.
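A minimal sketch of the push loop, assuming a hypothetical toy graph with uniform conductance C(v, u) = 1/outdeg(u). It departs from Figure 2 in one small way: it uses lazy deletion in the heap and stops when every residual falls below ε_push rather than when the L1 norm does:

```python
import heapq

ALPHA = 0.8
# Hypothetical toy graph; C(v, u) = 1/outdeg(u) splits mass evenly.
out_edges = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}

def bca(r, hubs=frozenset(), eps_push=1e-8):
    """Weight pushing in the spirit of Figure 2. Returns (N, B); the
    full answer is N + sum over h in hubs of B[h] * PPV_h."""
    q = dict(r)                         # residual weights
    N, B = {}, {}
    heap = [(-w, u) for u, w in q.items()]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        w = q.get(u, 0.0)
        if w <= eps_push:               # stale entry or negligible residual
            continue
        q[u] = 0.0
        if u in hubs:                   # residual is blocked at the hub
            B[u] = B.get(u, 0.0) + w
            continue
        N[u] = N.get(u, 0.0) + (1 - ALPHA) * w
        share = ALPHA * w / len(out_edges[u])
        for v in out_edges[u]:          # the "push" step
            q[v] = q.get(v, 0.0) + share
            heapq.heappush(heap, (-q[v], v))
    return N, B

# With no hubs, N alone approximates PPV_3 (residual error <= eps_push per node).
N, B = bca({3: 1.0})
```

When hubs are supplied, the residual arriving at each hub h is banked in B[h] and later expanded with the precomputed PPV_h, which is what makes the query-time work independent of the rest of G.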
III. MODELING PUSH ALGORITHM PERFORMANCE

We now describe the predictive model for the running time of BCA with hubs. This is challenging because the exact number of push steps is unpredictable and, owing to ε_push and H, BCA touches only a small active subgraph of G, whose size we call PushActive(H, δ_o, ε_push) (using teleport δ_o). Experimentally, we found that PushActive is a good predictor of BCA running time, which scales up superlinearly with PushActive. Unfortunately, PushActive cannot be known without running BCA. We approximate the active subgraph with nodes that can be reached from o without touching H, using paths whose conductance (product of edge conductances) is more than ε_push. The size of this subgraph is denoted PathActive(H, δ_o, ε_push), and it is also a good predictor of BCA running time.

1: q ← r, N_{H,r} ← 0, B_{H,r} ← 0
2: while ‖q‖₁ > ε_push do
3:   pick node u with largest q(u) > 0   {deleteMax}
4:   q ← q(u), q(u) ← 0
5:   if u ∈ H then
6:     B_{H,r}(u) ← B_{H,r}(u) + q   {residual q is blocked}
7:   else
8:     N_{H,r}(u) ← N_{H,r}(u) + (1 − α)q
9:     for each out-neighbor v of u do
10:      q(v) ← q(v) + α C(v, u) q   {"push" step}
11: return N_{H,r} + Σ_{h∈H} B_{H,r}(h) PPV_h

Fig. 2. Bookmark coloring algorithm with hubs.

ICDE 2008

Our cost-benefit optimizer needs to know PathActive(H, δ_o, ε_push) for all origins o, but it would be too expensive to run |V| graph expansions, one from each node. We adapt an algorithm by Cohen [5]. We preprocess G in O(|E| log |V| + |V| log² |V|) time and build certain lists of length O(log |V|) per node. Thereafter, PathActive(H, δ_o, ε_push) can be estimated in O(log log |V|) time, so we take only O(|V| log log |V|) time over all origins. We call the estimates CohenActive(H, δ_o, ε_push). Figure 3 shows that BCA time can be predicted (within a factor of two) via a regression from CohenActive(H, δ_o, ε_push), which we call Regress(·).
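The core of Cohen's estimator [5] can be sketched compactly: each node draws k exponential random ranks, the coordinate-wise minimum is propagated to a node from everything it can reach, and (k − 1)/Σ(minima) estimates the reachable-set size. The sketch below omits the conductance threshold and hub blocking that CohenActive adds, and the chain graph is illustrative:

```python
import random

def cohen_sizes(out_edges, k=300, seed=7):
    """Estimate |reach(u)| for every u via k min-rank propagations."""
    rng = random.Random(seed)
    nodes = list(out_edges)
    rank = {u: [rng.expovariate(1.0) for _ in range(k)] for u in nodes}
    changed = True                      # propagate minima to a fixpoint
    while changed:
        changed = False
        for u in nodes:
            for v in out_edges[u]:      # u reaches whatever v reaches
                ru, rv = rank[u], rank[v]
                for i in range(k):
                    if rv[i] < ru[i]:
                        ru[i] = rv[i]
                        changed = True
    # The min of n iid Exp(1) ranks is Exp(n); (k-1)/sum is unbiased for n.
    return {u: (k - 1) / sum(rank[u]) for u in nodes}

# 50-node chain: node 0 reaches all 50 nodes, node 49 only itself.
chain = {i: ([i + 1] if i < 49 else []) for i in range(50)}
est = cohen_sizes(chain)
```

The relative error shrinks as roughly 1/√k, so a few hundred ranks per node suffice for the factor-of-two accuracy the regression needs.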
[Figure: scatter of actual BCA running time (ms) against CohenActive for word and entity queries, with a least-squares linear fit.]
Fig. 3. From CohenActive(H, δ_o, ε_push) we can get a reasonable estimate of actual BCA time (shown in ms).

CohenActive(H, a·r) (and therefore PathActive and PushTime) is not linear wrt the activation a; it shows a steep, roughly linear growth followed by saturation to a fixed value as a rises above 0. In the linear growth regime,

  CohenActive(H, a r) ≈ a CohenActive(H, r)   (2)
  CohenActive(H, Σ_i r_i) ≈ Σ_i CohenActive(H, r_i)   (3)

We will drop H and ε_push when they are fixed and clear from context.

IV. HUB SET SELECTION

Using the model from §III, we develop a cost-benefit style [6] greedy technique to select H. The marginal benefit of including a node u into H is the work saved because u blocks the push (Figure 2). The cost is the space needed to store PPV_u. We describe the cost and benefit estimation for single-word queries; this can be extended to short multi-word queries. For each word w in the vocabulary W, we estimate a word probability f(w) from the query log.

A. Work saved for one query if u ∈ H

If u ∈ H in Figure 2, the residual q is "grounded" (i.e., not pushed to neighbors v), possibly saving a lot of pushes downstream. Consider a fixed query specified by teleport r. While executing the query, u is removed from the heap a number of times. Let the sequence of these removal instances be DelMax(r, u). For every delete operation p ∈ DelMax(r, u), u has a residual score q = q(p, u). We approximate the benefit of including node u in H for query r by

  WorkSaved(H, r, u) = Σ_{p∈DelMax(r,u)} PushTime(H, q(p, u) δ_u, ε_push).

(Note q(p, u) is a scalar; δ_u a vector.) By §III, this is

  Σ_{p∈DelMax(r,u)} Regress(CohenActive(H, q(p, u) δ_u, ε_push)).

Assuming Regress(·) is roughly linear (least-squares fit in Figure 3), the work saved is proportional to Regress[Σ_{p∈DelMax(r,u)} CohenActive(q(p, u) δ_u)]. But we will not know DelMax(r, u) without actually running the push algorithm. So we use the additivity property (3), which holds because q(p, u) is almost always tiny (10⁻¹¹ … 10⁻⁵), to get

  Regress[CohenActive((Σ_{p∈DelMax(r,u)} q(p, u)) δ_u)],

i.e., PushTime(H, N_{H,r}(u) δ_u, ε_push).

B. Work saved by u over query workload

The work saved by indexing PPV_u, averaged over the query workload distribution f(w), is

  Σ_w f(w) WorkSaved(H, δ_w, u) ≈ Σ_w f(w) PushTime(H, N_{H,δ_w}(u) δ_u, ε_push).   (4)

As N_{H,δ_w}(u) is very small for sufficiently large graphs, we use (2) to write

  (4) ≈ [Σ_w f(w) N_{H,δ_w}(u)] PushTime(H, δ_u),

the advantage being that Σ_w f(w) N_{H,δ_w}(u) = N_{H,f}(u), so a single PageRank computation with teleport r = f suffices. PushTime(H, δ_u) can therefore be quickly estimated for all u together using results from §III.

C. Clipped PPV_u storage space

Storing PPV_h for all h ∈ H takes excessive space. Aggressive clipping (removing elements below a threshold γ) hardly damages ranking accuracy while reducing index space [2]. But now our cost-benefit optimizer needs an estimate of the clipped PPV size, which we can know exactly only after computing PPV_h. We found that PPV elements are power-law distributed, and exploit this property, together with another application of Cohen's algorithm [5], to estimate the size of clipped PPVs accurately.

D. Hub inclusion policies

We tried three inclusion strategies. Chakrabarti [7] ordered hubs u one-shot by only N_{H,f}(u); this we call "naive one-shot" or NI. The greedy cost/benefit method proposed here is called "lookahead progressive" or LAP and shown in Figure 4. Batched updates to H account for reduction of merit of
1: Inputs: target |H|, clip threshold γ, batchSize
2: find CohenActive(∅, δ_u, ε_push) for all u ∈ V
3: thus estimate clipped sizes of all PPV_u   {cost}
4: H ← ∅
5: while H is not large enough do
6:   for all nodes u not in H yet, compute benefit(u) = Σ_w f(w) WorkSaved(H, δ_w, u)   {benefit}
7:   set merit(u) = benefit(u)/cost(u)
8:   greedily include the best batchSize nodes in H

Fig. 4. Lookahead progressive hub selection.

nodes owing to nodes included earlier. Third is the baseline "large PageRank" or LPR policy, which orders entity nodes in decreasing global PageRank order with uniform teleport r(v) = 1/|V|.
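The LAP loop of Figure 4 can be mimicked with stand-in cost and benefit numbers (in the real system these come from the CohenActive model and the clipped-size estimate); every value below is hypothetical:

```python
def lap_select(candidates, work_saved, cost, target, batch_size=2):
    """Greedy cost-benefit hub selection in the spirit of Figure 4."""
    H = set()
    while len(H) < target:
        merit = {u: work_saved(H, u) / cost[u]
                 for u in candidates if u not in H}
        best = sorted(merit, key=merit.get, reverse=True)[:batch_size]
        H.update(best)      # batched update; merits recomputed next round
    return H

# Hypothetical stand-ins: benefit of u halves once node u-1 is a hub,
# mimicking the reduction of merit owing to nodes included earlier.
cost = {u: 1.0 + (u % 3) for u in range(10)}

def work_saved(H, u):
    return (10.0 - u) * (0.5 if (u - 1) in H else 1.0)

H = lap_select(range(10), work_saved, cost, target=4)
```

Recomputing merits only once per batch, rather than after every single inclusion, is what keeps the lookahead affordable.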
E. Experiments

Data preparation: We obtained (thanks to Prof. C. Lee Giles) the CITESEER graph with over 1.1 million entity nodes, 3.7 million edges, and 709,000 words. For scaling studies, we took temporal snapshots up to years 1994, 1996, 1998, and 2000. Lucene text indices occupy 55, 139, 259, and 378 MB respectively. We also obtained 1.9 million CITESEER queries. The average query has 2.68 words. Word frequencies are distributed in a Zipfian manner. We typically trained HUBRANK on 100,000-query samples from the log and tested on disjoint 10,000-query samples. The Power Iteration baseline was run until the L1 difference between iterates fell below 10⁻⁶.

Hub inclusion policy comparison: HUBRANK beats Chakrabarti's system [7] wrt both index size (10×) and query speed (10×). Moreover, LAP with PPV has typical RAG, precision, and Kendall's accuracy (at rank 20) of 0.998, 0.95, and 0.94, higher than [7].

Effect of scaling G and H: By scaling |H| at a small fraction of |V|, HUBRANK query times can be held essentially fixed, independent of G. As the PPV index is on disk and clipped PPVs are quite small, scaling |H| with |V| is practical. Our index size (9-13 MB for the 1994 graph) is very small compared to a text index (55 MB), and can be built much faster than |H| Power Iterations [4].
[Figure: query time against |H|/|V| (0.1 to 0.25), one line per snapshot graph, with hub index sizes (MB) annotated; legend entries give snapshot year, |V|, |E|, and Power Iteration time, e.g. 1996, 177K, 756K, 31s and 1998, 320K, 1.4M, 113s.]
Fig. 6. Scaling studies. Each line is for a different graph; snapshot year, |V|, |E|, and Power Iteration times shown.
[Figure: per-PPV computation time against the number of PPVs computed (up to 40,000), comparing Power Iteration ("Iter") with the MPWH order for final |H| = 30k, 35k, 40k, and 45k.]
Fig. 7. While early PPVs use Power Iteration, later PPVs leverage them using BCA and run faster and faster.
This shows that scaling up H mildly with G is a practical proposition.

Fig. 5. Comparison of LAP, NI, LPR orders for the 1994 snapshot. Note: a text index needs 55 MB; the y-axis is log-scale.

V. BUILDING THE PPV INDEX

We finally turn to the question of populating the hub index efficiently. The baseline approach is to compute PPV_h for each h ∈ H independently using Power Iterations, which, in practice, takes time ∝ |H|(|E| + |V|).

If H is ordered judiciously, PPVs computed earlier can speed up the computation of other PPVs. This is because computing a PPV_h is also a "query" with r = δ_h. The ordering problem is NP-hard via set cover. Intuitively, we should first schedule nodes h which block many "heavy" paths from other (pending) hubs; i.e., we want the nodes that have maximum personalized PageRank wrt teleport to nodes in H ("MPWH").

Figure 7 shows that, once H reaches critical mass, more PPVs can be added to H very fast. Each line is for a different final |H| (but this hardly matters). The end effect is that the total PPV indexing time quickly levels out as |H| is increased.
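The reuse idea can be sketched directly: the index is built hub by hub, and each new PPV_h is a push computation that treats already-indexed hubs as blockers and then expands their stored vectors. The graph, α, and build order below are illustrative, and MPWH scheduling itself is not modeled:

```python
ALPHA = 0.8
out_edges = {0: [1], 1: [2], 2: [0, 3], 3: [0]}   # hypothetical graph

def push(r, hubs, eps=1e-10):
    """Plain weight pushing; returns harvested mass N and blocked mass B."""
    q, N, B = dict(r), {}, {}
    while True:
        u = max(q, key=q.get)
        w = q[u]
        if w <= eps:
            return N, B
        q[u] = 0.0
        if u in hubs:
            B[u] = B.get(u, 0.0) + w
            continue
        N[u] = N.get(u, 0.0) + (1 - ALPHA) * w
        for v in out_edges[u]:
            q[v] = q.get(v, 0.0) + ALPHA * w / len(out_edges[u])

def build_index(order):
    """Compute PPV_h hub by hub, reusing already-indexed PPVs."""
    index = {}
    for h in order:
        N, B = push({h: 1.0}, hubs=index.keys())
        ppv = dict(N)
        for hub, weight in B.items():   # expand blocked hub residuals
            for v, mass in index[hub].items():
                ppv[v] = ppv.get(v, 0.0) + weight * mass
        index[h] = ppv
    return index

index = build_index([0, 1, 2, 3])       # later PPVs touch fewer nodes
```

Each later build pushes only until it hits an indexed hub, which is why per-PPV times fall once the index reaches critical mass.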
REFERENCES
[1] G. Jeh and J. Widom, "Scaling personalized web search," in WWW
Conference, 2003, pp. 271-279. http://www2003.org/cdrom/papers/
refereed/pl85/html/pl85-jeh.html
[2] A. Balmin, V. Hristidis, and Y. Papakonstantinou, "Authority-based keyword queries in databases using ObjectRank," in VLDB Conference, Toronto, 2004.
[3] P. Berkhin, "Bookmark-coloring approach to personalized pagerank computing," Internet Mathematics, vol. 3, no. 1, Jan. 2007, preprint.
[4] "NETRANK and HUBRANK project," http://www.cse.iitb.ac.in/lsoumen/
doc/netrank/, 2006.
[5] E. Cohen, "Estimating the size of the transitive closure in linear time,"
in FOCS Conference, 1994, pp. 190-200.
[6] G. Graefe, "Query evaluation techniques for large databases," ACM Computing Surveys, vol. 25, no. 2, pp. 73-170, 1993.
[7] S. Chakrabarti, "Dynamic personalized PageRank in entity-relation
graphs," in WWW Conference, Banff, May 2007.