IST 516: Web and Information Retrieval Fall 2013 / Dongwon Lee

Homework #2: Information Retrieval
(SOLUTION)
Max: 110 Points (+ 30 Bonus Points)
__________________________________________________________________________
1. Exercise 6.19
In SMART notation, this question uses lnc.ltn as the weighting scheme (lnc for the document, ltn for the query).
The final score between the query and the document is ≈ 3.11907.
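The value can be recomputed in a short sketch. The term statistics below are the ones given in the exercise (N = 10,000,000 documents; df(digital) = 10,000, df(video) = 100,000, df(cameras) = 50,000; query “digital cameras”, document “digital cameras and video cameras”, stop words removed):

```python
import math

# Statistics from the exercise: N documents and per-term document frequencies.
N = 10_000_000
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
query_tf = {"digital": 1, "cameras": 1}
doc_tf = {"digital": 1, "video": 1, "cameras": 2}

# Query side (ltn): logarithmic tf, idf weighting, no normalization.
q_weight = {t: (1 + math.log10(tf)) * math.log10(N / df[t])
            for t, tf in query_tf.items()}

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
d_raw = {t: 1 + math.log10(tf) for t, tf in doc_tf.items()}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d_weight = {t: w / norm for t, w in d_raw.items()}

score = sum(q_weight[t] * d_weight.get(t, 0.0) for t in q_weight)
print(round(score, 5))  # → 3.11907
```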
2. Exercise 8.8 (two sub-questions of a and c only)
a.
System 1: MAP is (1 + 2/3 + 3/9 + 4/10) / 4 = 0.6
System 2: MAP is (1/2 + 2/5 + 3/6 + 4/7) / 4 ≈ 0.49
c.
System 1: R-precision is 2/4 = .5
System 2: R-precision is 1/4 = .25
With respect to both MAP and R-precision, System 1 does a better job than System 2
does.
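The precision terms above correspond to relevant documents at ranks 1, 3, 9, 10 (System 1) and 2, 5, 6, 7 (System 2), with R = 4 relevant documents in total. A short sketch recomputing both measures from those ranks:

```python
def average_precision(relevant_ranks, num_relevant):
    # Precision at each rank where a relevant document is retrieved,
    # averaged over all relevant documents.
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / num_relevant

def r_precision(relevant_ranks, num_relevant):
    # Precision within the top-R results, where R = total relevant documents.
    return sum(1 for r in relevant_ranks if r <= num_relevant) / num_relevant

sys1 = [1, 3, 9, 10]  # ranks of the 4 relevant documents, System 1
sys2 = [2, 5, 6, 7]   # ranks of the 4 relevant documents, System 2
print(round(average_precision(sys1, 4), 4))  # 0.6
print(round(average_precision(sys2, 4), 4))  # 0.4929
print(r_precision(sys1, 4), r_precision(sys2, 4))  # 0.5 0.25
```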
3. Exercise 9.1 (Students can solve EITHER question below)
To make Qm = Q0, by the Rocchio formula Qm = α·Q0 + β·μ(Dr) − γ·μ(Dnr), we should have the following:

β·μ(Dr) − γ·μ(Dnr) = (1 − α)·Q0

where μ(Dr) and μ(Dnr) are the centroids of the relevant and non-relevant documents. That is, the weighted difference between the two centroids must point in the same direction as Q0 and have length (1 − α)|Q0|. This also holds when the centroid of the relevant documents equals that of the irrelevant documents and β = γ, provided α = 1 (both sides are then zero). Finally, if the query Q0 was already at the centroid of the relevant documents, then Qm = Q0 whenever γ = 0 and α + β = 1.
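The condition can be checked numerically. A minimal sketch with hypothetical document vectors, using the case where Q0 sits at the relevant centroid, γ = 0, and α + β = 1:

```python
import numpy as np

def rocchio(q0, rel, nonrel, alpha, beta, gamma):
    # Rocchio relevance feedback: shift the query toward the centroid of
    # relevant documents and away from the centroid of non-relevant ones.
    mu_r = np.mean(rel, axis=0)
    mu_nr = np.mean(nonrel, axis=0)
    return alpha * q0 + beta * mu_r - gamma * mu_nr

q0 = np.array([1.0, 0.0, 1.0])
# Hypothetical vectors chosen so the relevant centroid equals q0.
rel = np.array([[2.0, -1.0, 1.0],
                [0.0,  1.0, 1.0]])
nonrel = np.array([[0.0, 5.0, 0.0]])

qm = rocchio(q0, rel, nonrel, alpha=0.5, beta=0.5, gamma=0.0)
print(np.allclose(qm, q0))  # True: the modified query equals the original
```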
4. Exercise 19.8
Before B eliminates duplicates, the two indexes are the same size; B's final index should be smaller, since it eliminates duplicates. Using the statistics collected from the indexes (45% of A's URLs are present in B, and 50% of B's URLs are present in A), we can determine the relative sizes of the two indexes: the overlap counted from either side is the same set of URLs, so 0.45 |A| ≈ 0.5 |B|, i.e., |A|/|B| ≈ 0.5/0.45 ≈ 1.11. So index A is about 1.11 times bigger than index B. So, 90% of the URLs of the Web don't have duplicates.
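The ratio follows from equating the overlap counted from each side:

```python
def relative_index_size(frac_a_in_b, frac_b_in_a):
    # The overlap is one set of URLs counted two ways:
    # frac_a_in_b * |A| = frac_b_in_a * |B|  =>  |A| / |B| = frac_b_in_a / frac_a_in_b
    return frac_b_in_a / frac_a_in_b

ratio = relative_index_size(0.45, 0.50)
print(round(ratio, 2))  # 1.11
```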
5. Exercise 20.3
The overall goal is to ensure “politeness” while never leaving crawl threads idle. The longer the wait before recrawling a site (say, ten times the last fetch time), the more likely a crawl thread will need to visit another site, which takes entries from the back queues. The two should balance each other: enough sites must be ready for the crawl threads, yet no site should be crawled too often. As te gets larger, the number of back queues needs to increase so that all the crawling threads keep busy. Since there are three times more back queues than crawler threads, at any given time only about 1/3 of the back queues are being crawled.
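The discipline above can be sketched as a min-heap keyed by each host's earliest allowed fetch time, simplified from the Mercator-style scheme of Ch. 20 (class and method names here are hypothetical):

```python
import heapq

class PolitenessScheduler:
    # Sketch: one back queue per host; the heap root is the host that may
    # be contacted soonest.
    def __init__(self, wait_factor=10.0):
        self.wait_factor = wait_factor  # te = factor * last fetch duration
        self.heap = []                  # entries: (next_allowed_time, host)

    def add_host(self, host, now=0.0):
        heapq.heappush(self.heap, (now, host))

    def next_host(self, now):
        # A crawl thread idles when even the root entry is not yet allowed:
        # a sign there are too few back queues per thread.
        allowed_at, host = self.heap[0]
        if allowed_at > now:
            return None
        return heapq.heappop(self.heap)[1]

    def record_fetch(self, host, finished_at, fetch_duration):
        # Re-insert the host, keyed by when it may politely be hit again.
        next_time = finished_at + self.wait_factor * fetch_duration
        heapq.heappush(self.heap, (next_time, host))

sched = PolitenessScheduler()
sched.add_host("example.com")
host = sched.next_host(now=0.0)              # "example.com"
sched.record_fetch(host, finished_at=1.0, fetch_duration=0.2)
print(sched.next_host(now=2.0))              # None: must wait until t = 3.0
print(sched.next_host(now=3.5))              # example.com
```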
6. Exercise 21.10
The PageRank calculation has two contributing factors. The first is the teleport probability α, and the second is the number of pages that contain links to your page. Looking solely at the teleport component, there is an equal probability of a random jump to any page on the Web; the teleport operation contributes α/N to every entry of the transition probability matrix. The random-walk (link-following) component adds to the teleport probability to form the transition probability matrix, so the minimum value that any element of the matrix can have is α/N. Therefore, in the steady-state PageRank vector, every value is at least α/N, since πj = Σi πi Pij ≥ (α/N) Σi πi = α/N. As α approaches 1, teleportation is the only aspect that affects PageRank, and all pages get the same rank, 1/N.
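The bound can be verified by power iteration on a small hypothetical graph (α is the teleport probability, as above):

```python
import numpy as np

alpha, N = 0.1, 4  # teleport probability and number of pages (hypothetical graph)
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
links = adj / adj.sum(axis=1, keepdims=True)  # random-walk (link-following) part
P = (1 - alpha) * links + alpha / N           # add teleport: every entry >= alpha/N

pi = np.full(N, 1.0 / N)
for _ in range(200):                          # power iteration to the steady state
    pi = pi @ P

print(pi.min() >= alpha / N)  # True: every PageRank value is at least alpha/N
```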
7. Exercise 21.19
After one unnormalized HITS update from an all-ones initialization, the hub score of a node is its number of outgoing edges (out-degree) and the authority score is its number of incoming edges (in-degree).
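This single unnormalized update can be checked on a small hypothetical graph:

```python
import numpy as np

# Hypothetical adjacency matrix: A[i, j] = 1 iff page i links to page j.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
ones = np.ones(3)

hub = A @ ones    # one unnormalized hub update from all-ones authority scores
auth = A.T @ ones # one unnormalized authority update from all-ones hub scores

print(hub)   # [2. 1. 1.]  = out-degrees (row sums)
print(auth)  # [1. 1. 2.]  = in-degrees (column sums)
```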
Bonus Questions (30 Points)
8. Exercise 1.11 (in writing the answer algorithm, it is OK to explain the main idea in
words, instead of in the style of Fig. 1.6)
To capture “x AND NOT y”, we modify the algorithm in Fig. 1.6 so that: (1) when the docIDs
on the two postings lists are the same, both pointers are simply advanced (the docID is
excluded); (2) when the docID on x’s postings list is LESS THAN that on y’s postings list,
the docID from x’s list is added to the answer; and (3) once y’s list is exhausted, all
remaining docIDs on x’s list are added to the answer. The complexity of this modified
algorithm is still O(x + y).
It is OK to explain the main idea in words instead of giving the algorithm shown below:

answer ← ⟨ ⟩
while p1 ≠ NIL and p2 ≠ NIL
do if docID(p1) = docID(p2)
     then p1 ← next(p1)
          p2 ← next(p2)
   else if docID(p1) < docID(p2)
     then ADD(answer, docID(p1))
          p1 ← next(p1)
   else p2 ← next(p2)
while p1 ≠ NIL
do ADD(answer, docID(p1))
   p1 ← next(p1)
return answer
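The same merge, written out as a runnable sketch (not from the textbook):

```python
def postings_and_not(x, y):
    """Return docIDs in sorted postings list x but not in y, in O(len(x) + len(y))."""
    answer = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:       # present in both lists: exclude it
            i += 1
            j += 1
        elif x[i] < y[j]:      # present in x only: keep it
            answer.append(x[i])
            i += 1
        else:
            j += 1
    answer.extend(x[i:])       # y exhausted: the rest of x all qualify
    return answer

print(postings_and_not([1, 3, 5, 8, 9], [2, 3, 9]))  # [1, 5, 8]
```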
9. Exercise 2.13
(a) is True. Since we only need to determine the list of documents that satisfy a /k clause
(not enumerate every matching position pair), in the algorithm of Fig. 2.12 the check
“|pos(pp1) − pos(pp2)| <= k” is the main part. Each time docID(p1) = docID(p2) holds, at
most freq(p1) + freq(p2) position comparisons are needed, i.e.,
O(L), independent of k.
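The doc-level check can be sketched as a two-pointer scan over the sorted position lists (hypothetical function name). Each step discards one position for good, so the cost is O(L) regardless of k:

```python
def doc_matches_within_k(pos1, pos2, k):
    # Two-pointer scan over sorted position lists: O(len(pos1) + len(pos2)).
    # If the smaller position is more than k below the other, it can never
    # match any later position either, so it is safe to advance past it.
    i = j = 0
    while i < len(pos1) and j < len(pos2):
        if abs(pos1[i] - pos2[j]) <= k:
            return True
        if pos1[i] < pos2[j]:
            i += 1
        else:
            j += 1
    return False

print(doc_matches_within_k([1, 7, 20], [11, 30], 4))  # True  (7 and 11)
print(doc_matches_within_k([1, 7, 20], [13, 30], 4))  # False
```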
10. Exercise 8.10 (two sub-questions a and b only)
a.
Kappa = [P(A) - P(E)] / [1 - P(E)]
P(A) = 4/12 = 1/3
P(E) = 0.25 + 0.25 = 0.5
Kappa = (1/3 - 0.5)/(1 - 0.5) = -(1/6)/(1/2) = -1/3 ≈ -0.333
A kappa of -0.333 implies that the two judges’ agreement regarding relevance is worse than chance.
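For a concrete check, here is one 2×2 contingency table consistent with P(A) = 4/12 and P(E) = 0.5 above (the exact cell counts are a hypothetical reconstruction):

```python
# Hypothetical table: both say relevant = 2, both say non-relevant = 2,
# disagreements = 4 in each direction, for 12 judged documents.
both_rel, rel_nonrel, nonrel_rel, both_nonrel = 2, 4, 4, 2
n = both_rel + rel_nonrel + nonrel_rel + both_nonrel  # 12

p_agree = (both_rel + both_nonrel) / n                       # P(A) = 1/3
p_rel = (2 * both_rel + rel_nonrel + nonrel_rel) / (2 * n)   # pooled P(relevant)
p_chance = p_rel ** 2 + (1 - p_rel) ** 2                     # P(E) = 0.5

kappa = (p_agree - p_chance) / (1 - p_chance)
print(round(kappa, 3))  # -0.333
```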
b.
P = 1/5: out of 5 returned documents {4, 5, 6, 7, 8}, only document 4 is judged relevant by
both judges.
R = 1/2: There are 4 documents where both judges “agree”, either as relevant {3,4} or
irrelevant {1,2}. The question is a bit ambiguous, but I think it should be interpreted as
“when both judges agree it is RELEVANT”; that is, only {3,4} count as true answers. Out of
these two documents {3,4}, only document 4 is returned by the system, yielding a recall
of 50%.
F1 = 2PR/(P+R) = 2/7
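The arithmetic can be verified directly:

```python
p, r = 1 / 5, 1 / 2
f1 = 2 * p * r / (p + r)  # harmonic mean of precision and recall
print(round(f1, 4))  # 0.2857 (= 2/7)
```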
Grading Rubric. For each question, the corresponding max points are specified. Partial credit
will be given to answers that are only partially or slightly incorrect, but no more
than 70% of the max points. For substantially wrong answers, regardless of effort,
NO partial credit will be given. Therefore, please double-check all your answers carefully,
including simple calculations.