HW#1 (Due Tuesday, July 8th in class):
Q1.
Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular
document d, the maximum number of occurrences of any word is 15. Approximately what is the TF.IDF score for w in d
if w appears in d (a) once, (b) five times?
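
One way to set up the arithmetic for Q1 is sketched below. It assumes the common textbook definition, TF = (occurrences of w in d) / (maximum occurrences of any word in d) and IDF = log2(N / n), where N is the number of documents and n is the number containing w. The assignment does not state which variant to use, so treat the log base and the normalization as assumptions. The function and variable names are illustrative only.

```python
import math


def tf_idf(count, max_count, n_docs, docs_with_term):
    """TF.IDF under the assumed definition: (count / max_count) * log2(N / n)."""
    tf = count / max_count                     # term frequency, normalized by the most frequent word in d
    idf = math.log2(n_docs / docs_with_term)   # inverse document frequency, base 2 (assumption)
    return tf * idf


N_DOCS = 10_000_000   # documents in the repository
DOCS_WITH_W = 320     # documents containing word w
MAX_COUNT = 15        # maximum occurrences of any word in document d

score_once = tf_idf(1, MAX_COUNT, N_DOCS, DOCS_WITH_W)
score_five = tf_idf(5, MAX_COUNT, N_DOCS, DOCS_WITH_W)

print(f"(a) once:       {score_once:.3f}")
print(f"(b) five times: {score_five:.3f}")
```

Under these assumptions the IDF is log2(31250), roughly 14.93, so the two scores come out near 1 and near 5. A different log base or TF normalization would scale the results.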
Q2.
Using Hive and the Titanic dataset, compare the survival probabilities for children in the 1st, 2nd, and 3rd
classes. Show your Hive queries and provide screenshots of the query results.
Q3.
(f) Using R and the arules package, extract the association rules and sort them by
lift (highest to lowest). State your assumptions regarding the minimum support and confidence. Show
your R script and the resulting rules.
Q4.