HW#1 (Due Tuesday, July 8th in class):

Q1. Suppose there is a repository of ten million documents, and word w appears in 320 of them. In a particular document d, the maximum number of occurrences of a word is 15. Approximately what is the TF.IDF score for w if that word appears (a) once, (b) five times?

Q2. Using Hive and the Titanic dataset, compare the survival probabilities for children in the 1st, 2nd and 3rd classes. Show your Hive queries and provide screen snapshots of the query results.

Q3. (f) Using R and the arules package, extract the association rules and sort them based on the lift (highest to lowest). State your assumptions regarding the support and confidence. Show your R script and the resulting rules.

Q4.
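
The arithmetic behind Q1 can be sketched as follows. This is a minimal illustration, not a prescribed solution: it assumes term frequency is the word's count in d divided by the count of the most frequent word in d, and that inverse document frequency is the base-2 logarithm of (total documents / documents containing w). The assignment does not state the logarithm base, so base 2 is an assumption, and the function name `tf_idf` and its parameter names are hypothetical.

```python
import math


def tf_idf(count, max_count, total_docs, docs_with_term):
    """TF.IDF under the assumed definitions (max-normalized TF, base-2 IDF)."""
    # Term frequency: occurrences of the word in the document, normalized
    # by the count of the most frequent word in that same document.
    tf = count / max_count
    # Inverse document frequency: base-2 log is an assumption here.
    idf = math.log2(total_docs / docs_with_term)
    return tf * idf


if __name__ == "__main__":
    # Values from Q1: ten million documents, w appears in 320 of them,
    # and the most frequent word in d occurs 15 times.
    for count in (1, 5):
        score = tf_idf(count, 15, 10_000_000, 320)
        print(f"count={count}: TF.IDF = {score:.3f}")
```

Under these assumptions the IDF term is log2(31250), roughly 14.93, so the scores scale linearly with the count: one fifteenth of the IDF for a single occurrence and one third of it for five occurrences. A different logarithm base would rescale both scores by the same constant factor.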