HOMEWORK 2: Data Mining
Deadline: 2017/4/4 19:00
Name:
Student ID:

Submit your homework via FTP before the deadline. Name the file as your student ID + Homework2.pdf (e.g., 123456+Homework2.pdf).

1. Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.

2. Consider the following data set for a binary class problem.
(1) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
(2) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

3. Consider the decision trees shown in the following figure. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of each decision tree according to the minimum description length (MDL) principle. The total description length of a tree is given by:

Cost(tree, data) = Cost(tree) + Cost(data | tree)

- Each internal node of the tree is encoded by the ID of the splitting attribute. If there are m attributes, the cost of encoding each attribute is log2(m) bits.
- Each leaf is encoded using the ID of the class it is associated with. If there are k classes, the cost of encoding a class is log2(k) bits.
- Cost(tree) is the cost of encoding all the nodes in the tree. To simplify the computation, you may assume that the total cost of the tree is obtained by adding up the costs of encoding each internal node and each leaf node.
- Cost(data | tree) is encoded using the classification errors the tree commits on the training set. Each error is encoded by log2(n) bits, where n is the total number of training instances.

Which decision tree is better according to the MDL principle? Please provide detailed explanations.

4. Consider the data set shown in the following table.
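The quantities asked for in problems 2 and 3 can be computed mechanically. The following sketch implements information gain, gain in the Gini index, and the MDL cost formula above. The data set and the tree/error counts used in the example are hypothetical placeholders, since the actual table and figure are not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(attr_values, labels, impurity):
    """Gain of splitting on an attribute, for a given impurity measure
    (entropy -> information gain; gini -> gain in the Gini index)."""
    n = len(labels)
    weighted_children = 0.0
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        weighted_children += len(subset) / n * impurity(subset)
    return impurity(labels) - weighted_children

def mdl_cost(n_internal, n_leaves, n_errors, m_attrs, k_classes, n_train):
    """Total description length: Cost(tree) + Cost(data | tree),
    using the encoding scheme stated in problem 3."""
    return (n_internal * math.log2(m_attrs)      # internal nodes
            + n_leaves * math.log2(k_classes)    # leaf nodes
            + n_errors * math.log2(n_train))     # training errors

# Hypothetical binary-class data (NOT the table from the assignment):
A = ['T', 'T', 'T', 'F', 'F', 'F', 'F', 'T']
y = ['+', '+', '-', '-', '-', '-', '+', '-']
print(round(split_gain(A, y, entropy), 4))  # information gain of A
print(round(split_gain(A, y, gini), 4))     # Gini gain of A

# Hypothetical tree: 2 internal nodes, 3 leaves, 7 errors;
# m = 16 attributes, k = 3 classes as in problem 3, n = 100 instances.
print(round(mdl_cost(2, 3, 7, 16, 3, 100), 2))
```

Repeating `split_gain` for each candidate attribute and picking the largest value gives the split the induction algorithm would choose; comparing `mdl_cost` across trees answers the MDL question.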
(1) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(2) Use the conditional probabilities estimated in part (1) to predict the class label of a test sample (A = 0, B = 1, C = 0) using the naïve Bayes approach.
(3) Estimate the conditional probabilities using the m-estimate approach with p = 1/2 and m = 4.
(4) Repeat part (2) using the conditional probabilities given in part (3).
(5) Compare the two methods for estimating probabilities. Which method is better, and why?

5. The following figure illustrates the Bayesian belief network for the data set shown below. (Assume that all the attributes are binary.)
[Figure: Bayesian belief network]
(1) Draw the probability table for each node in the above network.
(2) Use the Bayesian network to compute P(Engine = Bad, Air Conditioner = Broken).
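A minimal sketch of the naïve Bayes prediction in problem 4, covering both the relative-frequency estimate (m = 0) and the m-estimate (n_c + m*p) / (n + m). The training rows below are hypothetical, since the assignment's table is not reproduced here; substitute the actual data to obtain the answers for parts (2) and (4).

```python
from collections import Counter

def m_estimate(n_c, n, p, m):
    """m-estimate of P(attribute value | class): (n_c + m*p) / (n + m).
    With m = 0 this reduces to the plain relative frequency n_c / n."""
    return (n_c + m * p) / (n + m)

def naive_bayes_predict(train, labels, test, p=0.5, m=0):
    """Predict the class of `test` (a dict of attribute -> value)
    by maximizing P(c) * prod_i P(x_i | c)."""
    class_counts = Counter(labels)
    n = len(labels)
    best, best_score = None, -1.0
    for c, n_class in class_counts.items():
        score = n_class / n  # class prior P(c)
        for attr, value in test.items():
            n_c = sum(1 for row, y in zip(train, labels)
                      if y == c and row[attr] == value)
            score *= m_estimate(n_c, n_class, p, m)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data (NOT the table from the assignment):
train = [{'A': 0, 'B': 0, 'C': 0}, {'A': 0, 'B': 0, 'C': 1},
         {'A': 0, 'B': 1, 'C': 1}, {'A': 0, 'B': 1, 'C': 1},
         {'A': 1, 'B': 0, 'C': 1}, {'A': 1, 'B': 0, 'C': 1},
         {'A': 1, 'B': 1, 'C': 0}, {'A': 1, 'B': 1, 'C': 0}]
labels = ['-', '-', '-', '-', '+', '+', '+', '+']
test = {'A': 0, 'B': 1, 'C': 0}

print(naive_bayes_predict(train, labels, test))            # part (2): m = 0
print(naive_bayes_predict(train, labels, test, p=0.5, m=4))  # part (4): m-estimate
```

Note how the relative-frequency estimate can produce a zero conditional probability (here P(A=0|+) = 0), which wipes out the whole product for that class; the m-estimate avoids this, which is the key point of comparison asked for in part (5).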