Assignment 4
Data Mining
Due: Friday, November 28, 9:00 pm

Data Mining

Data mining is a powerful tool for studying patterns and relations in the large amounts of data that surround us. In class, we have learned about clustering (unsupervised learning), classification (using principal component analysis, networks, fuzzy logic, and other learning tools), and models that help to understand and predict values for new data based on a training data set (using decision trees and association rules).

There are a number of commercial as well as research products that implement some or all of the above tools. One such example is Weka, open source software issued under the GNU General Public License. Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. More advanced programmers and researchers can use Weka to develop new machine learning and data mining schemes.

What you should do:

1: Follow the link to the official Weka web site, http://www.cs.waikato.ac.nz/ml/weka/, and read the “Getting started” information. Download and install Weka version weka-3-6-11 for Windows, Mac or Linux. Note that the Windows x64 version has been tested with both self-extracting executables, weka-3-6-11jre.exe and weka-3-6-11.exe. Install and run the software. Remember that if you are using lab machines, the software must be uninstalled after you complete the assignment.

2: Run Weka (Weka 3.6 with console). You should be able to see the basic interface.

3: Sample data sets are downloaded with the Weka program files. The specific data file format that Weka works with is .arff. Use the /data/contact-lenses.arff data set for this assignment. You can browse other datasets if desired.
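Step 3 refers to the .arff format. As a rough illustration of what such a file contains, the Python sketch below reads a tiny hand-written excerpt in the style of contact-lenses.arff. The excerpt (with a reduced attribute set), the name `SAMPLE_ARFF`, and the `parse_arff` helper are assumptions made for illustration only. The sketch handles nominal attributes only and is not how Weka itself loads data; check the real file shipped with Weka for the actual contents.

```python
# Minimal ARFF reader sketch: nominal attributes only, no quoting,
# no sparse rows. Not a replacement for Weka's own loader.

# Illustrative excerpt only; verify against the real data/contact-lenses.arff.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign
@relation contact-lenses

@attribute age {young, pre-presbyopic, presbyopic}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}

@data
young,no,reduced,none
young,no,normal,soft
young,yes,normal,hard
"""


def parse_arff(text):
    """Return (relation, attributes, rows) parsed from simple ARFF text."""
    relation = None
    attributes = []  # list of (name, allowed_values) pairs
    rows = []        # one list of string values per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            values = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True  # everything after this line is instance data
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE_ARFF)
print(relation, len(attributes), len(rows))
```

The header declares each attribute and its allowed values, and every line after `@data` is one instance with one value per attribute, in declaration order.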
For the lenses data set, demonstrate that you can perform the following FIVE major Weka functionalities in the Weka EXPLORER PACKAGE (you can find out more about them from the on-line help or the comprehensive Weka tutorials under the Documentation links).

A. [6p total] Run the Classify utility. For the test option, choose “Use training set”.
A1. [1p] For the Prism classifier, report the resulting Prism rules.
A2. [3p] For the Decision table classifier, report the number of correctly classified instances, the number of incorrectly classified instances, and the mean absolute error.
A3. [2p] For the Ridor classifier, report the number of rules and list them all.

B. [6p total] Run the Cluster functionality. Run the DBScan, Hierarchical Clustering and SimpleKMeans methods (on the training set). Store clusters for visualization. For Hierarchical Clustering and SimpleKMeans, set the number of clusters to 5 by clicking on the string with parameters next to the CHOOSE button (below Clusterer) [2p]. Use the default settings for DBScan [1p]. Report the clustering results for all 3 methods (clusterer’s output) [3p].

C. [4p total] Run the Associate functionality with the Apriori associator on the lenses dataset. Provide written answers based on the resulting run.
C1. [1p] What is the minimum support reported?
C2. [1p] What is the minimum confidence?
C3. [1p] What are the generated sets of large itemsets?
C4. [1p] What are the best rules found?

D. [2p total] Select Attributes functionality: for the Attribute Evaluator, choose Principal Components, with Ranker as the Search Method (parameters chosen by the system). Use the full training set. Provide a screenshot of the Attribute selection output (with correlation matrix, eigenvalues and eigenvectors). No discussion of the output is needed.

E. [2p total] Provide a screenshot of the visualization for the lenses dataset (with all attributes). Please change PlotSize, Jitter and Colors from their defaults (no need for multiple screenshots; one is enough for one chosen setting). Expand one chosen quadrant (one only) to show its X/Y point distribution.

F. [4p total] There is a variety of applications and projects that use Weka.
The full list is available under the Further Information, Related Projects menu item. Some interesting examples are:

WekaMetal - a meta-learning extension to Weka
Tertius - a system for rule discovery
TClass - classifying multivariate time series
Bayesian Network Classifiers - with bindings for Weka
Agent Academy - Java integrated development framework for creating Intelligent Agents and Multi Agent Systems
GeneticProgramming - Genetic Programming Classifier for Weka
OpenSubspace - an open source framework for evaluation and exploration of subspace clustering algorithms in WEKA
Olex-GA - a genetic algorithm for the induction of rule-based text classifiers
Graph RAT - a framework for combining graph and non-graph algorithms
TUBE - Tree-based Density Estimation Algorithms

Your goal is to choose ONE application from the above REDUCED LIST (there are more links on the web site, but they are less relevant to the course material), run it, and answer the questions below.

Written description:
Q1. [1p] Name of the chosen Weka project from the above list and a one-sentence justification of why this project/topic was chosen.
Q2. [1p] Main functionality of the chosen project (one paragraph).
Q3. [1p] Example applications (i.e. which data sets/databases can be studied with this tool).
Q4. [1p] Your experience of how easy it was to run, or whether it was possible to run it at all.

NOTE: due to the highly distributed and complex nature of these projects, some links might be deactivated during the course of the assignment. If the issue persists, please choose an alternative project and inform your TA that the project is no longer available.

What to submit

Submit a WRITTEN REPORT as a .doc or .pdf file to your TA, according to your TA's requirements. The course late assignment policy allows submission up to 2 days late, based on the date and time the report is received by your TA, with a penalty of 10% of your mark for each late day. A sample file for testing your program may be provided by your TA.
Collaboration

The assignment must be done individually, so everything that you hand in must be your original work. Copying another student's work is academic misconduct.