Sample Resume : Data Mining

CAP4770 - Introduction to Data Mining
Final Project Instruction
Data:
We use a gene data set as our data for the final project. This data set is in attributes-in-rows format, comma-separated values. It can be downloaded by following this link: http://users.cis.fiu.edu/~lli003/teaching/hw-sol/finalproject_datafiles.zip Username/Password: CAP4770/student
The zip file contains three files:
train.csv: training data, consisting of 69 instances with 7,070 attributes.
train_class.txt: training classes, i.e., the true label of each instance in the training data, listed in the same order. There are 5 classes in total: MED, MGL, RHB, JPA and EPD.
test.csv: test data, consisting of 112 unlabeled instances with 7,070 attributes.
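Because the data is stored with attributes in rows, most tools will need it transposed so that each row is one instance. The following Python sketch shows one way to do that. It assumes the first column of each row holds the attribute name and that there is no header row; the real layout of train.csv is not described here and should be checked first. The sample text and the function name read_attributes_in_rows are hypothetical.

```python
import csv
import io

# Hypothetical miniature stand-in for train.csv: one row per attribute (gene),
# one column per instance. ASSUMPTION: first column is the attribute name and
# there is no header row; verify against the real file before reusing this.
SAMPLE_CSV = """\
gene_a,25,310,4800
gene_b,12,18000,950
"""


def read_attributes_in_rows(text):
    """Return (attribute_names, instances) from attributes-in-rows CSV text."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    names = [row[0] for row in rows]
    values = [[float(v) for v in row[1:]] for row in rows]
    # Transpose: each inner list becomes one instance, one value per attribute.
    instances = [list(column) for column in zip(*values)]
    return names, instances


names, instances = read_attributes_in_rows(SAMPLE_CSV)
```

With the real files, the same transposition should yield 69 instances from train.csv and 112 from test.csv, each with 7,070 values.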
Goal:
Learn the best
You need to do some preprocessing, e.g., feature selection to select the N attributes that are most important and predictive, by using Weka or writing a Java program.
You can use any classification method available in Weka or other tools, as long as your report explains clearly why you chose it.
You can use ensemble classification to get a more reasonable classifier; however, you need to consider carefully, from a technical perspective, how to integrate the different classifiers to get better results.
It is better if I can reconstruct your classifier in my own environment; therefore, your report should include specific instructions on how to build your classifier.
The more accurate your classifier is and the more detailed the explanation in your report, the more points you get.
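For the ensemble option mentioned above, one simple way to integrate several classifiers is majority voting over their predicted labels. The Python sketch below illustrates only that combination step. The function name majority_vote and the example predictions are hypothetical, and majority voting is just one possible integration scheme, not one prescribed by this project.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one label per instance.

    predictions_per_classifier: list of lists; each inner list holds one
    classifier's predicted labels, all in the same instance order.
    """
    combined = []
    for votes in zip(*predictions_per_classifier):
        # most_common(1) returns the label with the highest count; on a tie
        # it returns the tied label that appeared first in `votes`.
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from three classifiers on three test instances.
predictions = [
    ["MED", "RHB", "JPA"],
    ["MED", "MGL", "JPA"],
    ["EPD", "MGL", "MED"],
]
final_labels = majority_vote(predictions)
```

Listing the classifier you trust most first gives it the tie-break, which is one small, explicit design decision the report can justify.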

The following steps suggest one way of finding the best classifier. Note that this is just one way of doing the project. You can definitely make improvements or use other approaches to get a classifier with higher accuracy.
Data Cleaning
Threshold both train and test data to a minimum value of 20, maximum of 16,000.
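The thresholding step amounts to clipping every value into the range [20, 16,000]. A minimal Python sketch follows, assuming the data has already been loaded as a list of numeric rows; the function name threshold and the sample values are hypothetical.

```python
MIN_VALUE = 20      # floor from the data cleaning step
MAX_VALUE = 16000   # ceiling from the data cleaning step


def threshold(instances, low=MIN_VALUE, high=MAX_VALUE):
    """Clip every expression value into the closed range [low, high]."""
    return [[min(max(value, low), high) for value in row] for row in instances]


# Hypothetical values: 12.0 is raised to 20 and 18000.0 is lowered to 16000.
cleaned = threshold([[25.0, 12.0], [310.0, 18000.0]])
```

The same function should be applied to both the train and the test data so that the two sets stay on the same scale.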
Selecting top genes (i.e., attributes) by class
You can use any feature selection method from Weka.
Here is one method: remove from the train data the genes whose fold difference across samples is less than 2. (Note that fold difference is defined as (max-min)/2, where max and min are the maximum and minimum values of the gene expression for all the
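The fold-difference filter described in this step can be sketched in Python as follows. It uses the definition stated above, (max-min)/2 computed per gene across the training samples, and assumes the data is held with one row per instance. The function names and sample values are hypothetical.

```python
def fold_difference(values):
    """Fold difference as defined in this handout: (max - min) / 2."""
    return (max(values) - min(values)) / 2


def filter_genes(names, instances, min_fold=2):
    """Drop genes whose fold difference across instances is below min_fold."""
    kept = [
        j for j in range(len(names))
        if fold_difference([row[j] for row in instances]) >= min_fold
    ]
    kept_names = [names[j] for j in kept]
    kept_instances = [[row[j] for j in kept] for row in instances]
    return kept_names, kept_instances


# Hypothetical training data: three instances, three genes.
gene_names = ["g1", "g2", "g3"]
train = [[20, 100, 50], [21, 400, 54], [22, 250, 52]]
# g1: (22-20)/2 = 1 (removed); g2: (400-100)/2 = 150; g3: (54-50)/2 = 2 (kept).
selected_names, selected_train = filter_genes(gene_names, train)
```

Whichever genes survive on the train data, the same columns would need to be kept in the test data so that both sets share one attribute list.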