Assignment 1: Using the WEKA Workbench

A. Become familiar with the use of the WEKA workbench to invoke several different machine learning schemes. Use the latest stable version. Use both the graphical interface (Explorer) and the command line interface (CLI). See the Weka home page for Weka documentation.

B. Use the following learning schemes, with their default settings, to analyze the weather data (in weather.arff). For test options, first choose "Use training set", then choose "Percentage Split" with the default 66% split. Report each model's percent error rate.
ZeroR (majority class)
OneR
Naive Bayes Simple
J4.8

C. Which of these classifiers are you more likely to trust when determining whether to play? Why?

D. What can you say about …
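As a minimal sketch of how part B could also be scripted against the Weka Java API instead of the Explorer (shown for ZeroR only; the file name and the 66% split come from the assignment, while the seed-1 shuffle before splitting is an assumption about the Explorer's default behaviour, so numbers may differ slightly):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeatherBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");    // the weather data from part B
        data.setClassIndex(data.numAttributes() - 1);         // class = last attribute ("play")

        // Test option 1: "Use training set" (evaluate on the same data the model was built from)
        ZeroR zeroR = new ZeroR();
        zeroR.buildClassifier(data);
        Evaluation onTrain = new Evaluation(data);
        onTrain.evaluateModel(zeroR, data);
        System.out.printf("Training-set error: %.2f %%%n", 100.0 - onTrain.pctCorrect());

        // Test option 2: "Percentage Split" with the default 66% (assumed shuffle with seed 1 first)
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        ZeroR split = new ZeroR();
        split.buildClassifier(train);
        Evaluation onSplit = new Evaluation(train);
        onSplit.evaluateModel(split, test);
        System.out.printf("66%% split error: %.2f %%%n", 100.0 - onSplit.pctCorrect());
    }
}

The other schemes can be swapped in the same way: weka.classifiers.rules.OneR, weka.classifiers.bayes.NaiveBayes (NaiveBayesSimple in older releases), and weka.classifiers.trees.J48.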
1, 2, ..., 38) and an Affymetrix "call" (P if the gene is present, A if absent, M if marginal). Think of the training data as a very tall and narrow table with 7130 rows and 78 columns. Note that it is "sideways" from a machine learning point of view: the attributes (genes) are in rows and the observations (samples) are in columns. This is the standard format for microarray data, but to use it with machine learning tools like WEKA we will need to transpose (flip) the matrix so that genes are in columns and samples are in rows. We will do that in step 3B.6 of this assignment. Here is a small extract:

Gene Description                         Gene Accession Number   1    call  2    call  ...
GB DEF = GABAa receptor alpha-3 subunit  A28102_at               151  A     263  P     ...
...                                      AB000114_at             72   A     21   A     ...
...                                      AB000115_at             281  A     250  P     ...
...                                      AB000220_at             36   A     43   A     ...

3B: Clean the data

Perform the following cleaning steps on both the train and test sets. Use unix tools, scripts, or other tools for each task. Document all the steps and create intermediate files for each step. After each step, report the number of fields and records in the train and test files. (Hint: use the unix command wc to find the number of records and use awk or gawk to find the number of fields.)

Microarray Data Cleaning Steps
1. Remove the initial records with Gene Description containing "control". (Those are Affymetrix controls, not human genes.) Call the resulting files ALL_AML_grow.train.noaffy.tmp and
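As a rough illustration of cleaning step 1 together with the record/field report (a sketch only: the input file name ALL_AML_grow.train.orig and the tab-separated layout are assumptions, and the one-line wc/awk commands hinted at above do the same job):

import java.nio.file.*;
import java.util.*;

public class RemoveControls {
    public static void main(String[] args) throws Exception {
        Path in  = Paths.get("ALL_AML_grow.train.orig");          // assumed input name
        Path out = Paths.get("ALL_AML_grow.train.noaffy.tmp");    // output name given in step 1

        List<String> kept = new ArrayList<>();
        int fields = 0;
        boolean header = true;
        for (String line : Files.readAllLines(in)) {
            String[] cols = line.split("\t", -1);                 // assumes tab-separated export
            // Keep the header row; drop Affymetrix control probes by Gene Description
            if (header || !cols[0].toLowerCase().contains("control")) {
                kept.add(line);
                fields = cols.length;
            }
            header = false;
        }
        Files.write(out, kept);
        // Equivalent of `wc -l` (records) and an awk NF count (fields) on the cleaned file
        System.out.println("records: " + kept.size() + ", fields: " + fields);
    }
}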
b) Label: Homologous chromosomes, sister chromatids, nonsister chromatids, chiasmata, synapsis, tetrad, centrioles, centromere, centrosome, metaphase plate, spindle fibers, microtubules, germ cell, gamete cells
p² + 2pq + q² = 1; where ‘p²’ represents the frequency of the homozygous dominant genotype, ‘2pq’ the frequency of the heterozygous genotype, and ‘q²’ the frequency of the homozygous recessive genotype
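A short worked example with assumed numbers: if 9% of a population shows the recessive phenotype, then q² = 0.09, so q = 0.3 and p = 1 − 0.3 = 0.7. The expected genotype frequencies are p² = 0.49 (homozygous dominant), 2pq = 2(0.7)(0.3) = 0.42 (heterozygous), and q² = 0.09 (homozygous recessive), which sum to 1.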
• Use these data to construct a map of these three genes, showing your work along the way. The
A 3. C 4. D 5. B 6. B 7.
List the inputs, any processes/calculations, and outputs. Use the same valid variable names you used in Step 1.
C. A chromatid is one of the two identical copies of a replicated chromosome; the copies remain joined at the centromere as sister chromatids until they separate during cell division.
For the implementation I used Java with an Apache Derby database to manipulate the dataset. Using Derby instead of working on the file directly may seem unnecessary, but it let me use SQL, which is well suited to manipulating data in bulk. It also makes the dataset more flexible and maintainable, allowing it to be modified, updated, or altered quickly.
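A minimal sketch of the kind of embedded Derby access this describes (needs derby.jar on the classpath; the database name, table, and columns below are made up for illustration, and only the JDBC URL form is standard Derby):

import java.sql.*;

public class DerbyExample {
    public static void main(String[] args) throws SQLException {
        // Embedded Derby stores the database in a local directory; "datasetdb" is a made-up name
        try (Connection con = DriverManager.getConnection("jdbc:derby:datasetdb;create=true");
             Statement st = con.createStatement()) {

            // Illustrative table; the real dataset's columns would go here (fails if it already exists)
            st.executeUpdate("CREATE TABLE samples (id INT PRIMARY KEY, label VARCHAR(16), value DOUBLE)");

            // Bulk manipulation through SQL instead of editing the flat file directly
            try (PreparedStatement ins =
                     con.prepareStatement("INSERT INTO samples (id, label, value) VALUES (?, ?, ?)")) {
                ins.setInt(1, 1);
                ins.setString(2, "ALL");
                ins.setDouble(3, 151.0);
                ins.executeUpdate();
            }

            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM samples")) {
                rs.next();
                System.out.println("rows loaded: " + rs.getInt(1));
            }
        }
    }
}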
In this module, the class label for the testing data is predicted. The n-dimensional feature vector for the testing data is derived from its query tree in the same way as in the data pre-processing phase. The SQLIA classifier then determines whether the new testing feature vector is normal or malicious by using the optimized SVM classification model.
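The text does not say which SVM library is used; as a stand-in, here is a minimal sketch with Weka's SMO (an SVM trainer), with made-up feature counts and values, showing the classify-as-normal-or-malicious step:

import java.util.ArrayList;
import java.util.Arrays;
import weka.classifiers.functions.SMO;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

public class SqliaPredict {
    public static void main(String[] args) throws Exception {
        // Illustrative header: 4 numeric features plus a {normal, malicious} class
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (int i = 0; i < 4; i++) attrs.add(new Attribute("f" + i));
        attrs.add(new Attribute("class", Arrays.asList("normal", "malicious")));

        Instances train = new Instances("sqlia", attrs, 0);
        train.setClassIndex(train.numAttributes() - 1);
        // Dummy training vectors so the sketch runs; real ones would come from the query trees
        train.add(vector(train, new double[]{0.1, 0.0, 0.2, 0.1}, "normal"));
        train.add(vector(train, new double[]{0.9, 1.0, 0.8, 0.7}, "malicious"));

        SMO svm = new SMO();              // Weka's SVM implementation (stand-in for the optimized model)
        svm.buildClassifier(train);

        // Feature vector extracted from the testing query tree (values made up here)
        Instance query = vector(train, new double[]{0.8, 0.9, 0.7, 0.6}, "normal"); // class value is ignored
        double predicted = svm.classifyInstance(query);
        System.out.println("prediction: " + train.classAttribute().value((int) predicted));
    }

    private static Instance vector(Instances header, double[] feats, String label) {
        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);
        for (int i = 0; i < feats.length; i++) inst.setValue(i, feats[i]);
        inst.setValue(header.numAttributes() - 1, label);
        return inst;
    }
}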
Training an artificial neural network involves selecting one model from a set of allowed models, and there are several associated algorithms for training it.
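As a concrete, hedged illustration (not from the source text) of one such model/algorithm pairing, here is Weka's MultilayerPerceptron, a feed-forward network trained by backpropagation; the file name and parameter values are arbitrary:

import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAnn {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");      // any ARFF file with a nominal class
        data.setClassIndex(data.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron(); // model choice: feed-forward network
        mlp.setHiddenLayers("4");                              // structure choice: one hidden layer of 4 units
        mlp.setLearningRate(0.3);                              // algorithm choice: backpropagation settings
        mlp.setTrainingTime(500);                              // number of training epochs
        mlp.buildClassifier(data);
        System.out.println(mlp);
    }
}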
c. Now you can see how my rival made a mistake, because they did not evaluate and understand the facts as clearly as they should have.
Pairs of alternative traits were expressed in the F2 generation in a ratio of ¾ dominant to ¼ recessive (the 3:1 segregation ratio referred to as the Mendelian ratio)
Empty and clean the test tubes and repeat steps 2-10 for a second dataset.