William Hudson
CYBR7240 Assignment 3
Kennesaw State University
(5 Points) When presented with a dataset, it is usually a good idea to visualise it first. Go to the Visualise tab. Click on any of the scatter plots to open a new window showing the scatter plot for two selected attributes. Try visualising a scatter plot of age and duration. Do you notice anything unusual? You can click on any data point to display all its values.

There is one outlier in the bottom left of the graph; its values are shown in the screenshot.
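The same check can be done outside the GUI. Below is a minimal sketch using Weka's Java API that prints the range of every numeric attribute; an impossible minimum (such as a negative age) flags the corrupted instance. The file name credit-g.arff is an assumption about the dataset used here, so adjust it to your copy.

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectData {
    public static void main(String[] args) throws Exception {
        // Load the dataset (file name is an assumption; adjust to your copy).
        Instances data = new DataSource("credit-g.arff").getDataSet();

        // Print min/max of every numeric attribute; a nonsensical minimum
        // (e.g. a negative age) points to the corrupted data point.
        for (int i = 0; i < data.numAttributes(); i++) {
            if (data.attribute(i).isNumeric()) {
                AttributeStats stats = data.attributeStats(i);
                System.out.printf("%-20s min=%.1f max=%.1f%n",
                        data.attribute(i).name(),
                        stats.numericStats.min, stats.numericStats.max);
            }
        }
    }
}
```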
(5 Points) In the previous point you should have found a data point which seems to be corrupted, as some of its values are nonsensical. Even a single point like this can significantly affect the performance of a classifier. How do you think it would affect decision trees? A good way to check this is to test the performance of each classifier before and after removing this data point.

It would skew the visualisation, pushing most of the data to the right side of the graph, because several of the corrupted values are much lower than the rest of the "normal" dataset.
(10 Points) To remove this instance from the dataset we will use a filter. We want to remove all instances where the age of an applicant is lower than 0 years, as this suggests that the instance is corrupted. In the Preprocess tab, click Choose in the Filter pane. Select filters > unsupervised > instance > RemoveWithValues. Click on the text of this filter to change its parameters. Set the attribute index to 13 (age) and the split point to 0. Click OK to set the parameters and Apply to apply the filter to the data. Visualise the data again to verify that the invalid data point was removed.
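For reference, the same filter can be applied programmatically. This is a sketch of the equivalent Java API calls, again assuming the credit-g.arff file; the attribute index in the option string is 1-based, matching the GUI value of 13.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemoveWithValues;

public class RemoveCorrupted {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();

        // Mirror the GUI settings: attribute 13 (age) with split point 0,
        // so instances whose age falls below 0 are removed.
        RemoveWithValues filter = new RemoveWithValues();
        filter.setAttributeIndex("13");
        filter.setSplitPoint(0.0);
        filter.setInputFormat(data);

        Instances clean = Filter.useFilter(data, filter);
        System.out.println("Before: " + data.numInstances()
                + "  After: " + clean.numInstances());
    }
}
```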
(20 Points) On the Classify tab, select the Percentage split test option and change its value to 90%. This way, we will train the classifiers on 90% of the data and evaluate their performance on the remaining 10%. First, train a decision tree classifier with default options: select classifiers > trees > J48 and click Start. J48 is the Weka implementation of the C4.5 algorithm, which uses the normalized information gain criterion to build a decision tree for classification.
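The same experiment can be scripted. The sketch below trains J48 with default options on a 90/10 split; it assumes the class attribute is the last one and shuffles the data first, which approximates (but may not exactly reproduce) the Explorer's percentage-split behaviour.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("credit-g.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class assumed last

        // 90/10 percentage split after shuffling (seed 1, as in the Explorer).
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.9);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize,
                data.numInstances() - trainSize);

        // Train a C4.5-style decision tree with default options.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Evaluate on the held-out 10%.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}
```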