Tutorial_W10_slides (pptx, 48 pages)
Queensland University of Technology, Computer Science, Course 509, Dec 6, 2023
IFN509 Data Exploration and Mining
Week 10 Tutorial: Predictive Modelling – Decision Trees
Thiru Balasubramaniam
Lecture topics for Weeks 8-11: Predictive Mining
- Predictive mining process
- Decision tree classification
- Linear and logistic regression
- Neural networks
- K-nearest neighbour (a brief introduction)
Outline of this week's tutorial
Part 1: Reflective Exercises (60 min)
- Exercise 1: Predictive mining introduction: basics
- Exercise 2: Predictive modelling: process
- Exercise 3: Decision trees: introduction
Part 2: Practical Exercises (60 min)
1. Preparing data for predictive mining using 'veteran.csv'
2. Building your first decision tree model
Part 1: Reflective Exercises
Exercise 1: Predictive mining introduction: basics
1. Compare classification, clustering and association mining.

Aspect      | Prediction                                               | Clustering                                                               | Association
Application | Used in forecasting                                      | Used in description                                                      | Used in description
Techniques  | Decision trees, ANN, regression                          | K-means, hierarchical                                                    | Apriori (generate and test), FP-tree
Type        | Supervised                                               | Unsupervised                                                             | Unsupervised (counting frequencies)
Measures    | Accuracy, R-square, precision, recall, ROC curve         | Inter-/intra-cluster similarity: silhouette coefficient, purity, entropy | Support, confidence, lift
Output      | Prediction of a target variable; rules, trees, networks  | Clusters                                                                 | Frequent patterns, association rules
Input       | Multiple variables with class values (best if all filled); dense data set | Multiple variables without class values (best if all filled); dense data set | Multiple variables (only very few present in a record); sparse data set
Exercise 1: Predictive mining introduction: basics
2. State the differences and similarities between classification rules and association rules.

Classification rules:
- Focus on one target variable
- Need to specify the class (or label) in all cases
- Only the class attribute (target) can appear in the RHS of a rule
- Measures: accuracy, comprehensibility

Association rules:
- Many target variables
- Do not need to specify a class in any case
- Any combination of attributes can appear in the RHS of a rule
- Measures: support, confidence, lift
Exercise 1: Predictive mining introduction: basics
3. Supervised learning and unsupervised clustering both require at least one
a. hidden attribute.
b. output attribute.
c. input attribute.
d. categorical attribute.
Ans: c
Exercise 1: Predictive mining introduction: basics
4. Supervised learning differs from unsupervised clustering in that supervised learning requires
a. at least one input attribute.
b. input attributes to be categorical.
c. at least one output attribute.
d. output attributes to be categorical.
Ans: c
Exercise 1: Predictive mining introduction: basics
5. Suppose you are a luxury automobile dealer and your dealership is planning to sell a new model. Which predictive or descriptive method (classification, regression or clustering) is most appropriate to answer each of the following questions?
a. How much should you charge for the new BMW X6M?
b. How likely is person X to buy the new BMW X5W?
c. What type of customers have bought the silver BMW M5? List their features.
Solution:
a. Regression (prediction of a value)
b. Classification (prediction of a class)
c. Clustering (describing the characteristics of customers using features)
Exercise 1: Predictive mining introduction: basics
6. Determine which is the best approach for each problem: (a) supervised learning, (b) unsupervised clustering, (c) data query.
1. What is the average weekly salary of all female employees under forty years of age?
2. Develop a profile for credit card customers likely to carry an average monthly balance of more than $1000.00.
3. Determine the characteristics of a successful used car salesperson.
4. Identify the similarities shared by a group of customers holding one or more insurance policies.
5. Do meaningful attribute relationships exist in a database containing information about credit card customers?
6. Do single men play more golf than married men?
7. Determine whether a credit card transaction is valid or fraudulent.
Exercise 1: Predictive mining introduction: basics
6. Solution:
Data query: (1) average weekly salary of female employees under forty; (6) do single men play more golf than married men?
Supervised learning: (2) profile of credit card customers likely to carry a monthly balance over $1000.00; (3) characteristics of a successful used car salesperson; (7) whether a credit card transaction is valid or fraudulent.
Unsupervised clustering: (4) similarities among customers holding one or more insurance policies; (5) whether meaningful attribute relationships exist in a credit card customer database.
Exercise 1: Predictive mining introduction: basics
7. Which statement is true about predictive mining problems?
a. The output attribute must be categorical.
b. The output attribute must be numeric.
c. The resultant model is designed to determine future outcomes.
d. The resultant model is designed to classify current behaviour.
Ans: c
Exercise 1: Predictive mining introduction: basics
8. Suppose a city council has a dataset collected from a housing study, a market response study with the markets as the city metropolitan area. The study includes the following variables:
1. Name of the suburb
2. Estimated value of the house in dollars
3. Air pollution
4. Crime rate
5. Percent of land zoned for lots
6. Percent of business that is industrial or non-retail
7. On the river or not
8. Average number of rooms per home
9. Percentage of homes built before 1995
10. Weighted distance to the CBD
11. Accessibility to highways
12. Tax rate
13. Pupil/teacher ratio in public schools
14. Percentage of the population of lower socioeconomic status
Do you think predictive mining on this data will yield the useful information required for decision making? Describe your strategy for using predictive mining on this data set.
Exercise 1: Predictive mining introduction: basics
8. Solution: Yes. For example, to predict a property's estimated value, every variable except the name of the suburb can be used in the analysis. Different prediction scenarios call for different variable selections: variables 4, 5, 6, 9, 13 and 14 could be used to predict whether a new high school is needed; variables 5, 6, 7, 9, 10, 11, 13 and 14 could be used to predict whether a new bus route is needed.
Exercise 2: Predictive modelling: process
1. For the given test dataset, identify the accuracy of the model.
Solution: 3/4
Exercise 2: Predictive modelling: process
2. For the given new instance, what will the model predict?
Solution: Good
Exercise 2: Predictive modelling: process
3. Of the given two models, which one is overfitting and which one fits better?
Solution: a. overfitting; b. better fit
Exercise 2: Predictive modelling: process
4. Consider the following training dataset with an imbalanced class distribution (red: 30, blue: 4). Which model overfits the training data? Also, how would you redistribute this data using undersampling and oversampling?
Solution: Model A overfits the data. Undersampling: 4 red and 4 blue data points. Oversampling: 30 red and 30 blue data points.
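The two resampling strategies can be sketched with the standard library alone; the label list below is a hypothetical stand-in for the exercise's 30-red / 4-blue dataset.

```python
import random

random.seed(0)

# Hypothetical stand-in for the exercise's imbalanced data: 30 red, 4 blue.
labels = ["red"] * 30 + ["blue"] * 4
red = [x for x in labels if x == "red"]
blue = [x for x in labels if x == "blue"]

# Undersampling: sample the majority class WITHOUT replacement down to the minority size.
red_under = random.sample(red, k=len(blue))
undersampled = red_under + blue          # 4 red + 4 blue

# Oversampling: sample the minority class WITH replacement up to the majority size.
blue_over = random.choices(blue, k=len(red))
oversampled = red + blue_over            # 30 red + 30 blue

print(len(undersampled), len(oversampled))  # 8 60
```

In practice a library helper such as scikit-learn's `sklearn.utils.resample` does the same with/without-replacement sampling on whole rows of a dataset.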
Exercise 2: Predictive modelling: process
5. Given the following graph showing overfitting, identify which curve represents the training error and which the test error.
Solution: Graph A: test error; Graph B: training error
Exercise 2: Predictive modelling: process
6. For the given dataset with actual and predicted classes, measure the following: a. TP, b. FP, c. TN, d. FN, e. TPR, f. FPR, g. Precision, h. Recall, i. F1, j. Specificity
Confusion Matrix
The confusion matrix (this generalizes to multi-class problems). Machine learning methods usually minimize FP + FN.
TPR (true positive rate) = TP / (TP + FN)
FPR (false positive rate) = FP / (TN + FP)

                 Predicted: Yes         Predicted: No
Actual: Yes      TP (true positive)     FN (false negative)
Actual: No       FP (false positive)    TN (true negative)
Classification measures
Precision: the proportion of the model's positive predictions that are correct, i.e. how many predicted positives are actual positive observations.
  Precision = TP / (TP + FP)
Recall: the proportion of all real positive observations that are predicted correctly.
  Recall (also called coverage or sensitivity) = TP / (TP + FN) = TPR
F1: the harmonic mean of precision and recall.
  F1 = (2 × precision × recall) / (precision + recall)
Specificity: the proportion of all real negative observations that are predicted correctly.
  Specificity = TN / (FP + TN) = 1 − FPR
Exercise 2: Predictive modelling: process
6. For the given dataset with actual and predicted classes, the measures are:
a. TP = 2
b. FP = 1
c. TN = 4
d. FN = 1
e. TPR = 2/3
f. FPR = 1/5
g. Precision = 2/3
h. Recall = 2/3
i. F1 = 2/3
j. Specificity = 4/5
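The measures can be checked in a few lines of Python. The dataset itself is not reproduced in this preview, so the label lists below are an assumed 8-instance example chosen to be consistent with the stated counts (TP=2, FP=1, TN=4, FN=1).

```python
# Hypothetical 8-instance test set consistent with the solution's counts.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision   = tp / (tp + fp)                                 # 2/3
recall      = tp / (tp + fn)                                 # 2/3 (= TPR)
fpr         = fp / (fp + tn)                                 # 1/5
f1          = 2 * precision * recall / (precision + recall)  # 2/3
specificity = tn / (fp + tn)                                 # 4/5 (= 1 - FPR)

print(tp, fp, tn, fn, round(precision, 3), round(recall, 3), round(f1, 3))
```

Since precision and recall are equal here (both 2/3), their harmonic mean F1 is also 2/3.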
Exercise 2: Predictive modelling: process
7. Can classification techniques such as decision trees and neural networks be used to predict continuous values? Can regression techniques be used to classify discrete classes (binary, nominal and ordinal)?
Exercise 3: Decision trees: introduction
From the given example, which attribute splits the samples into subsets well?
Solution: a. The rules can separate the target classes. In b, however, every rule leads to a 50-50 split, which is equivalent to random prediction.
Exercise 3: Decision trees: introduction
1. Based on the given decision tree:
a. Identify the accuracy for the following instances.
Solution: a. 8/8
Exercise 3: Decision trees: introduction
1. Based on the given decision tree:
b. Identify the accuracy for the following instances.
Solution: b. 3/4
Exercise 3: Decision trees: introduction
1. Based on the given decision tree:
c. How many rules are there in the decision tree?
Solution: the tree generates 3 rules in total.
Exercise 3: Decision trees: introduction
2. Draw a decision tree to represent the following Boolean function: A and not B.
The logic tells us that the decision is true only when A is true (1) and B is false (0). To draw a decision tree for this Boolean function, first construct the decision (truth) table; the tree can then be built from that table.
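Since the slide's table and tree images are not included in this preview, the tree can be sketched as nested conditionals: split on A at the root, and split on B only in the A=True branch.

```python
# The Boolean function "A and not B" expressed as a decision tree:
# root split on A; only the A=True branch needs a further split on B.
def tree_decision(a: bool, b: bool) -> bool:
    if a:            # root node: test A
        if b:        # internal node on the A=True branch: test B
            return False
        return True  # leaf: A=1, B=0 -> True
    return False     # leaf: A=0 -> always False

# Truth table for the function.
for a in (False, True):
    for b in (False, True):
        print(int(a), int(b), int(tree_decision(a, b)))
```

The printed truth table confirms the function is 1 only in the row A=1, B=0.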
Exercise 3: Decision trees: introduction
3. Consider the InsuranceInfo relation shown in the training data. It contains information about an insurance company's marketing campaign. The first three columns show the Refund, Marital Status and Income of a potential customer, and the last column (Cheat) shows whether the person has cheated the company.
a. Construct a decision tree from this data that helps predict whether a person is going to cheat the company.
Exercise 3: Decision trees: introduction
3. InsuranceInfo (cont.)
b. Test your decision tree on the given test data (Table 2) and calculate the classification accuracy of your constructed tree.
Solution: b. Classification accuracy on the test data (Table 2) is 66.6%. Instances 3 and 4 are incorrectly classified by the generated tree.
Exercise 3: Decision trees: introduction
3. InsuranceInfo (cont.)
c. What heuristic is used in constructing the decision tree during training?
Solution: c. The heuristic of selecting the attribute that will best separate the samples into individual classes is used.
Exercise 3: Decision trees: introduction
3. InsuranceInfo (cont.)
d. Do we require any domain knowledge to build this tree?
e. What is the general trend of a fraudster? What type of customer is likely to cheat the company?
f. Given the following test data, what will be the answer?
Exercise 3: Decision trees: introduction
Solution: d. No domain knowledge is required to build this tree.
Solution: e. The tree suggests that likely cheaters satisfy one of:
- Marital status = Single
- Marital status = Married and Income = High and Refund = No
- Marital status = Divorced and Income = Medium
Exercise 3: Decision trees: introduction
f. Given the following test data, what will be the answer?
Exercise 3: Decision trees: introduction
4. For the given decision tree, convert it into rules.
Solution:
Rule 1: If business appointment = No and temp above 70 = No, then decision = wear jeans
Rule 2: If business appointment = No and temp above 70 = Yes, then decision = wear shorts
Rule 3: If business appointment = Yes, then decision = wear slacks
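Each root-to-leaf path becomes one rule, so the three rules can be written as a first-match decision function. The string encoding of the two attributes is an assumption for illustration; the Yes/No values mirror the slide's labels.

```python
# The three extracted rules as a first-match decision function.
# Attribute encoding (Yes/No strings) is a hypothetical choice.
def decide(business_appointment: str, temp_above_70: str) -> str:
    if business_appointment == "Yes":    # Rule 3: appointment overrides temperature
        return "wear slacks"
    if temp_above_70 == "Yes":           # Rule 2: no appointment, warm day
        return "wear shorts"
    return "wear jeans"                  # Rule 1: no appointment, cool day

print(decide("No", "No"))    # wear jeans
print(decide("No", "Yes"))   # wear shorts
print(decide("Yes", "No"))   # wear slacks
```

Because the rules come from a tree, exactly one rule fires for every input, so the order of checks only matters for readability.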
Exercise 3: Decision trees: introduction
5. Assume you own a widget factory and want to understand which kinds of customers buy your widgets. Below is a very simple example data set with columns for the attributes Name, Salary, Sex, Age and Buy widget.
a. Find the general trends in widget buying. Construct a decision tree.
b. We would like to know the effect of age on the buying pattern. Construct a decision tree. At what ages will customers typically buy the widgets?
c. Given the test data in (b), what will be the answer?
Exercise 3: Decision trees: introduction
Solution:
a. Only males with a high salary buy the widgets.
b. Males with a high salary aged 57 or younger, and males with a low salary older than 40, will buy the widgets.
c.
Time for the computer exercise (~70 min): "Tutorial_W10.ipynb" from QUT's Canvas
Image: https://pixabay.com/en/photos/school/?cat=industry
Predictive Mining
Dm_tools.py
Evaluating the test dataset
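The slide content for this step is not in the preview, so here is a minimal sketch of evaluating a fitted decision tree on a held-out test set with scikit-learn. Synthetic data stands in for veteran.csv, whose columns are not shown here.

```python
# Minimal sketch: fit a decision tree and evaluate it on a held-out test set.
# make_classification is a synthetic stand-in for the tutorial's veteran.csv.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Evaluate only on data the model never saw during training.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Capping `max_depth` is one simple guard against the overfitting discussed in Exercise 2; an unconstrained tree can memorize the training set while scoring much worse on the test set.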
Understanding the importance of variables
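With scikit-learn, a fitted tree exposes impurity-based variable importances via `feature_importances_`. The feature names below are hypothetical placeholders; in the tutorial they would come from the dataset's columns.

```python
# Sketch: rank variables by a fitted tree's impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
names = [f"var_{i}" for i in range(X.shape[1])]   # hypothetical column names

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Importances sum to 1; a larger value means the variable contributed more
# impurity reduction across the tree's splits.
ranked = sorted(zip(names, tree.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

Note that impurity-based importances can favour high-cardinality features; permutation importance on a test set is a common cross-check.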
In what sense is the optimal tree best?
- The tree has the lowest or near-lowest cost as determined by a test procedure.
- The tree should exhibit very similar accuracy when applied to new data.
BUT the tree is NOT necessarily unique; other tree structures may be as good. Other tree structures might be found by:
- using other learning data
- using other growing rules
- running CV with a different random-number seed
Some variability of results is to be expected; the overall story should be similar for good-sized data sets.
Cross-Validation
- The purpose is to protect against overfitting errors.
- Ideally we would use large test data sets (N > 5000) to evaluate trees.
- Practically, some studies don't have sufficient data to spare for testing.
- Cross-validation uses the SAME data for learning and for testing.
Cross-Validation Procedure
[Diagram: the data is split into 10 folds; in each round one fold (fold 1, then fold 2, and so on) is held out for testing while the remaining nine folds are used for learning.]
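The 10-fold procedure can be sketched with scikit-learn's `KFold`: each fold serves exactly once as the test set while the other nine are used for learning. The 50-row array is a hypothetical stand-in for a real dataset.

```python
# 10-fold cross-validation indices matching the procedure above.
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(50)                     # hypothetical 50-row dataset
kf = KFold(n_splits=10, shuffle=True, random_state=1)

test_blocks = []
for i, (learn_idx, test_idx) in enumerate(kf.split(data), start=1):
    test_blocks.append(set(test_idx))
    print(f"Round {i}: learn on {len(learn_idx)} rows, test on {len(test_idx)} rows")

# Every row appears in exactly one test fold, so the whole dataset is
# eventually used for testing without ever testing on training rows.
all_test = set().union(*test_blocks)
print(len(all_test))  # 50
```

In practice one would fit the tree on `learn_idx` rows and score it on `test_idx` rows in each round, then average the 10 scores (or use `cross_val_score` to do all of this in one call).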
Next week: Regression