Models are best compared when they are built on the same data, so I applied the same variables and the same missing-value treatment to all of the models (with the exception of the decision tree).

All three models performed at nearly the same level, according to the lift charts presented later in the report.

However, the difference becomes more evident in the % captured response, where the logistic regression model turns out to be the most efficient and useful.

That model is described in greater detail in part 4 of this report.
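To make the % captured response measure concrete, here is a minimal sketch of how it could be computed. It assumes the response is a "bad customer" flag and that each model outputs a score per customer; the function name `captured_response` and the toy data are hypothetical and do not come from the report.

```python
def captured_response(scores, is_bad, fraction):
    """Percent of all bad customers found in the top `fraction` of customers,
    ranked by model score (highest score first).

    Hypothetical helper for illustration; not the report's own tooling.
    """
    total_bad = sum(is_bad)
    if total_bad == 0:
        raise ValueError("at least one responder is required")
    # Rank customers from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, is_bad), key=lambda pair: pair[0], reverse=True)
    k = round(len(ranked) * fraction)
    captured = sum(bad for _, bad in ranked[:k])
    return 100.0 * captured / total_bad


# Toy data (assumed): 10 customers, 3 of them bad.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_bad = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

top20 = captured_response(scores, is_bad, 0.2)
top40 = captured_response(scores, is_bad, 0.4)
```

A model that captures a larger share of the bad customers in the same top fraction is the more useful one for this comparison.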

The ROC plot indicates that the logistic regression is also efficient in terms of the trade-off between …

2. Recommended Model - Decision Tree

The recommended decision tree model includes two variables, annual income and loans. Both are interval variables and represent the original observations. They were chosen for the final model because, after several trials, they proved to be the key variables in determining the rules within the decision trees.

No special treatment of missing values was needed, because decision trees handle missing values by default.

For the splitting criterion, Gini was chosen after studying each of the available criteria and performing numerous trials, because of its ability to measure the differences between the values of a frequency distribution.

The model assessment graph below shows the misclassification rate for each number of leaves.
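As an illustration of the Gini criterion, the sketch below computes the Gini impurity of a node and the impurity reduction of a candidate split. This is the standard textbook definition, not necessarily the exact computation performed by the software used in this report; the function names and the toy labels are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_reduction(parent, left, right):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


# Toy node (assumed): 4 bad and 4 good customers, split into two branches.
parent = ["bad"] * 4 + ["good"] * 4
left = ["bad"] * 3 + ["good"] * 1
right = ["bad"] * 1 + ["good"] * 3

reduction = gini_reduction(parent, left, right)
```

A tree grown with this criterion picks, at each node, the split with the largest impurity reduction.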

As the graph shows, this model narrows the difference between the training and actual sets compared with the other settings and variable combinations that were tried.

Another indicator of this model’s usefulness is the lift value graph. The baseline represents the absence of a prediction model, while the intercept of the red line shows that with this decision tree we can identify 3.7% more bad customers than we would without it.
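The lift value behind such a graph can be sketched as follows. This is a minimal illustration under stated assumptions: the model assigns each customer a score, lift is taken as the bad-customer rate in the top-scored group divided by the overall rate, and the name `lift_at` and the toy data are hypothetical.

```python
def lift_at(scores, is_bad, fraction):
    """Lift in the top `fraction` of customers ranked by score.

    A value of 1.0 corresponds to the baseline (no model); values above 1.0
    mean the model concentrates bad customers in the top-scored group.
    """
    base_rate = sum(is_bad) / len(is_bad)
    if base_rate == 0:
        raise ValueError("at least one bad customer is required")
    # Rank customers from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, is_bad), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(bad for _, bad in ranked[:k]) / k
    return top_rate / base_rate


# Toy data (assumed): 10 customers, 3 of them bad.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_bad = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top20 = lift_at(scores, is_bad, 0.2)
lift_all = lift_at(scores, is_bad, 1.0)
```

When the whole population is selected, the lift falls back to the baseline value of 1.0, which is why lift curves converge on the baseline at the right-hand edge of the chart.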

The %
