First of all, it is more reasonable to compare models that are built on the same data, so I applied the same variables and the same missing-value treatment approach (with the exception of the decision tree) to all of the models.
All three models showed nearly the same quality of performance according to the various lift charts produced and presented in the later parts of this report.
However, the difference becomes more evident in the % captured response chart, where the most efficient and useful model turns out to be the logistic regression model.
It is described in greater detail in part 4 of this report.
This ROC plot indicates that the logistic regression model is also efficient in terms of the trade-off between the true positive rate (sensitivity) and the false positive rate.
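For illustration only, here is a minimal Python sketch of how the ROC trade-off and its summary AUC could be computed from a fitted model's predicted probabilities; the arrays y_valid and p_valid are hypothetical placeholders, not the actual validation data behind the plot discussed above.

```python
# Illustrative sketch: computing the ROC trade-off for a fitted model.
# `y_valid` (true 0/1 labels) and `p_valid` (predicted probabilities of the
# positive class) are placeholders standing in for a real validation set.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=200)                         # placeholder labels
p_valid = y_valid * 0.3 + rng.random(200) * 0.7                # placeholder scores

fpr, tpr, thresholds = roc_curve(y_valid, p_valid)             # trade-off points
auc = roc_auc_score(y_valid, p_valid)                          # area under the ROC curve
print(f"AUC = {auc:.3f}")
```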
2. Recommended Model - Decision Tree
The recommended decision tree model includes two variables: annual income and loans. Both are interval variables and represent the original observations. They were chosen for the final model because, after several trials, they proved to be the key variables in determining the rules within the decision tree.
In terms of missing values, nothing particular had to be done, because decision trees conveniently handle missing values by default.
As for the splitting criterion, after learning more about each of the criteria and performing numerous trials, Gini was chosen because of its ability to measure the differences between the values of a frequency distribution.
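As a rough sketch of this setup (not the tool actually used for the report), a Gini-criterion tree on the two interval inputs could be fit as follows; the DataFrame credit and its columns annual_income, loans, and bad are hypothetical names and values.

```python
# Illustrative sketch: a decision tree with the Gini splitting criterion on the
# two interval inputs named above. Data values and column names are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

credit = pd.DataFrame({
    "annual_income": [32_000, 54_000, 21_000, 87_000, 43_000, 29_000],
    "loans":         [2, 1, 4, 0, 3, 5],
    "bad":           [0, 0, 1, 0, 1, 1],          # 1 = bad customer
})

tree = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=4, random_state=1)
tree.fit(credit[["annual_income", "loans"]], credit["bad"])

new_applicant = pd.DataFrame({"annual_income": [40_000], "loans": [2]})
print(tree.predict(new_applicant))                 # classify a new applicant
```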
Presented below is the model assessment graph that represents the misclassification rates at each number of leaves.
As can be seen from the graph, this model narrows the gap between the training and validation misclassification rates compared with other configurations in which different settings were used and different variables were included.
Another indicator of this model’s usefulness is the lift value graph. The baseline represents what we would achieve without a prediction model, while the intercept of the red line shows that with this decision tree we can identify 3.7% more bad customers than we would have done without it.
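The quantity behind such a lift chart can be sketched as the share of all bad customers captured in the top-scored portion of the data; the following minimal example uses synthetic placeholder labels and scores rather than the report's data.

```python
# Illustrative sketch: % captured response in the top decile, the quantity that
# drives the lift comparison above. `y` and `score` are synthetic placeholders.
import numpy as np

def captured_response(y, score, top_fraction=0.1):
    """Share of all bad customers found in the top `top_fraction` of scores."""
    order = np.argsort(score)[::-1]                # best scores first
    n_top = int(len(y) * top_fraction)
    return y[order][:n_top].sum() / y.sum()

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                  # 1 = bad customer
score = y * 0.2 + rng.random(1000)                 # placeholder model scores
print(f"Captured in top decile: {captured_response(y, score):.1%}")
```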
The %
Criterion that must be met: Consistency. This is important when comparing data, to make sure the data being compared were prepared in the correct way and in the same way each time.
Fifty credit customers were selected for data collection on five variables: location, income, size, years, and credit balance. In order to understand more about its customers, AJ DAVIS must use graphical and numerical summaries to interpret the data and better expand its business in the future.
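A minimal sketch of such a numerical summary is shown below; the values are placeholders rather than AJ DAVIS's actual 50 records, and the column names simply mirror the five variables listed above.

```python
# Illustrative sketch: numerical and categorical summaries for the five variables.
# The rows here are placeholder values, not the actual customer records.
import pandas as pd

customers = pd.DataFrame({
    "location":       ["Urban", "Rural", "Suburban", "Urban", "Rural"],
    "income":         [54.2, 31.8, 47.5, 62.0, 28.9],     # in $1000s
    "size":           [3, 4, 2, 5, 3],                    # household size
    "years":          [12, 5, 9, 20, 3],
    "credit_balance": [4016, 2890, 3583, 5100, 2448],
})

print(customers.describe())                  # numerical summary
print(customers["location"].value_counts())  # categorical summary
```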
The training and test samples are selected based on the ground truth of the original image of AVIRIS and HYDICE data.
Instead, we use the original predictors to predict the response. The original dataset was split into a training set consisting of 75% of the observations and a test set consisting of the remaining 25%; observations were assigned at random. Supervised learning methods were fit on the training set to obtain a model, and the model was then applied to the test set to assess prediction performance. The value of “K” in KNN was tuned via cross-validation. Due to the volume of the data, the “cost” parameter in the SVM was chosen somewhat ad hoc, and the “mtry” parameter in the random forest was left at its default. The error rates are as follows.
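A minimal sketch of this workflow, assuming placeholder arrays X and y rather than the study's data, would look like the following: a random 75/25 split, K tuned by cross-validation on the training set only, and the error rate measured on the held-out test set.

```python
# Illustrative sketch of the split-and-tune workflow described above.
# `X` and `y` are synthetic placeholders for the study's predictors and response.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)          # 75% train / 25% test

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": list(range(1, 21))},
                    cv=5)                            # tune K by cross-validation
grid.fit(X_train, y_train)

print("best K:", grid.best_params_["n_neighbors"])
print("test error rate:", 1 - grid.best_estimator_.score(X_test, y_test))
```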
In conclusion, the logistic model fits the data better than the exponential model. Both describe the increasing growth rate over the first several trials, but only the logistic model captures the decreasing growth rate in the later trials.
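One way such a comparison could be carried out is to fit both curves by nonlinear least squares and compare their residual error; the sketch below uses synthetic placeholder data, not the study's measurements.

```python
# Illustrative sketch: fitting exponential and logistic growth curves to a series
# of counts over trials and comparing residual error. Data are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def exponential(t, a, r):
    return a * np.exp(r * t)

def logistic(t, K, a, r):
    return K / (1 + a * np.exp(-r * t))

t = np.arange(0, 15)
y = logistic(t, 100, 50, 0.8) + np.random.default_rng(0).normal(0, 2, t.size)

p_exp, _ = curve_fit(exponential, t, y, p0=[2.0, 0.3], maxfev=10000)
p_log, _ = curve_fit(logistic, t, y, p0=[y.max(), 10.0, 0.5], maxfev=10000)

sse_exp = np.sum((y - exponential(t, *p_exp)) ** 2)
sse_log = np.sum((y - logistic(t, *p_log)) ** 2)
print(f"SSE exponential: {sse_exp:.1f}, SSE logistic: {sse_log:.1f}")
```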
These measurements include the assessment of risk factors [61], quality of care [62], diagnostic criteria [63], etc. Most of these studies used rule-based methods [62, 63] to detect clearly defined and less complex (fewer expression variations) measurements, such as glucose level and body mass index. For more ambiguous and complex measurements, such as coronary artery disease and obesity status, machine learning combined with external terminologies [61] is often used.
Those three types of tests were combined to create new tests, but the results were all similar to those mentioned before.
For the variables with positive coefficients, a one-unit increase in the variable (holding the others fixed) is associated with an increase in the diabetes progression outcome. At the 0.05 significance level, the linear regression model selects five significant predictor variables: age, tc, ldl, tch, and glu. To validate the model assumptions, we can plot the residuals against the fitted values and check for any departures from a random scatter. The residual plot shows no such violations, and the MSE of the model is 3111.265. Next, we will use the best subset method to select the predictor variables that truly contribute to the model.
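The residual check and MSE calculation can be sketched as follows; this uses scikit-learn's bundled diabetes data as a stand-in for the study's dataset (its columns are standardized), so the exact coefficients and MSE reported above will not be reproduced.

```python
# Illustrative sketch: fit the linear model, inspect residuals vs. fitted values,
# and compute MSE. Uses the bundled diabetes data as a stand-in for the study's data.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)          # 10 standardized predictors
model = LinearRegression().fit(X, y)

fitted = model.predict(X)
residuals = y - fitted                          # plot residuals vs. fitted to check randomness
mse = np.mean(residuals ** 2)
print(f"MSE: {mse:.3f}")
```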
This study utilized the Worcester Heart Attack Study data and R Studio software to predict the mortality factors for heart attack patients. The medical data include physiological measurements about heart attack patients, which serve as the independent variables, such as heart rate, blood pressure, atrial fibrillation, body mass index, cardiovascular history, and other medical signs. This study employed supervised and unsupervised learning algorithms, using classification decision trees and k-means clustering, respectively. In addition to performing initial descriptive statistics to estimate the general range of critical factors correlated with heart attack patients, R Studio was used to determine the weight of each of the significant factors on the prediction in order to quantify its influence on the death of heart attack patients. Furthermore, the software was used to evaluate the accuracy of the predictive model for estimating the death of heart attack patients by using a confusion matrix to compare predictions with actual data. Finally, this study reflected on the effectiveness of the data mining software's conclusions, compared supervised and unsupervised learning, and conjectured improvements for future data mining investigations.
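The confusion-matrix evaluation step can be sketched as follows. This is an illustrative Python version (the study itself used R Studio), and the features and outcome here are synthetic placeholders rather than the Worcester Heart Attack Study variables.

```python
# Illustrative sketch (in Python; the study used R Studio): evaluating a
# classification tree's mortality predictions with a confusion matrix.
# `X` and `y` are synthetic placeholders for the patient features and outcome.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))                  # e.g. heart rate, BP, BMI, ...
y = (X[:, 0] - X[:, 2] + rng.normal(0, 0.5, 500) > 0).astype(int)  # 1 = died

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))            # rows: actual, columns: predicted
print("accuracy:", accuracy_score(y_te, pred))
```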
Longevity shouldn’t be a factor when you select your roofing, as both products score highly in this area. Instead, you can select an option based on your particular stylistic taste, overall roofing need, and of course your budget.
The latent class model (LCM) is gaining popularity in health care research. The LCM has an edge over other conventional models because it can incorporate one or more discrete unobserved variables. In addition, it does not depend on traditional assumptions (a linear relationship, normal distribution, homogeneity). In their study, Santos Silva and Windmeijer (2001) showed that the hurdle model is unable to separately identify the two decision processes. In health care utilization data, it is very hard to differentiate between different illness spells during a one-year period. The type of illness may affect both zero and positive outcomes, but zero-inflated models only take excess zeroes into account. Latent class models are able to capture this phenomenon (Deb and Trivedi).
How data mining can assist bankers in enhancing their businesses is illustrated in this example. Records that include information such as age, sex, marital status, occupation, number of children, etc., of the bank's customers over the years are used in the mining process. First, an algorithm is used to identify characteristics that distinguish customers who took out a particular kind of loan from those who did not. Eventually, it develops "rules" by which it can identify customers who are likely to be good candidates for such a loan. These rules are then used to identify such customers in the remainder of the database. Next, another algorithm is used to sort the database into clusters or groups of people with many similar attributes, in the hope that these might reveal interesting and unusual patterns. Finally, the patterns revealed by these clusters are interpreted by the data miners, in collaboration with bank personnel [4].
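The clustering step of this process could be sketched as follows; the customer attributes and values are hypothetical, and k-means is used here purely as one example of a clustering algorithm.

```python
# Illustrative sketch of the clustering step described above: grouping customer
# records with many similar attributes. Field names and values are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "age":        [25, 34, 52, 46, 61, 29],
    "n_children": [0, 2, 3, 1, 2, 0],
    "income":     [28_000, 51_000, 74_000, 62_000, 80_000, 31_000],
})

X = StandardScaler().fit_transform(customers)          # put attributes on one scale
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
customers["cluster"] = clusters                        # segments for analysts to interpret
print(customers)
```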
Logistic regression is also called logit regression or the logit model. It was developed by David Cox in 1958. Logistic regression is a method analogous to multiple linear regression; unlike multiple linear regression, however, the outcome or response variable is categorical and all-or-none, i.e., the dependent variable is dichotomous (binary). Here the response is defined as an indicator (dummy) variable.
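A minimal sketch of such a logit fit with a binary indicator response is shown below; the data are synthetic placeholders used only to demonstrate the form of the model.

```python
# Illustrative sketch: a logistic (logit) regression with a dichotomous response.
# The data are synthetic placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))        # true log-odds are linear in x
y = rng.binomial(1, p)                        # dichotomous (0/1) indicator response

X = sm.add_constant(x)
result = sm.Logit(y, X).fit(disp=0)           # maximum-likelihood logit fit
print(result.params)                          # intercept and slope on the log-odds scale
```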
A combination of data exploration, linear regression modeling, and many other insights helps Medicx achieve better results.