Beteta - Project Two MAT 303
Date: Apr 3, 2024
MAT 303 Project Two Summary Report
Diego Beteta
diego.beteta@snhu.edu
Southern New Hampshire University
1. Introduction
In this project, we are exploring a dataset related to heart disease, which includes various health indicators such as age, sex, types of chest pain, blood pressure, cholesterol levels, fasting blood sugar, and maximum heart rate achieved. The primary goal is to use statistical models to predict the risk of heart disease and certain cardiovascular metrics like maximum heart rate. To achieve this, we'll employ logistic regression for binary classification, determining the likelihood of heart disease presence, and random forest models for classifying the risk of heart disease and predicting continuous variables like maximum heart rate. The results from these analyses could potentially be used in a clinical setting to assist healthcare professionals in identifying individuals at higher risk for heart disease, enabling earlier and more targeted interventions.
2. Data Preparation
In this heart disease dataset, key variables include age, sex, types of chest pain (cp), resting blood pressure (trestbps), cholesterol levels (chol), fasting blood sugar (fbs), resting electrocardiographic measurements (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), ST depression (oldpeak), the slope of the peak exercise ST segment, the number of major vessels (ca), and thalassemia status (thal). Each variable provides insights into an individual's cardiovascular health and risk factors for heart disease. The dataset comprises 303 rows representing individual patient records and 14 columns, each corresponding to one of the mentioned variables, including the target variable that indicates the presence or absence of heart disease.
3. Model #1 - First Logistic Regression Model
Reporting Results
The general form and the prediction equation of the multiple logistic regression model for heart disease (target), using the variables age (age), resting blood pressure (trestbps), exercise-induced angina (exang), and maximum heart rate achieved (thalach), are as follows:
General Form:
$$E(y) = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4}}$$
Prediction Equation:
$$\ln(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$
Substituting the estimated coefficients, with $x_1$ = age, $x_2$ = trestbps, $x_3$ = exang, and $x_4$ = thalach, gives the fitted model in probability form:
$$E(y) = \frac{e^{-1.0211 - 0.0175 x_1 - 0.0149 x_2 - 1.625 x_3 + 0.0311 x_4}}{1 + e^{-1.0211 - 0.0175 x_1 - 0.0149 x_2 - 1.625 x_3 + 0.0311 x_4}}$$
In the context of the logistic regression model for predicting heart disease, the terms π and π/(1 − π) have specific meanings related to the probability of an individual having heart disease:
π: This represents the probability of an individual having heart disease. It is a value between 0 and 1, where 0 means no chance of heart disease and 1 means heart disease is certain. In the context of our model, π is what we are trying to predict based on the input variables (age, resting blood pressure, exercise-induced angina, and maximum heart rate achieved).
π/(1 − π): This is known as the odds. It compares the likelihood of having heart disease (π) to the likelihood of not having heart disease (1 − π). For example, if π = 0.5, the odds π/(1 − π) equal 1, meaning that having heart disease is exactly as likely as not having it. If π > 0.5, the odds are greater than 1, indicating a higher likelihood of heart disease. Conversely, if π < 0.5, the odds are less than 1, suggesting a lower likelihood of heart disease.
In logistic regression, we use the natural logarithm of the odds (the log odds, or logit) as the outcome variable. The model estimates these log odds as a linear combination of the predictor variables.
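The relationship between π, the odds, and the log odds can be illustrated with a minimal Python sketch. The function names are illustrative and do not come from the original analysis; the probabilities in the loop are arbitrary example values.

```python
import math

def odds(p):
    """Odds of the event: p / (1 - p), defined for 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (the logit), defined for 0 < p < 1."""
    return math.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary example probabilities, chosen only for illustration.
for p in (0.25, 0.5, 0.75):
    print(f"pi={p:.2f}  odds={odds(p):.3f}  log-odds={log_odds(p):+.3f}")
```

A probability of 0.5 maps to odds of 1 and log odds of 0, and `inverse_logit` undoes `log_odds`, which is why a linear model on the log-odds scale can always be converted back to a probability.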
Prediction Model Equation:
$$\ln(\text{odds}) = -1.0211 - 0.0175\,\text{age} - 0.0149\,\text{trestbps} - 1.625\,\text{exang1} + 0.0311\,\text{thalach}$$
The estimated coefficient for the maximum heart rate achieved (thalach) in our logistic regression model is 0.031095. This indicates that in our model, for each one-unit increase in maximum heart rate (measured in beats per minute), the log odds of having heart disease increase by 0.031095. This positive coefficient suggests that higher maximum heart rates are associated with a greater likelihood of heart disease in our dataset. It's crucial to interpret this finding within the context of our model and other influencing factors; a higher heart rate is not a direct cause of heart disease but shows an association in the population we studied. This interpretation should be integrated with clinical insights and other relevant variables for a comprehensive understanding.
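A coefficient on the log-odds scale is often easier to read after exponentiating it, which turns it into a multiplicative change in the odds. The sketch below applies that conversion to the reported thalach coefficient; the 10 bpm comparison is an added illustration, not a figure from the report, and it assumes the other predictors are held fixed.

```python
import math

beta_thalach = 0.031095  # estimated coefficient reported in the text

# Multiplicative change in the odds for a 1 bpm increase in thalach.
odds_ratio_per_bpm = math.exp(beta_thalach)

# Same idea for a 10 bpm increase (illustrative step size, not from the report).
odds_ratio_per_10_bpm = math.exp(10 * beta_thalach)

print(f"odds ratio per 1 bpm:  {odds_ratio_per_bpm:.4f}")
print(f"odds ratio per 10 bpm: {odds_ratio_per_10_bpm:.4f}")
```

Under this model, each additional beat per minute multiplies the odds of heart disease by roughly 1.03, that is, about a 3% increase in the odds.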
Evaluating Model Significance
The Hosmer-Lemeshow test statistic is based on a chi-square distribution, and the degrees of freedom are typically calculated as the number of groups minus 2. The test resulted in a chi-square value of 44.622 with 48 degrees of freedom, leading to a P-value of 0.612.
Because the P-value is well above the 0.05 (5%) significance level, we do not reject the null hypothesis that the model fits the data. In other words, there is no substantial evidence of a lack of fit. According to the Hosmer-Lemeshow goodness-of-fit test, our model seems suitable for the data.
Such a result is favorable in the context of logistic regression analysis, as it indicates that the predictions made by our model are in good agreement with the actual observed data across various subgroups.
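The reported P-value can be checked from the test statistic and degrees of freedom alone. The sketch below uses a closed form for the chi-square upper tail that holds only for even degrees of freedom, so it needs no statistics library. The function name is hypothetical, and this is not necessarily how the original analysis computed the value.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2.0
    term = 1.0   # k = 0 term of the sum
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# Test statistic and degrees of freedom reported in the text.
p_value = chi2_sf_even_df(44.622, 48)
print(f"Hosmer-Lemeshow p-value = {p_value:.3f}")
print("reject H0 at 5%" if p_value < 0.05 else "do not reject H0 at 5%")
```

The result should land close to the reported 0.612, which is far above 0.05 and is consistent with the decision not to reject the null hypothesis.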
To determine which terms in the logistic regression model are significant based on Wald's test, we look at the p-values associated with each term. Wald's test evaluates the significance of each coefficient in the model. At a 5% level of significance, a term is considered significant if its p-value is less than 0.05.
Age (age): p-value = 0.3060. This is greater than 0.05, meaning age is not statistically significant at the 5% level.
Resting Blood Pressure (trestbps): p-value = 0.0741. This is also greater than 0.05, indicating that resting blood pressure is not statistically significant at the 5% level.
Exercise-Induced Angina (exang1): p-value ≈ 0.000000107 (1.07e-07). This is much less than 0.05, meaning exercise-induced angina is statistically significant at the 5% level.
Maximum Heart Rate Achieved (thalach): p-value ≈ 0.0000192 (1.92e-05). This is also much less than 0.05, indicating that maximum heart rate achieved is statistically significant at the 5% level.
Based on Wald's test at the 5% significance level, the terms 'exercise-induced angina' and 'maximum heart rate achieved' are significant in the model. However, the terms 'age' and 'resting blood pressure' are not significant at this level.
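The decision rule applied to each term above can be written as a short sketch. The p-values are the ones reported in the text; the dictionary and variable names are illustrative only.

```python
ALPHA = 0.05  # 5% significance level used in the text

# Wald test p-values as reported for each model term.
wald_p_values = {
    "age": 0.3060,
    "trestbps": 0.0741,
    "exang1": 1.07e-07,
    "thalach": 1.92e-05,
}

# A term is significant when its p-value is below the chosen level.
significant = [term for term, p in wald_p_values.items() if p < ALPHA]
not_significant = [term for term, p in wald_p_values.items() if p >= ALPHA]

print("significant:    ", significant)
print("not significant:", not_significant)
```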
Based on the confusion matrix from our model, here are the counts for true positives, true negatives, false positives, and false negatives:
True Negatives (TN): 89 - The model correctly predicted 'no heart disease' (target = 0) in 89 cases.
False Positives (FP): 49 - The model incorrectly predicted 'heart disease' (target = 1) in 49 cases without heart disease.
False Negatives (FN): 31 - The model incorrectly predicted 'no heart disease' in 31 cases with heart disease.
True Positives (TP): 134 - The model correctly predicted 'heart disease' (target = 1) in 134 cases.
Based on the confusion matrix of our model, the following metrics are calculated:
Accuracy: Approximately 73.60%. This means that the model correctly predicts whether heart disease is present or not in about 73.60% of cases.
Precision: About 73.22%. This indicates that when the model predicts heart disease, it is correct 73.22% of the time.
Recall (Sensitivity): Approximately 81.21%. This means that the model correctly identifies 81.21% of the actual cases of heart disease.
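These three metrics follow directly from the four confusion matrix counts. A minimal sketch of the arithmetic, using the counts reported above and the standard definitions of each metric:

```python
# Confusion matrix counts reported in the text.
tn, fp, fn, tp = 89, 49, 31, 134

total = tn + fp + fn + tp

accuracy = (tp + tn) / total   # share of all cases classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are real
recall = tp / (tp + fn)        # share of real positives that are detected

print(f"n         = {total}")
print(f"accuracy  = {accuracy:.2%}")
print(f"precision = {precision:.2%}")
print(f"recall    = {recall:.2%}")
```

The four counts sum to 303, which matches the number of patient records in the dataset.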
The ROC curve depicts the True Positive Rate (TPR, or sensitivity) versus the False Positive Rate (FPR, or 1 - specificity) across various thresholds. A curve that approaches the top left corner reflects a desirable high TPR and a low FPR, indicating that the model is highly adept at differentiating between the classes of interest, in this case, those with heart disease and those without.
The AUC for this model is approximately 0.8007, which is considered very good. This value signifies the likelihood that the model will accurately rank a randomly chosen positive instance (a case with heart disease) higher in terms of risk than a randomly chosen negative instance (a case without heart disease).
An AUC near 1.0 is indicative of a model with excellent predictive prowess.
The ROC curve is instrumental in evaluating the performance of a classification model, especially when the balance between sensitivity and specificity is critical. A high AUC value, as seen with this model, suggests that it can distinguish between individuals with and without heart disease.
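The ranking interpretation of the AUC can be made concrete with a small sketch: the AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as half. The scores below are hypothetical and are not taken from the report's data; the function name is illustrative.

```python
def pairwise_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities, for illustration only.
pos_scores = [0.9, 0.7, 0.6, 0.4]  # cases with heart disease
neg_scores = [0.8, 0.5, 0.3, 0.2]  # cases without heart disease

auc = pairwise_auc(pos_scores, neg_scores)
print(f"AUC on the toy scores = {auc:.2f}")
```

In this toy example, 12 of the 16 pairs are ranked correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of cases, which is fine for a sketch but slow on large datasets.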
Making Predictions Using Model
1. The probability of an individual having heart disease at 50 years old, with a resting blood pressure of 122, exercise-induced angina, and a maximum heart rate of 140, is 27.16%.
A probability of 27.16% indicates a moderate risk of heart disease for an individual with this profile, which highlights the importance of considering these factors when assessing heart disease risk.
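This prediction can be reproduced approximately by plugging the profile into the fitted equation. The sketch below uses the rounded coefficients shown in the prediction model equation, so the result may differ from the reported 27.16% in the third decimal place. The function and dictionary names are illustrative.

```python
import math

# Rounded coefficients from the prediction model equation in the text.
COEF = {
    "intercept": -1.0211,
    "age": -0.0175,
    "trestbps": -0.0149,
    "exang1": -1.625,
    "thalach": 0.0311,
}

def predict_probability(age, trestbps, exang, thalach):
    """Predicted probability of heart disease; exang is 1 if present, else 0."""
    z = (
        COEF["intercept"]
        + COEF["age"] * age
        + COEF["trestbps"] * trestbps
        + COEF["exang1"] * exang
        + COEF["thalach"] * thalach
    )
    return 1.0 / (1.0 + math.exp(-z))  # inverse logit of the log odds

p = predict_probability(age=50, trestbps=122, exang=1, thalach=140)
print(f"predicted probability = {p:.2%}")
```

Setting `exang=0` for the same profile raises the predicted probability, which is consistent with the negative exang1 coefficient in the fitted model.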