ISYE 521 Preliminary Report
Diabetes Risk Factors
Abu Syed, Sili Zeng, Hitesh Narra
Problem statement
In our project, we will focus on determining the most important risk factors for assessing an
individual's diabetes risk using machine learning. By identifying these key factors, we aim to deepen our own understanding and to educate others who may be unfamiliar with the disease. This is particularly significant given the high prevalence of diabetes in the United States.
Data
The diabetes prevalence and indicator data are sourced from the 2015 Behavioral Risk Factor Surveillance System (BRFSS) Survey Data and Documentation, modified by Md. Shohanur Islam Sobuj and his team into CSV files published under the title "Diabetes Health Indicators Dataset (2021)." This dataset comprises over 250,000 survey responses and
encompasses 22 risk factors and indicators for predicting diabetes. Our group's initial task
involves an exhaustive exploration of all risk factors within the dataset. We will employ
exploratory data analysis techniques to refine the features and ascertain the most influential
factors in diabetes prediction.
Methods
The group initially considered linear and logistic regression for creating an explanatory model that helps us understand the relationships between risk factors and the presence of diabetes. We have
some features that represent categories like age, income, education, and BMI, while the rest
are binary (yes/no) variables.
We ruled out linear regression because it isn't well suited to a binary outcome like ours. Additionally, we don't plan to use regularization, since our goal is explanation, not prediction. Instead, we chose logistic regression, which handles binary variables well and keeps the predicted probability of having diabetes between 0 and 1.
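For reference, the standard logistic regression form (written here with generic coefficients, not estimates from our data) shows why the output always lies between 0 and 1:

$$P(\text{diabetes}=1 \mid x_1,\dots,x_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$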
To get insights into the important features, we're also using a decision tree method called
CART. This creates a tree that shows which factors are significant in predicting diabetes. To
reduce overfitting, we plan to apply a technique called Random Forest.
In a nutshell, we're using logistic regression and decision trees to understand what factors
influence diabetes, especially in a binary context, and we're using Random Forest to help
make our model more reliable.
Initial Findings
Exploratory Data Analysis
We began by checking for missing data in our large dataset using Python. Fortunately, we found that the dataset had no missing values.
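A minimal sketch of this check with pandas (the CSV file name here is an assumption, not the exact name we used):

```python
import pandas as pd

# Load the survey data (file name is illustrative)
df = pd.read_csv("diabetes_health_indicators.csv")

# Count missing values per column; all zeros means no missing data
missing_counts = df.isnull().sum()
print(missing_counts)
print("Total missing values:", missing_counts.sum())
```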
Next, we used a pairwise correlation matrix to identify highly correlated features within our
dataset. Removing highly correlated features is crucial to enhance our model's performance
because these features often convey similar information and may represent the same
underlying relationships. In the correlation matrix, values closer to 1 indicate a strong positive correlation, values closer to -1 indicate a strong negative correlation, and values near 0 indicate little or no linear relationship. Values between roughly 0.3 and 0.7 in absolute value indicate moderate correlations.
In our dataset, we observed that many of the correlation coefficients fell in the range of 0.3 to
0.7, signifying moderate correlations among features. This guided our feature selection
process to improve the overall performance of our model.
Figure 1. Correlation between features
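A minimal sketch of how a pairwise correlation matrix like the one in Figure 1 can be computed and plotted (the file name and plot styling are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name

# Pairwise Pearson correlations between all columns
corr = df.corr()

# Heatmap of the correlation matrix (similar in spirit to Figure 1)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between features")
plt.tight_layout()
plt.show()
```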
In our future work, we plan to identify and assess highly correlated features. This step is crucial because logistic regression assumes little or no multicollinearity among its input variables. For instance, we've already observed a correlation of 0.52 between general health and physical health. As we continue to enhance our models, it's likely that we will remove one of these correlated features so that this assumption holds. This will contribute to the robustness and accuracy of our models.
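A sketch of how such highly correlated pairs could be flagged automatically, assuming an illustrative 0.5 cutoff (the general health/physical health pair at 0.52 would be caught by it):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name
corr = df.corr()

threshold = 0.5  # illustrative cutoff, not a final modelling decision

# Keep only the upper triangle so each feature pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()  # (feature_a, feature_b) -> correlation
high_pairs = high_pairs[high_pairs.abs() > threshold].sort_values(ascending=False)
print(high_pairs)
```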
Feature Selection
To enhance our feature selection process, we employed CART decision trees and leveraged
the power of Random Forest. Using CART, we constructed a decision tree that revealed the
importance of various features in predicting diabetes risk. The tree visually showcased which
factors played a significant role in the decision-making process. To further enhance model
reliability and reduce overfitting, we turned to Random Forest, a technique that combines
multiple decision trees to make more robust predictions. This method helped us identify and
prioritize the most critical features while providing a more comprehensive view of the
relationships between risk factors and diabetes presence. Together, these techniques enriched
our feature selection strategy, contributing to a more informed and effective modelling
approach.
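A minimal sketch of these two steps with scikit-learn ("Diabetes_binary" is the target column from our dataset; the file name, tree depth, and number of trees are illustrative assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name
X = df.drop(columns=["Diabetes_binary"])
y = df["Diabetes_binary"]

# CART: a single decision tree, kept shallow so it is easy to read
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
plt.figure(figsize=(16, 8))
plot_tree(cart, feature_names=list(X.columns), class_names=["No diabetes", "Diabetes"], filled=True)
plt.show()

# Random Forest: many trees whose averaged importances are more stable
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```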
Logistic Regression
When conducting logistic regression on our dataset, we initially used all 22 features as
independent variables, with "Diabetes_binary" as the dependent variable (0 for no diabetes, 1
for diabetes), and divided the data into an 80/20 split for training and testing. The initial model achieved an accuracy of 0.86 in predicting whether an individual has diabetes. Subsequently, we
employed feature selection to choose the ten most important features: BMI, Age, Income,
Physical Health, General Health, Education, Mental Health, Blood Pressure, Smoking, and
Fruit Consumption. It's worth noting that some of these features exhibited correlations, which
we plan to address in future iterations of the model. With these ten selected features, we