ISYE 521 Preliminary Report
Diabetes Risk Factors
Abu Syed, Sili Zeng, Hitesh Narra
Problem statement
In our project, we will focus on determining the most important risk factors for assessing an
individual's diabetes risk using machine learning. By identifying these key factors, we aim to deepen our own understanding and to educate others who may be unfamiliar with the disease. This is particularly significant given the high prevalence of diabetes in the United States.
Data
The diabetes prevalence and indicator data are sourced from the 2015 Behavioral Risk Factor Surveillance System (BRFSS) Survey Data and Documentation, modified by Md. Shohanur Islam Sobuj and his team into CSV files published under the title "Diabetes Health Indicators Dataset (2021)." This dataset comprises over 250,000 survey responses and
encompasses 22 risk factors and indicators for predicting diabetes. Our group's initial task
involves an exhaustive exploration of all risk factors within the dataset. We will employ
exploratory data analysis techniques to refine the features and ascertain the most influential
factors in diabetes prediction.
Methods
The group initially considered linear and logistic regression for creating an explanatory model that helps us understand the relationships between risk factors and the presence of diabetes. We have
some features that represent categories like age, income, education, and BMI, while the rest
are binary (yes/no) variables.
We ruled out linear regression because it isn't well suited to a binary outcome like ours. Additionally, we don't plan to use regularization, since our goal is explanation, not prediction. Instead, we chose logistic regression, which handles binary variables well and keeps the predicted probability of having diabetes between 0 and 1.
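For reference, the standard logistic regression form (written here with generic coefficients, not estimates from our data) shows why the output always lies between 0 and 1:

$$P(\text{diabetes}=1 \mid x_1,\dots,x_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$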
To get insights into the important features, we're also using a decision tree method called
CART. This creates a tree that shows which factors are significant in predicting diabetes. To
reduce overfitting, we plan to apply a technique called Random Forest.
In a nutshell, we're using logistic regression and decision trees to understand what factors
influence diabetes, especially in a binary context, and we're using Random Forest to help
make our model more reliable.
Initial Findings
Exploratory Data Analysis
We began by checking for missing data in our large dataset using Python. Fortunately, we found that the dataset had no missing values.
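A minimal sketch of this check with pandas (the CSV file name here is an assumption, not the exact name we used):

```python
import pandas as pd

# Load the survey data (file name is illustrative)
df = pd.read_csv("diabetes_health_indicators.csv")

# Count missing values per column; all zeros means no missing data
missing_counts = df.isnull().sum()
print(missing_counts)
print("Total missing values:", missing_counts.sum())
```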
Next, we used a pairwise correlation matrix to identify highly correlated features within our
dataset. Removing highly correlated features is crucial to enhance our model's performance
because these features often convey similar information and may represent the same
underlying relationships. In the correlation matrix, values closer to 1 indicate a strong positive correlation, values closer to -1 indicate a strong negative correlation, and values near 0 indicate little or no linear relationship. Values between roughly 0.3 and 0.7 in absolute value indicate moderate correlations.
In our dataset, we observed that many of the correlation coefficients fell in the range of 0.3 to
0.7, signifying moderate correlations among features. This guided our feature selection
process to improve the overall performance of our model.
Figure 1. Correlation between features
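A minimal sketch of how a pairwise correlation matrix like the one in Figure 1 can be computed and plotted (the file name and plot styling are illustrative):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name

# Pairwise Pearson correlations between all columns
corr = df.corr()

# Heatmap of the correlation matrix (similar in spirit to Figure 1)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between features")
plt.tight_layout()
plt.show()
```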
In our future work, we plan to identify and assess highly correlated features. This step is crucial because logistic regression assumes little or no multicollinearity among its input variables. For instance, we've already observed a correlation of 0.52 between general health and physical health. As we continue to enhance our models, it's likely that we will remove one of these correlated features so that this assumption holds. This will contribute to the robustness and accuracy of our models.
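A sketch of how such highly correlated pairs could be flagged automatically, assuming an illustrative 0.5 cutoff (the general health/physical health pair at 0.52 would be caught by it):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name
corr = df.corr()

threshold = 0.5  # illustrative cutoff, not a final modelling decision

# Keep only the upper triangle so each feature pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()  # (feature_a, feature_b) -> correlation
high_pairs = high_pairs[high_pairs.abs() > threshold].sort_values(ascending=False)
print(high_pairs)
```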
Feature Selection
To enhance our feature selection process, we employed CART decision trees and leveraged
the power of Random Forest. Using CART, we constructed a decision tree that revealed the
importance of various features in predicting diabetes risk. The tree visually showcased which
factors played a significant role in the decision-making process. To further enhance model
reliability and reduce overfitting, we turned to Random Forest, a technique that combines
multiple decision trees to make more robust predictions. This method helped us identify and
prioritize the most critical features while providing a more comprehensive view of the
relationships between risk factors and diabetes presence. Together, these techniques enriched
our feature selection strategy, contributing to a more informed and effective modelling
approach.
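A minimal sketch of these two steps with scikit-learn ("Diabetes_binary" is the target column from our dataset; the file name, tree depth, and number of trees are illustrative assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("diabetes_health_indicators.csv")  # illustrative file name
X = df.drop(columns=["Diabetes_binary"])
y = df["Diabetes_binary"]

# CART: a single decision tree, kept shallow so it is easy to read
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
plt.figure(figsize=(16, 8))
plot_tree(cart, feature_names=list(X.columns), class_names=["No diabetes", "Diabetes"], filled=True)
plt.show()

# Random Forest: many trees whose averaged importances are more stable
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```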
Logistic Regression
When conducting logistic regression on our dataset, we initially used all 22 features as
independent variables, with "Diabetes_binary" as the dependent variable (0 for no diabetes, 1
for diabetes), and divided the data into an 80/20 split for training and testing. The initial model achieved an accuracy of 0.86 in predicting whether an individual has diabetes. Subsequently, we
employed feature selection to choose the ten most important features: BMI, Age, Income,
Physical Health, General Health, Education, Mental Health, Blood Pressure, Smoking, and
Fruit Consumption. It's worth noting that some of these features exhibited correlations, which
we plan to address in future iterations of the model. With these ten selected features, we