D208_Task2_PA.docx

School

Western Governors University

Course

D208

Subject

Statistics

Date

Feb 20, 2024

Pages

24

Uploaded by MagistrateAntelope3113

D208 - Predictive Modeling Logistic Regression Modeling
Table of Contents
Part I: Research Question
  A. Describe Purpose of Analysis
    1. Summarize one research question
    2. Define Goals of Analysis
Part II: Method Justification
  B. Describe Multiple Logistic Regression Methods
    1. Summarize four assumptions of a logistic regression model
    2. Describe two benefits of using Python in support of analysis
    3. Explain why logistic regression is an appropriate technique to use based on question in Part I
Part III: Data Prep
  C. Summarize the data prep process
    1. Describe Data cleaning goals – See attached .ipynb File
    2. Describe dependent and all independent variables
    3. Generate univariate and bivariate visualizations of the distributions – independent and dependent variables, include dependent variable in bivariate visualization
    4. Describe data transformation goals that align with your research question and the steps used to transform the data to achieve goals, include annotated code
    5. Provide Prepared set as a CSV file
Part IV: Model Comparison & Analysis
  D. Compare initial and reduced logistic regression model
    1. Initial multiple logistic regression model with all variables from part C2
    2. Justify statistically based feature selection
    3. Provide reduced logistic regression model
  E
    1. Model Evaluation Metric explanation
    2. Confusion Matrix & Accuracy Calculation
    3. Attached code
  F. Summary
    1. Discuss results
    2. Recommend course of action
  G
  H
  I
Part I: Research Question
A. Describe Purpose of Analysis
1. Summarize one research question
What factors contribute to Churn?
2. Define Goals of Analysis
The objective of my analysis is to gain insight into which customer factors are associated with whether or not a customer churns.
Part II: Method Justification
B. Describe Multiple Logistic Regression Methods
1. Summarize four assumptions of a logistic regression model
Assumptions for this model include:
Observations are independent: the outcome of one observation should not influence what happens in another observation.
There is little or no multicollinearity: the independent variables should not correlate highly with each other.
The model should fit the data well, and a goodness-of-fit test should be used to evaluate this.
The relationship between the continuous independent variables and the log-odds (logit) of the outcome should be linear.
2. Describe two benefits of using Python in support of analysis
Jupyter Notebook and Python are the tools I used to complete this analysis. Using Python is beneficial for many reasons, but I will list only two. The first benefit is that Python offers multiple data visualization libraries that I can use to visualize my logistic regression models. The second benefit is that Python has a rich ecosystem of libraries in which many of the calculations are already built, which saves time in the analysis phase. Together, these mean that I can calculate and visualize my data with ease using Python.
3. Explain why logistic regression is an appropriate technique to use based on question in Part I
Our target variable, Churn, is a binary categorical field. Logistic regression models the probability of a binary outcome, so it will help identify the factors that influence Churn. Therefore, logistic regression is an appropriate technique for answering my question in Part I. We will test independent variables to determine the effect they have on our target variable. The effect could be positive, negative, or none.
Part III: Data Prep
C. Summarize the data prep process
1. Describe Data cleaning goals – See attached .ipynb File
While becoming familiar with the data, I used .describe(), box plots, and .isnull() sums to identify areas of the data that needed to be cleaned. The goal is to have a data environment that is optimal for performing a logistic regression analysis.
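The exploration steps named above can be sketched roughly as follows. This is a minimal illustration, not the code from the attached notebook: the small DataFrame and its values are invented, and the helper boxplot_outliers is a hypothetical function that applies the usual 1.5 * IQR whisker rule, so it reports numerically the points a box plot would draw as outliers.

```python
import pandas as pd

# Hypothetical stand-in for the churn data; the values are invented for illustration.
df = pd.DataFrame({
    "Income": [25000.0, 31000.0, 39000.0, 42000.0, 47000.0, 250000.0],
    "Outage_sec_perweek": [8.5, 9.7, 10.1, 10.4, 11.2, 12.0],
    "InternetService": ["DSL", None, "Fiber Optic", "DSL", None, "Fiber Optic"],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()

# Number of missing values in each column.
null_counts = df.isnull().sum()


def boxplot_outliers(series: pd.Series) -> pd.Series:
    """Return the values outside the 1.5 * IQR whiskers of a standard box plot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


income_outliers = boxplot_outliers(df["Income"])

print(summary)
print(null_counts)
print(income_outliers)
```

In this toy data the missing-value count flags InternetService and the whisker rule flags only the single extreme Income value.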
I began by identifying that there were null fields in InternetService. Using techniques similar to those I used in D206, I filled those NaN fields with the .fillna() method. It is an easy way to fill null fields without compromising the data or removing excessive rows from the data source. Once the nulls were taken care of, I used box plots to identify outliers in the data. Multiple fields appeared to have outliers: Income, Children, and Outage_sec_perweek. I decided to count the outliers in each field and then replace any that were several standard deviations from the mean. The threshold was set at 3: values whose z-scores placed them more than three standard deviations from the mean were replaced. With the null values and outliers taken care of, the data cleaning step was complete.
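A rough sketch of these two cleaning steps is shown below. The report does not state which value was used to fill the nulls or what the outliers were replaced with, so both choices here are assumptions made only for illustration: nulls are filled with the label "None" and outliers are replaced with the column median. The data and the helper replace_outliers are hypothetical.

```python
import pandas as pd

# Hypothetical data: 19 ordinary incomes plus one extreme value,
# and an InternetService column with some missing entries.
incomes = [float(v) for v in range(30000, 49000, 1000)] + [250000.0]
services = ["DSL", "Fiber Optic", None, "DSL"] * 5
df = pd.DataFrame({"Income": incomes, "InternetService": services})

# Assumption: missing InternetService entries are filled with the label "None".
df["InternetService"] = df["InternetService"].fillna("None")


def replace_outliers(series: pd.Series, threshold: float = 3.0):
    """Replace values whose |z-score| exceeds the threshold.

    Assumption: the replacement value is the column median; the report
    does not say which value was actually used.
    """
    z_scores = (series - series.mean()) / series.std()
    is_outlier = z_scores.abs() > threshold
    cleaned = series.mask(is_outlier, series.median())
    return cleaned, int(is_outlier.sum())


df["Income"], n_replaced = replace_outliers(df["Income"])

print("Outliers replaced:", n_replaced)
print(df["Income"].describe())
```

Returning the outlier count alongside the cleaned column mirrors the order described above: count first, then replace.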
2. Describe dependent and all independent variables
The original data set contains 50 columns and 10,000 rows of customers. For my analysis I will focus on ‘Churn’ as my dependent variable. The continuous independent variables I will retain and summarize are ‘Income’, ‘Outage_sec_perweek’, ‘Tenure’, ‘MonthlyCharge’, and ‘Bandwidth_GB_Year’. The categorical independent variables I will retain are ‘Area’, ‘Contract’, and ‘PaymentMethod’.
The dependent variable, Churn, is binary with values ‘Yes’ or ‘No’. ‘Yes’ indicates that the customer churned and ‘No’ indicates that they did not. ‘No’ is the most frequent response with 7,350 customers; the remaining 2,650 were ‘Yes’.
There are five continuous independent variables. Income has 10,000 rows, with a minimum stated income of $348 and a maximum of $258,900. The average yearly income for the customers is $39,806. Outage_sec_perweek has an average of 10 seconds per week, a maximum of 21, and a minimum of 0.09 seconds per week. Tenure is the length of time a customer has been with the provider, with an average of 34 months and a maximum of 71 months. Bandwidth_GB_Year has an average of 3,392 GB per year, a maximum of 7,158 GB per year, and a minimum of 155 GB per year. MonthlyCharge is the final continuous independent variable, with an average of $172.62, a maximum monthly payment of $290.16, and a minimum of $79.98.
There are three categorical independent variables. Area is the type of area a customer lives in. The top area for our customers is Suburban with 33.46% of customers, followed by Urban at 33.27% and Rural at 33.27%. Contract is the type of contract our customers are on. The most frequent contract is Month-to-month with 54.56% of customers, followed by Two Year at 24.42% and One Year at 21.02%. Finally, PaymentMethod is the type of payment method the customer uses. The top payment method is Electronic Check with 33.98% of customers, followed by 22.9% mailing in checks, 22.29% using automatic bank transfers, and 20.83% using automatic credit card transactions.
Churn, the dependent variable, is the fourth categorical variable, with values Yes or No. 73.5% of the 10,000 customers have not churned.
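Summaries like the ones above are typically produced with .describe() for continuous variables and .value_counts() for categorical ones. The sketch below illustrates that pattern on a small invented sample; the column names come from the report, but the values are hypothetical and will not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
df = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "No", "Yes", "No", "No", "No"],
    "Contract": ["Month-to-month", "Two Year", "Month-to-month", "One year",
                 "Month-to-month", "Month-to-month", "Two Year", "Month-to-month"],
    "Tenure": [5.0, 60.0, 2.0, 34.0, 8.0, 12.0, 71.0, 40.0],
    "MonthlyCharge": [172.62, 79.98, 290.16, 150.0, 200.0, 120.0, 180.0, 160.0],
})

# Mean, minimum, and maximum of the continuous variables.
numeric_summary = df[["Tenure", "MonthlyCharge"]].describe().loc[["mean", "min", "max"]]

# Raw counts for the dependent variable.
churn_counts = df["Churn"].value_counts()

# Percentage of customers in each category of a categorical variable.
contract_share = (df["Contract"].value_counts(normalize=True) * 100).round(2)

print(numeric_summary)
print(churn_counts)
print(contract_share)
```

Passing normalize=True returns proportions rather than counts, which is how percentage breakdowns such as the Contract shares can be obtained directly.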
Using this data to describe our typical customer, we would expect a customer who has a yearly income of about $39,806, experiences an average of 10 seconds of internet interruption due to outages per week, has been with us as a provider for around 34 months, and uses 3,392 GB of data per year. They also likely live in a suburban area, are on a month-to-month contract, and pay about $173 per month by electronic check.
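One way to assemble such a profile is to take the mean of each continuous variable and the most frequent value (mode) of each categorical variable. The sketch below assumes that approach; the helper typical_profile and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the values are invented for illustration only.
df = pd.DataFrame({
    "Income": [30000.0, 40000.0, 50000.0, 40000.0],
    "Tenure": [10.0, 34.0, 58.0, 34.0],
    "Area": ["Suburban", "Urban", "Suburban", "Rural"],
    "Contract": ["Month-to-month", "Month-to-month", "Two Year", "One year"],
})


def typical_profile(frame: pd.DataFrame) -> pd.Series:
    """Combine numeric means with the most frequent value of each other column."""
    numeric_means = frame.select_dtypes(include="number").mean()
    # mode() returns a DataFrame; its first row holds the top value per column.
    categorical_modes = frame.select_dtypes(exclude="number").mode().iloc[0]
    return pd.concat([numeric_means, categorical_modes])


profile = typical_profile(df)
print(profile)
```

A profile built this way describes averages and most common categories separately, so it may not correspond to any single real customer.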