Lecture Workbook Part 1 - Understanding the Problem and the Data

School

Kennesaw State University

Subject

Statistics

Date

Feb 20, 2024

The Problem: The Student Success Center and the Advising Department at Data University want to better understand the factors that determine a student’s GPA upon graduation. They do a great job of helping students decide exactly what mix of classes to take each semester to satisfy their degree requirements, but they would also like to advise students on which lifestyle habits lead to student success. Of course, they already give students generalized advice such as “study hard”, “go to class”, and “balance your school and social life”. However, it would be helpful to know which specific habits have the greatest impact on a student’s GPA so that they can provide more targeted recommendations. It would also be helpful to be able to predict the expected GPA of a particular student based on their current habits.

Step 1: The first step in any data science or analytics problem is to understand the problem. This step often requires asking lots of questions and really getting to the root of the problem. We are limited to the information we have above; however, it does provide us with a relatively good understanding of the problem.

1. What are the pain points?
   What will they do once they know the specific habits?
   Who is the client?
   What do they need to accomplish their goal?
   What do they need? What do they want to do?
   What relationships do the variables have to GPA?
   What relationships do the variables have with each other?
   They want to predict the expected GPA of a particular student based on their current habits.

2. What sort of outcome are you or the client looking for?
   Which habits contribute towards GPA
   The client wants to provide better advice

What defines success and what sort of analysis would add value?
   Improvement in overall GPA
   Average GPA improved by 0.2
   To be completed within 12 weeks

Step 2: Next, we need to use the information provided to formulate a problem statement that can be solved with data.
The key components of a good problem statement are that it is clear, concise, and measurable.

1. Clear, meaning that it is easily understood and not ambiguous.
2. Concise, meaning that it is no longer than it absolutely needs to be; it is straight to the point (*this problem statement should only be 1-2 sentences).
3. Measurable, meaning that it can be measured and is actionable.

Sample Problem Statement: “We will perform an explanatory analysis to determine the most impactful habits that lead to student success and build a model to predict a student’s expected GPA, at graduation, based on student habits, at the time of advisement.”

To this statement, add:
+ Timeframe
+ Improve average GPA by 0.2
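One way to see what "measurable" means here is to write the success criterion as a check that can be evaluated once data exists. The sketch below is only an illustration: the function name `gpa_goal_met`, the `min_gain` parameter, and the GPA values are all hypothetical, and comparing two group averages this way is an assumption about how the 0.2 improvement might be assessed, not a method stated in the workbook.

```python
from statistics import mean


def gpa_goal_met(baseline_gpas, followup_gpas, min_gain=0.2):
    """Return True if the average GPA rose by at least `min_gain`.

    Hypothetical helper: it compares the mean of a baseline group with the
    mean of a follow-up group. The 0.2 default mirrors the goal noted above.
    """
    gain = mean(followup_gpas) - mean(baseline_gpas)
    return gain >= min_gain


# Made-up GPA values, for illustration only (not from the workbook's data set).
before = [3.0, 3.0, 3.0]
after = [3.5, 3.5, 3.5]
print(gpa_goal_met(before, after))
```

A criterion written this way forces the timeframe and the comparison group to be stated explicitly, which is the point of making the problem statement measurable.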

Collecting the Data: After we have a good understanding of the problem and have developed a well-defined problem statement, we need to collect the data. If you are working for or consulting with a company, you will first figure out what private data they have available for you to use. After that, you may want to brainstorm and search for any publicly available data that might be helpful in your analysis. There may be no public data that is relevant to your specific problem, but it is important to consider. After you have taken a first pass at the analysis using the data that is readily available, you may need to collect new and different data on the observational units to create a more robust and meaningful analysis.

For our problem, the advising department has provided us with a small private data set to get started. It contains observations on 133 recent graduate students from Data University who were enrolled as full-time students. The variables include:

1. Student ID: A deidentified student ID number that cannot be linked to the student for ethical reasons
2. Sex: The student’s biological sex
3. Sleep: The average number of hours the student sleeps per night
4. Alcohol: The average number of alcoholic beverages a student consumes per week
5. Exercise: The average number of hours a student exercises per week
6. TV: The average number of hours of TV a student watches per week
7. Study: The average number of hours a student studies per week
8. Seat: The general location a student typically sits in the classroom (1-Front, 2-Middle, 3-Back)
9. GPA: The student’s grade point average upon graduation

Defining the Dependent (Target/Response) Variable: After we obtain the data, we need to define our target variable of interest. (The target is generally determined before the data collection process, but this may happen after seeing what data is readily available.)
Other names for the target variable are the “response variable” or the “dependent variable.” Target, Response and Dependent can be used interchangeably. What is the variable we are trying to better understand? If we are building an explanatory model, what variable are we seeking to better explain? If we are building a predictive model, what variable are we seeking to predict?

1. Which variable from the list above would be our target?
   GPA
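The variable list and the choice of target can be written down as a small data dictionary before any analysis starts. This is a minimal sketch: the variable names and the seat coding come from the list above, while the "kind" labels (`identifier`, `categorical`, `numeric`) and the helper `decode_seat` are assumptions made for illustration.

```python
# Data dictionary sketch. Names and seat codes come from the workbook's
# variable list; the "kind" labels are this sketch's own assumption.
VARIABLES = {
    "Student ID": "identifier",   # deidentified label, not a measurement
    "Sex": "categorical",
    "Sleep": "numeric",           # average hours per night
    "Alcohol": "numeric",         # average drinks per week
    "Exercise": "numeric",        # average hours per week
    "TV": "numeric",              # average hours per week
    "Study": "numeric",           # average hours per week
    "Seat": "categorical",        # stored as a code from 1 to 3
    "GPA": "numeric",             # grade point average upon graduation
}

# Seat is stored as a number but represents a category, so keep the labels handy.
SEAT_LABELS = {1: "Front", 2: "Middle", 3: "Back"}

# The target (response/dependent) variable chosen above.
TARGET = "GPA"


def decode_seat(code):
    """Translate a seat code (1, 2, or 3) into its label."""
    return SEAT_LABELS[code]


print(TARGET, "is", VARIABLES[TARGET])
print(decode_seat(1))
```

Recording that Seat is a coded category rather than a quantity matters later, because averaging the codes 1, 2, and 3 would not produce a meaningful number.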

Listing the Independent (Explanatory/Predictor) Variables: The other variables in the data set can be referred to as “explanatory variables” if we are using them to explain how the target responds to changes in them, “predictor variables” if we are using them to predict the target, or “independent variables.”

2. List the explanatory or predictor variables in the dataset (Hint: There are only 7 independent analysis variables, because one of the variables above would not make sense to analyze):

The objective of the first part of the analysis is to understand the variables we have and how (or if) the independent variables are related to the dependent variable.

Exploring the Dataframe Exercise: Look at the worksheet “0 – Raw Data” to answer the questions.

1. How many variables are in the data set? This should correspond to the number of columns.
2. How many people (observational units) did the study collect data on? This should correspond to the number of rows in our dataset.
3. What data point is found in cell D17? What does it represent?
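The same bookkeeping can be sketched outside the spreadsheet. The example below makes two assumptions that should be checked against the actual “0 – Raw Data” worksheet: that the columns appear in the order of the variable list, starting in column A, and that row 1 holds the headers. The function names `predictor_columns` and `excel_cell_to_indices` are hypothetical helpers written for this illustration.

```python
import re

# Assumed column order: the same order as the workbook's variable list.
COLUMNS = ["Student ID", "Sex", "Sleep", "Alcohol", "Exercise",
           "TV", "Study", "Seat", "GPA"]
TARGET = "GPA"
IDENTIFIER = "Student ID"  # a label for the row, not something to analyze


def predictor_columns(columns, target, identifier):
    """Return every column that is neither the target nor the identifier."""
    return [c for c in columns if c not in (target, identifier)]


def excel_cell_to_indices(ref, header_rows=1):
    """Convert a reference like 'D17' to zero-based (data row, column) indices.

    Assumes the first `header_rows` rows of the sheet hold column names,
    so the first observation sits in spreadsheet row 2 by default.
    """
    match = re.fullmatch(r"([A-Za-z]+)([0-9]+)", ref)
    if match is None:
        raise ValueError(f"not a cell reference: {ref!r}")
    letters, digits = match.groups()
    col_number = 0
    for ch in letters.upper():
        # Column letters work like base-26 digits: A=1, ..., Z=26, AA=27.
        col_number = col_number * 26 + (ord(ch) - ord("A") + 1)
    row_index = int(digits) - 1 - header_rows
    if row_index < 0:
        raise ValueError(f"{ref!r} points at a header row")
    return row_index, col_number - 1


predictors = predictor_columns(COLUMNS, TARGET, IDENTIFIER)
row_index, col_index = excel_cell_to_indices("D17")

print(len(predictors))                      # count of independent variables
print(row_index, col_index)                 # zero-based position of D17
print(COLUMNS[col_index])                   # column name, under the assumed layout
```

Under these assumptions, a cell reference names one observation (the row) and one variable (the column), which is how question 3 should be read. If the worksheet has a different column order or extra header rows, the mapping changes accordingly.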
