LeighHutsell_DAT520_FinalProject

School

Southern New Hampshire University

Course

520

Subject

Business

Date

Jan 9, 2024

Hitting vs Pitching: Which is More Important?
Leigh Hutsell
DAT-520 Decision Methods and Modeling
Final Project – Module 9

Table of Contents

Introduction
Data Appraisal
  Data Appraisal: Characterize
  Data Appraisal: Context
  Data Appraisal: Utilities
Techniques
  Techniques: Preparing
  Techniques: Type of Model
  Techniques: Explain
Evaluation
  Evaluate Choices: Best
  Evaluate Choices: Agility
  Evaluate Choices: Address Concerns
Models
  Decision Tree Model: Implementation
  Decision Tree Model: Structure
  Decision Tree Model: Documentation
  Decision Tree Model: Results
  Decision Tree Model: Limitations
Conclusion
References

Introduction

I will be looking to answer the question: which is more important to the success of a baseball team, hitting or pitching? Major League Baseball (MLB) is made up of two leagues, the National League and the American League. Each game consists of 9 innings in which the teams take turns batting and pitching. If the score is tied after 9 innings, the teams play extra innings until one of them leads at the end of an inning. Although the two leagues are similar, some of their rules differ. At the end of the season the top team from each league competes in the World Series.

The two areas of baseball that we are going to investigate are batting and pitching. Before we get to the data, I would like to define the variables that we will be looking at.

Regarding batting we will look at the following:
Batting Average (BA)
At Bats (AB)
Runs (R)
Hits (H)
Home Runs (HR)
Runs Batted In (RBI)
Earned Run Average (ERA)
League (LgID)
Team Wins and Losses

Regarding pitching we will look at similar variables such as:
Wins (W)
Losses (L)
Saves (SV)
Strike Outs (SO)
Earned Run Average (ERA)
Runs Allowed (R)
Games (G)
Earned Runs (ER)
League (LgID)

I originally wanted to look at all of these variables within both pitching and hitting, but after further analysis this is simply too much data and it needs to be pared down. For batting I originally looked at Batting Average (BA), At Bats (AB), Runs (R), Hits (H), Home Runs (HR), Runs Batted In (RBI), Earned Run Average (ERA), League (LgID), and Team Wins (W) and Losses (L). For pitching I originally looked at Wins (W), Losses (L), Saves (SV), Strike Outs (SO), Earned Run Average (ERA), Runs Allowed (R), Games (G), Earned Runs (ER), and League (LgID).

Data Appraisal: Characterize

The data used in this analysis is taken from Sean Lahman's Baseball Database. It contains Major League Baseball data sets that range from 1871 to 2021. It is not one single set of data but rather multiple sets that need to be reviewed to determine which will be most effective in answering the question at hand. Although other sites offer additional data, Lahman's files have the data that is needed.
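
As a rough illustration of what working with one of these CSV files involves, the sketch below parses a small team-level table and derives a batting average as hits divided by at bats. The analysis in this paper was done in Power BI, not Python. The column names (yearID, lgID, teamID, W, L, AB, H, ERA) are assumed to follow Lahman's naming, and the two sample rows are invented for illustration and are not real records.

```python
import csv
import io

# Hypothetical sample standing in for a Lahman team file; the team IDs and
# numbers are invented for illustration and are NOT real records.
SAMPLE_CSV = """yearID,lgID,teamID,W,L,AB,H,ERA
2021,AL,AAA,95,67,5500,1450,3.60
2021,NL,BBB,70,92,5400,1280,4.75
"""


def load_teams(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "yearID": int(raw["yearID"]),
            "lgID": raw["lgID"],
            "teamID": raw["teamID"],
            "W": int(raw["W"]),
            "L": int(raw["L"]),
            "AB": int(raw["AB"]),
            "H": int(raw["H"]),
            "ERA": float(raw["ERA"]),
        })
    return rows


def batting_average(row):
    """Team batting average: hits divided by at bats."""
    return row["H"] / row["AB"]


teams = load_teams(SAMPLE_CSV)
for team in teams:
    print(team["teamID"], team["W"], round(batting_average(team), 3), team["ERA"])
```

With a real file, the same function could be fed the text of the downloaded CSV instead of the inline sample, provided the columns it reads are present.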

Data Appraisal: Context

Data context is, “the network of connections among data points. Those connections may be created as metadata or simply identified and correlated. Contextual metadata adds value, essentially making it possible to receive information from data. A single data on its own is useless,” (Wigmore, 2023). If I isolated Earned Run Average (ERA) alone, I would not be able to analyze the data effectively enough to answer the question at hand. I must use multiple data points that address both pitching and hitting. The file that I found most relevant was “Teams.csv” as it had all the data points needed. The key data points were ERA, BA, W, and L.

Data Appraisal: Utilities

I opted to pull the CSV data from the website. CSV stands for comma-separated values. CSV files are, “simple text files with rows of data where each value is separated by a comma. These types of files can contain large amounts of data within a relatively small file size, making them an ideal data source for Power BI,” (Get Data, 2023). I used this data in Power BI because I was able to import and analyze it easily. The data set contained several factors relating to pitching. I was able to look at those factors and determine which were most relevant to the question at hand. By analyzing this data I was able to determine which factor was most important to winning.

Techniques: Preparing

The analytics process that I followed to build my model is as follows. In week one I identified the question that I wanted to answer: which is more important to the success of a baseball team, hitting or pitching?
The next step was to source the data. For this we were given the data sets referenced above from Sean Lahman. After these files were downloaded, I went through the data sources and identified the information that was relevant to the question above. The data was then condensed into a file that eliminated excess and invalid data.

I used Power BI to conduct exploratory data analysis (EDA). This gave me the opportunity to see the data that I had deemed relevant and to visualize it in order to understand it better. Because Lahman provides such a large amount of data, this was a very important step. The next step was to build data models and test the data. The last step was to monitor and validate that the data presented answers the question and meets the objectives.

Techniques: Type of Model

As noted above, I was able to use Power BI to analyze the data. I originally used the top-down method to look at the big picture and the components of hitting and pitching in order to reach the end goal, which is to determine whether hitting or pitching is more important to the success of a team. I have opted to use the data set Teams.csv from Lahman’s Baseball Database as it contains the pertinent data about the teams as well as hitting and pitching stats. The variables that will be used in the revision are:

For Teams: Wins, Losses, and World Series Championships.
For Batters: Runs Scored, Hits, Home Runs, and Walks.
For Pitchers: Runs Allowed, Hits Allowed, Home Runs Allowed, Strike Outs, and Walks Allowed.

The variables will be weighed using wins as the main target. In reworking
my data and changing my focus I was able to determine that hitting has more of an effect on winning than pitching.

Techniques: Explain

The data set that was provided for this class is free to access and available to anyone with a computer. Although the data is copyrighted, it is still available for public consumption and use. I defined the question, collected the data from Lahman’s database, cleaned the data to identify which parts would be needed, analyzed the data, and created visualizations to show my findings.

The data in the data set is important to answering the question of whether pitching or hitting makes a winning team. The data sets include batting statistics and pitching statistics, but that data is not enough on its own to determine the answer. The data also covers age, awards, all stars, hall of fame, salary, managers, team information, and franchise information. There are a lot of things that make up a team. Some could argue that money makes the best teams, whereas others say it is management or defense. From all my years of watching baseball I know that to find this answer we will need all this data.

One distinct difference in baseball is the two leagues, the American League and the National League. This is important in our research because prior to 2022 American League teams used a designated hitter to bat in place of the pitcher, whereas National League pitchers batted for themselves. In 2022 the universal designated hitter was implemented, so pitchers generally no longer bat in either league, which brings a sense of balance between the two. The datasets provided for this project all have a relationship to pitching and hitting and will all be used to answer the question at hand.
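
The cleaning step described above (keeping only the relevant columns, narrowing the seasons, and separating the two leagues) could be sketched in Python as follows. This is only an illustration of the idea and not the Power BI workflow actually used. The records, team IDs, and the attendance column are hypothetical, and the 2015 to 2022 window is the range used for the results later in this paper.

```python
# Hypothetical team-season records; values are invented for illustration.
records = [
    {"yearID": 1998, "lgID": "AL", "teamID": "AAA", "W": 90, "H": 1500, "ERA": 4.10, "attendance": 2000000},
    {"yearID": 2016, "lgID": "AL", "teamID": "AAA", "W": 85, "H": 1480, "ERA": 3.95, "attendance": 2100000},
    {"yearID": 2019, "lgID": "NL", "teamID": "BBB", "W": 78, "H": 1390, "ERA": 4.40, "attendance": 1800000},
    {"yearID": 2021, "lgID": "NL", "teamID": "CCC", "W": 101, "H": 1520, "ERA": 3.20, "attendance": 2500000},
]

# Columns judged relevant to the question; everything else is dropped.
KEEP = ("yearID", "lgID", "teamID", "W", "H", "ERA")


def clean(rows, first_year, last_year):
    """Keep only seasons in [first_year, last_year] and the columns in KEEP."""
    return [
        {key: row[key] for key in KEEP}
        for row in rows
        if first_year <= row["yearID"] <= last_year
    ]


def split_by_league(rows):
    """Group rows into a dict keyed by league ID."""
    leagues = {}
    for row in rows:
        leagues.setdefault(row["lgID"], []).append(row)
    return leagues


recent = clean(records, 2015, 2022)
by_league = split_by_league(recent)
for league, rows in sorted(by_league.items()):
    print(league, [row["teamID"] for row in rows])
```

Keeping the league split as a separate step makes it easy to run the same analysis once on all teams and once per league.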

Evaluate Choices: Best

At its foundation, this question requires making a decision: we are deciding whether pitching or hitting is more important to a winning team. “Decision trees allow you to create easily interpretable outcomes and pick the best possible solution,” (Introduction to Decision Trees, 2023). The benefits of a Decision Tree are that it is easy to understand, needs little to no data preprocessing, and is versatile. Although Decision Trees have disadvantages, in this scenario the advantages outweigh them.

Evaluate Choices: Agility

This type of analysis supports agility by allowing for clear and confident decision making. In looking at MLB we are also looking at the National League (NL) and the American League (AL), which do differ. Separating the two leagues is important to understanding the relationship between hitting and pitching and the role each plays in wins. One of the main differences between the leagues is that NL pitchers must bat while AL pitchers do not. The AL uses designated hitters to bat for its pitchers.

Evaluate Choices: Address Concerns

One of the main concerns is the sheer amount of data provided. As a whole, it is not feasible to use it all. When building a Decision Tree, it is possible to have too many data points. Not only could this make the tree too complex, but it could also create a greater chance of error. Lahman’s data covers many years. For the Decision Tree to be effective we need to cut down the number of years that we focus on. On the other hand, it is possible to cut the data down so far that it no longer offers enough information. Identifying the correct amount of data and the key data points allows the Decision Tree Model to be effective.
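
To make the complexity concern concrete, the sketch below shows how a single decision tree node could choose a split on one variable. It uses Gini impurity, a common splitting criterion in classification trees. That choice is an assumption made here for illustration, since Power BI's visuals do not necessarily work this way, and the hit totals and labels are hypothetical. Every extra variable and every extra level of depth repeats this search, which is why too much input can make the tree both complex and error-prone.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best 'value > threshold' split."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity with no split at all
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        threshold = (lo + hi) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical season hit totals and a 0/1 "winning season" label.
hits = [1300, 1350, 1400, 1450, 1500, 1550]
winning = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(hits, winning)
print(threshold, impurity)
```

In this toy data the labels separate cleanly, so the best split leaves both sides pure. Real team data would rarely split this neatly.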

Decision Tree Model: Implementation

I originally used the top-down method to look at the big picture and the components of hitting and pitching in order to reach the end goal, which is to determine whether hitting or pitching is more important to the success of a team. I have opted to use the data set Teams.csv from Lahman’s Baseball Database as it contains the pertinent data about the teams as well as hitting and pitching stats. The variables that will be used in the revision are:

For Teams: Wins, Losses, and World Series Championships.
For Batters: Runs Scored, Hits, Home Runs, and Walks.
For Pitchers: Runs Allowed, Hits Allowed, Home Runs Allowed, Strike Outs, and Walks Allowed.

The variables will be weighed using wins as the main target.

Decision Tree Model: Structure

For the Decision Tree Model, I had to identify the Decision Node and the Leaf Nodes. The Decision Node is the root. “It represents the entire dataset, which is further divided into 2 or more homogeneous sets. The decision nodes represent the dataset’s features, branches denote the decision rules, and each leaf signifies the outcome,” (Introduction to Decision Trees, 2023). In reworking my data and changing my focus I was able to determine that hitting has more of an effect on winning than pitching.

Decision Tree Model: Documentation

With my starting data I focused on Wins as the Decision Node and then again on Losses as the Decision Node. Focusing on Hits gave me the data that I needed to analyze and answer the question. The other step, as mentioned above, was to home in on more recent years, which eliminated a large amount of data and made the analysis easier. Input variables included Hits and ERA. These offer insight into both hitting and pitching. Looking at both winning and losing was
important. If I had focused on only one side, I would not be able to say confidently that one or the other has a greater impact on winning.

Decision Tree Model: Results

Below is a series of results that I received throughout the process. I was able to create decomposition trees for both pitching and hitting that help me to answer my question.

Bottom-up decision modeling is described as follows: "the bottom-up approach focuses its analysis on specific characteristics and micro attributes of an individual stock. In bottom-up the concentration is on business-by-business or sector-by-sector fundamentals. This analysis seeks to identify profitable opportunities through the idiosyncrasies of a company’s attributes and its valuations in comparison to the market," (Investopedia, 2023). On the other side, the top-down approach "seeks to identify the big picture and all of its components. These components are usually the driving force for the end goal," (Investopedia, 2023). Based on these definitions I will be using the top-down method, as I will be looking at the big picture and the components of hitting and pitching in order to reach the end goal, which is to determine whether hitting or pitching is more important to the success of a team.

In reworking my data and changing my focus I was able to determine that hitting has more of an effect on winning than pitching. In the data from 2015 – 2022, when the sum of Hits is more than 1553, the likelihood of a World Series Win being Yes increases by 3.81x.
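
One plausible way to read a figure such as 3.81x is as a ratio of rates: the share of team seasons with the outcome among those above the threshold, divided by the share among the rest. The source does not state exactly how Power BI arrives at its number, so the sketch below is only an assumption about the general idea. The 1553 threshold comes from the result above, while the hit totals and outcomes are hypothetical and deliberately do not reproduce 3.81.

```python
def rate(flags):
    """Share of True values in a list of booleans."""
    return sum(flags) / len(flags)


def likelihood_ratio(values, outcomes, threshold):
    """Outcome rate above the threshold divided by the rate at or below it."""
    above = [out for val, out in zip(values, outcomes) if val > threshold]
    below = [out for val, out in zip(values, outcomes) if val <= threshold]
    return rate(above) / rate(below)


# Hypothetical team seasons: total hits and whether the outcome of interest occurred.
hits = [1560, 1570, 1580, 1600, 1400, 1420, 1450, 1480, 1500, 1510, 1530, 1550]
outcome = [True, False, True, False, False, False, False, True, False, False, False, False]

ratio = likelihood_ratio(hits, outcome, 1553)
print(ratio)
```

The ratio is undefined when either group is empty or when the outcome never occurs at or below the threshold, so a real analysis would need to guard against those cases.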

On the opposite side, looking at the cases where World Series Win is No, we see that:

Decision Tree Model: Limitations

The decision tree by itself was not enough for me to draw a sound conclusion. As you can see in the section above, I took the data and analyzed it through other functions of Power BI to validate that my conclusion was correct. Decision Trees are not good for regression. “Logical regression is a statistical analysis approach that uses independent features to try to predict precise probability outcomes. On high-dimensional datasets, this may cause the model to be over-fit on the training set, overstating the accuracy of predictions on the training set, and so preventing the model from accurately predicting results on the test set,” (Decision Tree Limitations, 2023). Decision Trees can also be unstable, meaning that a slight change in the data can produce a completely different result. This is why I opted to add a secondary analysis to confirm.
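
As an example of what a secondary check might look like outside Power BI, the sketch below computes the Pearson correlation between wins and a hitting measure, and between wins and a pitching measure. This is a swapped-in technique chosen for illustration and not the analysis performed in this paper, and the four seasons of numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical team seasons, invented for illustration only.
wins = [70, 80, 90, 100]
hits = [1300, 1400, 1450, 1500]
era = [4.8, 4.2, 3.9, 3.5]

hits_vs_wins = pearson(hits, wins)
era_vs_wins = pearson(era, wins)
print(round(hits_vs_wins, 3), round(era_vs_wins, 3))
```

A lower ERA is better, so a negative correlation with wins is the expected direction. Comparing the absolute sizes of the two coefficients gives a rough second opinion on which side of the game tracks winning more closely.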

Conclusion

Although both pitching and hitting are important to winning, the factor most vital to a team’s success is the number of hits accumulated. The factors revolving around hitting carry much greater weight in the success of a team. Even when a pitcher has a game with many strikeouts and few walks, the other team's hitting can have a greater impact on the outcome than the pitching.

References

Decision Tree Limitations. (2023, March 13). Retrieved September 3, 2023, from https://www.educba.com/decision-tree-limitations/
Get data from comma separated value (CSV) files. (2023, January 26). Retrieved August 28, 2023, from https://learn.microsoft.com/en-us/power-bi/connect-data/service-comma-separated-value-files
Introduction to Decision Trees: Why Should You Use Them? (2023, April 12). Retrieved August 29, 2023, from https://365datascience.com/tutorials/machine-learning-tutorials/decision-trees/
Investopedia. (2023, April 18). Top-Down vs. Bottom-Up: What's the Difference? Retrieved August 3, 2023, from https://www.investopedia.com/articles/investing/030116/topdown-vs-bottomup.asp
Lahman, S. (n.d.). Download Lahman’s Baseball Database. Retrieved August 28, 2023, from http://www.seanlahman.com/baseball-archive/statistics/
Wigmore, I. (n.d.). Data Context. Retrieved August 28, 2023, from https://www.techtarget.com/whatis/definition/data-context