IST 707 - Applied Machine Learning
HW5: Decision Trees
Use Decision Tree to Solve a Mystery in History
Shaun McKellar Jr
Introduction:
The Federalist Papers consisted of a collection of 85 essays aimed at persuading the people of New York to support the adoption of the newly proposed U.S. Constitution. These essays, authored by Alexander Hamilton, James Madison, and John Jay, were initially published anonymously under the pseudonym "Publius" in New York newspapers during 1787 and 1788. Although a bound version of the essays emerged in 1788, it wasn't until the 1818 edition, printed by Jacob Gideon, that the true authors were disclosed. The Federalist Papers hold immense significance as a key resource for interpreting the original intentions behind the Constitution.
Among these 85 essays, Alexander Hamilton is credited with writing 51, James Madison with 15, John Jay with 5, and 3 are attributed jointly to Hamilton and Madison. However, there is ongoing debate about the authorship of the remaining 11 essays. Historians have grappled for many years with whether these disputed essays should be attributed to Hamilton or to Madison.
About the Data
The Federalist Papers data set was used to conduct this analysis. This data set initially contained 85 rows and 72 columns. Each row referred to a paper written by one of the authors, and 70 of the columns represented a word used within the paper. The value within the cells referred to the word’s relative frequency within a particular document. The remaining two columns referred to the author’s name and the file’s name/the paper in question.
The data set contained no missing values, but some data cleansing and transformation were still necessary. The author and file-name columns are not word-frequency features: the author column serves as the class label, while the file name was kept as the row label so that individual observations could be identified. The file names were quite long, however, which could reduce readability when identifying observations in plots and output.
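As a minimal sketch of that row-label transformation (the CSV file name and the ".txt" filename format here are assumptions, not taken from the original files):

# Load the data set; "fedPapers85.csv" is an assumed file name.
FederalistPapers <- read.csv("fedPapers85.csv", stringsAsFactors = FALSE)

# Use the file name as the row label, trimming the assumed ".txt"
# extension so each observation keeps a short, identifiable label.
rownames(FederalistPapers) <- gsub("\\.txt$", "", FederalistPapers$filename)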
In this R code, a series of data preprocessing and exploratory steps were carried out. To begin with, several R libraries were loaded, such as wordcloud, quanteda, arules, and ggplot2, providing tools for text mining, data analysis, and visualization. The working directory was set to a specific location on the desktop, ensuring that R could locate and save files. The Federalist Papers dataset was loaded from a CSV file, and a backup copy called "FederalistPapers_Orig" was created to preserve the original data. The dataset was then explored using the View function to interactively examine its contents, and a check for missing values was conducted to ensure data completeness. To prepare the text data for analysis, thresholds for term frequency were set to filter out overly common and extremely rare words. Additionally, a list of stop words, including common English words, was defined to exclude them from the analysis.
Furthermore, a summary of the "Federalist Papers" dataset was generated to gain insights into its structure and content. Lastly, available transformations were inspected. These preprocessing steps are crucial in text mining and natural language processing projects, as they lay the foundation for meaningful analysis by addressing data quality, term frequency, and stop words to focus on relevant patterns and insights within the text data.
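A condensed sketch of these preprocessing steps follows; the working directory, CSV file name, threshold values, and stop-word list are illustrative assumptions rather than the exact values used.

library(wordcloud)   # word-cloud visualizations
library(quanteda)    # text mining and tokenization
library(arules)      # association-rule utilities
library(ggplot2)     # general-purpose plotting

setwd("~/Desktop/IST707")                        # assumed working directory
FederalistPapers <- read.csv("fedPapers85.csv")  # assumed file name
FederalistPapers_Orig <- FederalistPapers        # backup of the original data

View(FederalistPapers)         # interactively inspect the contents
sum(is.na(FederalistPapers))   # expect 0: no missing values
summary(FederalistPapers)      # structure and content overview

# Illustrative term-frequency thresholds and stop-word list
minTermFreq <- 0.001                        # drop extremely rare words
maxTermFreq <- 0.50                         # drop overly common words
stopWords   <- c("the", "and", "of", "to")  # abbreviated example list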
Model/Results
Section 1: Data Preparation
1. Load Dataset: The Federalist Papers dataset is loaded into FederalistPapers.
2. Create Subsets: Two subsets are created: FedPapers85 (the papers of known authorship, excluding the disputed papers) and dispData85 (only the disputed papers).
3. Create Training and Testing Sets: createDataPartition is used to split FedPapers85 into training (training) and testing (testing) sets, stratified on the author column (see the sketch after this list).
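A sketch of these three steps, assuming the caret package, the common fedPapers85.csv author coding ("dispt" marking the disputed papers), and an illustrative split proportion:

library(caret)   # provides createDataPartition

FederalistPapers <- read.csv("fedPapers85.csv")  # assumed file name

# Steps 1-2: separate the 74 papers of known authorship from the
# 11 disputed papers ("dispt" label is an assumption).
FedPapers85 <- subset(FederalistPapers, author != "dispt")
dispData85  <- subset(FederalistPapers, author == "dispt")

# Step 3: stratified split on the author column; p = 0.7 is illustrative.
set.seed(123)
inTrain  <- createDataPartition(FedPapers85$author, p = 0.7, list = FALSE)
training <- FedPapers85[inTrain, ]
testing  <- FedPapers85[-inTrain, ]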
As shown in the accompanying screenshots, FedPapers85 contains 74 observations of 72 variables, and dispData85 contains 11 observations of 72 variables.
Model 1
Summary of Model 1
Tree 1 Call
Call:
rpart(formula = author ~ . - filename, data = training, method = "class",
    control = rpart.control(cp = 0))
  n= 54

         CP nsplit rel error xerror      xstd
1 0.5555556      0 1.0000000    1.0 0.1924501
2 0.0000000      1 0.4444444    0.5 0.1521452

Variable importance
 upon there    on    to    by    an
   25    20    16    14    13    12

Node number 1: 54 observations,    complexity param=0.5555556
  predicted class=Hamilton  expected loss=0.3333333  P(node) =1
    class counts:    36     3     4    11
   probabilities: 0.667 0.056 0.074 0.204
  left son=2 (35 obs) right son=3 (19 obs)
  Primary splits:
      upon  < 0.0195 to the right, improve=16.033140, (0 missing)
      on    < 0.081  to the left,  improve=11.767720, (0 missing)
      there < 0.014  to the right, improve=11.151940, (0 missing)
      to    < 0.5    to the right, improve= 9.485770, (0 missing)
      of    < 0.8655 to the right, improve= 7.029206, (0 missing)
  Surrogate splits:
      there < 0.014  to the right, agree=0.926, adj=0.789, (0 split)
      on    < 0.0745 to the left,  agree=0.870, adj=0.632, (0 split)
      to    < 0.5    to the right, agree=0.852, adj=0.579, (0 split)
      by    < 0.1385 to the left,  agree=0.833, adj=0.526, (0 split)
      an    < 0.064  to the right, agree=0.815, adj=0.474, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6481481
    class counts:    35     0     0     0
   probabilities: 1.000 0.000 0.000 0.000

Node number 3: 19 observations
  predicted class=Madison  expected loss=0.4210526  P(node) =0.3518519
    class counts:     1     3     4    11
   probabilities: 0.053 0.158 0.211 0.579

The output above summarizes a decision tree model created using the rpart() function in R, aimed at classifying authors in the training dataset. The model, built on 54 observations, resulted in a tree with a single split on 'upon', yielding two terminal nodes. The most influential variables in determining splits are 'upon', 'there', 'on', 'to', 'by', and 'an'. The root node (Node number 1) encompasses all observations and predicts the majority class 'Hamilton'; the split then refines the predictions. The first child node (Node number 2) perfectly predicts 'Hamilton' for 35 observations, while the second child node (Node number 3) handles the remaining 19 observations with a mix of authors, primarily 'Madison', but with greater uncertainty in classification (higher expected loss).
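For reference, a minimal sketch of how this model can be fit and plotted is shown below. The rpart() call matches the one in the summary above; the rpart.plot package is an assumption for the tree plots in Figures 1 and 2, since only the figure captions survive here.

library(rpart)        # recursive partitioning trees
library(rpart.plot)   # tree plotting (assumed package for the figures)

# Fit the unpruned tree exactly as in the summary above, using the
# training set created in Section 1: author is predicted from all
# word-frequency columns, with filename excluded from the predictors.
tree1 <- rpart(author ~ . - filename, data = training,
               method = "class", control = rpart.control(cp = 0))

summary(tree1)     # CP table, variable importance, and node details
rpart.plot(tree1)  # draw the fitted tree (Figures 1 and 2)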
Figure 1-Model 1 Plot
Figure 2-Model 1 Plot 2