IST 707 - Applied Machine Learning
HW5: Decision Trees
Use Decision Tree to Solve a Mystery in History
Shaun McKellar Jr
Introduction:
The Federalist Papers consisted of a collection of 85 essays aimed at persuading the people of New York to support the adoption of the newly proposed U.S. Constitution. These essays, authored by Alexander Hamilton, James Madison, and John Jay, were initially published anonymously under the pseudonym "Publius" in New York newspapers during 1787 and 1788. Although a bound version of the essays emerged in 1788, it wasn't until the 1818 edition, printed by Jacob Gideon, that the true authors were disclosed. The Federalist Papers hold immense significance as a key resource for interpreting the original intentions behind the Constitution.
Among these 85 essays, Alexander Hamilton is credited with writing 51, James Madison with 15, John Jay with 5, and 3 are attributed jointly to Hamilton and Madison. However, there is ongoing debate about the authorship of the remaining 11 essays. Historians have grappled for many years with whether these disputed essays should be attributed to Hamilton or to Madison.
About the Data
The Federalist Papers data set was used to conduct this analysis. This data set initially contained 85 rows and 72 columns. Each row referred to a paper written by one of the authors, and 70 of the columns represented a word used within the paper. The value within the cells referred to the word’s relative frequency within a particular document. The remaining two columns referred to the author’s name and the file’s name/the paper in question.
The data set contained no missing values, but some data cleansing and transformation were still necessary. The author and file-name columns are not word-frequency features: the author column serves as the class label, while the file name was kept as the row label so that individual observations could be identified. The file names were quite long, however, which could reduce readability when identifying observations in plots and output.
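As a minimal sketch of that row-label transformation (the CSV file name and the ".txt" filename format here are assumptions, not taken from the original files):

# Load the data set; "fedPapers85.csv" is an assumed file name.
FederalistPapers <- read.csv("fedPapers85.csv", stringsAsFactors = FALSE)

# Use the file name as the row label, trimming the assumed ".txt"
# extension so each observation keeps a short, identifiable label.
rownames(FederalistPapers) <- gsub("\\.txt$", "", FederalistPapers$filename)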
In this R code, a series of data preprocessing and exploratory steps were carried out. To begin with, several R libraries were loaded, such as wordcloud, quanteda, arules, and ggplot2, providing tools for text mining, data analysis, and visualization. The working directory was set to a specific location on the desktop, ensuring that R could locate and save files. The Federalist Papers dataset was loaded from a CSV file, and a backup copy called "FederalistPapers_Orig" was created to preserve the original data. The dataset was then explored using the View function to interactively examine its contents, and a check for missing values was conducted to ensure data completeness. To prepare the text data for analysis, thresholds for term frequency were set to filter out overly common and extremely rare words. Additionally, a list of stop words, including common English words, was defined to exclude them from the analysis.
Furthermore, a summary of the "Federalist Papers" dataset was generated to gain insights into its structure and content. Lastly, available transformations were inspected. These preprocessing steps are crucial in text mining and natural language processing projects, as they lay the foundation for meaningful analysis by addressing data quality, term frequency, and stop words to focus on relevant patterns and insights within the text data.
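A condensed sketch of these preprocessing steps follows; the working directory, CSV file name, threshold values, and stop-word list are illustrative assumptions rather than the exact values used.

library(wordcloud)   # word-cloud visualizations
library(quanteda)    # text mining and tokenization
library(arules)      # association-rule utilities
library(ggplot2)     # general-purpose plotting

setwd("~/Desktop/IST707")                        # assumed working directory
FederalistPapers <- read.csv("fedPapers85.csv")  # assumed file name
FederalistPapers_Orig <- FederalistPapers        # backup of the original data

View(FederalistPapers)         # interactively inspect the contents
sum(is.na(FederalistPapers))   # expect 0: no missing values
summary(FederalistPapers)      # structure and content overview

# Illustrative term-frequency thresholds and stop-word list
minTermFreq <- 0.001                        # drop extremely rare words
maxTermFreq <- 0.50                         # drop overly common words
stopWords   <- c("the", "and", "of", "to")  # abbreviated example list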
Model/Results
Section 1: Data Preparation
1. Load Dataset: The Federalist Papers dataset is loaded into FederalistPapers.
2. Create Subsets: Two subsets are created: FedPapers85 (the papers of known authorship, excluding the disputed papers) and dispData85 (only the disputed papers).
3. Create Training and Testing Sets: createDataPartition is used to split FedPapers85 into training (training) and testing (testing) sets, stratified on the author column (see the sketch after this list).
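A sketch of these three steps, assuming the caret package, the common fedPapers85.csv author coding ("dispt" marking the disputed papers), and an illustrative split proportion:

library(caret)   # provides createDataPartition

FederalistPapers <- read.csv("fedPapers85.csv")  # assumed file name

# Steps 1-2: separate the 74 papers of known authorship from the
# 11 disputed papers ("dispt" label is an assumption).
FedPapers85 <- subset(FederalistPapers, author != "dispt")
dispData85  <- subset(FederalistPapers, author == "dispt")

# Step 3: stratified split on the author column; p = 0.7 is illustrative.
set.seed(123)
inTrain  <- createDataPartition(FedPapers85$author, p = 0.7, list = FALSE)
training <- FedPapers85[inTrain, ]
testing  <- FedPapers85[-inTrain, ]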
As shown in the accompanying screenshots, FedPapers85 contains 74 observations of 72 variables, and dispData85 contains 11 observations of 72 variables.
Model 1
Summary of Model 1
Tree 1 Call
Call:
rpart(formula = author ~ . - filename, data = training, method = "class",
    control = rpart.control(cp = 0))
  n= 54

         CP nsplit rel error xerror      xstd
1 0.5555556      0 1.0000000    1.0 0.1924501
2 0.0000000      1 0.4444444    0.5 0.1521452

Variable importance
 upon there    on    to    by    an
   25    20    16    14    13    12

Node number 1: 54 observations,    complexity param=0.5555556
  predicted class=Hamilton  expected loss=0.3333333  P(node) =1
    class counts:    36     3     4    11
   probabilities: 0.667 0.056 0.074 0.204
  left son=2 (35 obs) right son=3 (19 obs)
  Primary splits:
      upon  < 0.0195 to the right, improve=16.033140, (0 missing)
      on    < 0.081  to the left,  improve=11.767720, (0 missing)
      there < 0.014  to the right, improve=11.151940, (0 missing)
      to    < 0.5    to the right, improve= 9.485770, (0 missing)
      of    < 0.8655 to the right, improve= 7.029206, (0 missing)
  Surrogate splits:
      there < 0.014  to the right, agree=0.926, adj=0.789, (0 split)
      on    < 0.0745 to the left,  agree=0.870, adj=0.632, (0 split)
      to    < 0.5    to the right, agree=0.852, adj=0.579, (0 split)
      by    < 0.1385 to the left,  agree=0.833, adj=0.526, (0 split)
      an    < 0.064  to the right, agree=0.815, adj=0.474, (0 split)

Node number 2: 35 observations
  predicted class=Hamilton  expected loss=0  P(node) =0.6481481
    class counts:    35     0     0     0
   probabilities: 1.000 0.000 0.000 0.000

Node number 3: 19 observations
  predicted class=Madison  expected loss=0.4210526  P(node) =0.3518519
    class counts:     1     3     4    11
   probabilities: 0.053 0.158 0.211 0.579

The output above summarizes a decision tree model created using the rpart() function in R, aimed at classifying authors in the training dataset. The model, built on 54 observations, resulted in a tree with a single split on 'upon', yielding two terminal nodes. The most influential variables in determining splits are 'upon', 'there', 'on', 'to', 'by', and 'an'. The root node (Node number 1) encompasses all observations and predicts the majority class 'Hamilton'; the split then refines the predictions. The first child node (Node number 2) perfectly predicts 'Hamilton' for 35 observations, while the second child node (Node number 3) handles the remaining 19 observations with a mix of authors, primarily 'Madison', but with greater uncertainty in classification (higher expected loss).
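For reference, a minimal sketch of how this model can be fit and plotted is shown below. The rpart() call matches the one in the summary above; the rpart.plot package is an assumption for the tree plots in Figures 1 and 2, since only the figure captions survive here.

library(rpart)        # recursive partitioning trees
library(rpart.plot)   # tree plotting (assumed package for the figures)

# Fit the unpruned tree exactly as in the summary above, using the
# training set created in Section 1: author is predicted from all
# word-frequency columns, with filename excluded from the predictors.
tree1 <- rpart(author ~ . - filename, data = training,
               method = "class", control = rpart.control(cp = 0))

summary(tree1)     # CP table, variable importance, and node details
rpart.plot(tree1)  # draw the fitted tree (Figures 1 and 2)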
Figure 1-Model 1 Plot
Figure 2-Model 1 Plot 2