School: Pennsylvania State University
Course: 200
Subject: Computer Science
Date: Dec 6, 2023
Uploaded by GeneralSummer13484
Homework Assignment 1
DS200: Introduction to Data Sciences
2022 fall
Please complete this assignment by entering your answers in this document. You can submit
this Word document or a PDF export on Canvas.
Problem 1: Sampling [0.5 points]
Suppose that you work at a hospital, and you have to recruit participants for a medical study to
test a new heart disease medication. Match the three examples below to the three sampling
approaches.
Examples:
1. You try to recruit the 100 patients with the highest blood pressure.
2. You order all patients by age and try to recruit every 500th patient, starting with a randomly chosen patient from the first 500.
3. You store patient identifiers in an array called patients, apply numpy.random.choice(patients, 100), and try to recruit the patients returned by this function.
Sampling approaches:
deterministic
systematic random
simple random
Answer:
1: deterministic sampling
2: systematic random sampling
3: simple random sampling
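The three approaches can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the patient count, ages, and blood pressures are made-up data, and the sketch uses NumPy's `default_rng` generator where the assignment text uses `numpy.random.choice`. Note that both forms of `choice` sample with replacement by default, so `replace=False` is passed here to avoid selecting the same patient twice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

# Hypothetical data: 5,000 patient IDs with made-up ages and blood pressures.
n_patients = 5000
patients = np.arange(n_patients)
age = rng.integers(18, 90, size=n_patients)
blood_pressure = rng.normal(130, 15, size=n_patients)

# 1. Deterministic: the 100 patients with the highest blood pressure.
#    No randomness is involved; the same patients are chosen every time.
deterministic = patients[np.argsort(blood_pressure)[-100:]]

# 2. Systematic random: order by age, pick a random start among the
#    first 500, then take every 500th patient from there.
by_age = patients[np.argsort(age, kind="stable")]
start = rng.integers(0, 500)
systematic = by_age[start::500]

# 3. Simple random: 100 patients drawn uniformly at random.
#    replace=False prevents duplicates (the default is replace=True).
simple = rng.choice(patients, 100, replace=False)

print(len(deterministic), len(systematic), len(simple))
```

With 5,000 patients and a step of 500, the systematic sample always contains 10 patients, whatever the random start is.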
Problem 2: Distribution [1 point]
Consider the following distribution of values:
[Figure not shown: a distribution of values with four marked positions, a to d]
Identify the following:
median
outlier
1st quartile
95th percentile
Answer:
a: 1st quartile
b: median
c: 95th percentile
d: outlier
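The four quantities can also be computed directly, which may help when checking such a figure by hand. The sketch below uses only Python's standard `statistics` module on a small hypothetical list of values (not the data from the figure). The 1.5 × IQR rule for flagging outliers is an assumed convention; the assignment itself does not define one.

```python
import statistics

# Hypothetical values; 95 is deliberately far from the rest.
values = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 10, 95]

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
median = statistics.median(values)

# 95th percentile: the 19th of the 19 cut points that split the data
# into 20 equal parts.
p95 = statistics.quantiles(values, n=20, method="inclusive")[18]

# Assumed rule of thumb: a value is an outlier if it lies more than
# 1.5 * IQR below the 1st quartile or above the 3rd quartile.
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("1st quartile:", q1)
print("median:", median)
print("95th percentile:", p95)
print("outliers:", outliers)
```

On any distribution the ordering 1st quartile ≤ median ≤ 95th percentile holds, which matches the left-to-right order a, b, c in the answer above.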
Problem 3: Association [1 point]
Consider the following three scatter plots: What kind of association can you observe between X and Y in each figure? Explain your answer.
Hint: Possible kinds of association are:
positive association
negative association
no association
Answer:
First figure: negative association
Second figure: positive association
Third figure: no association
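One way to back up such a visual judgment is the sign of the Pearson correlation coefficient: clearly negative, clearly positive, or close to zero. The sketch below is a minimal illustration on three hypothetical data sets, since the points in the original scatter plots are not available. The helper `pearson_r` is written out by hand for clarity and measures linear association only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical points standing in for the three scatter plots.
x = [1, 2, 3, 4, 5, 6]
falling = [12, 10, 9, 7, 4, 3]   # y tends to decrease as x increases
rising = [2, 3, 5, 8, 9, 12]     # y tends to increase as x increases
flat = [5, 9, 4, 4, 9, 5]        # no overall trend in y

r_falling = pearson_r(x, falling)  # negative association
r_rising = pearson_r(x, rising)    # positive association
r_flat = pearson_r(x, flat)        # no (linear) association

print(r_falling, r_rising, r_flat)
```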
Problem 4: Causality [1 point]
Suppose that a positive association was observed between the following three variables:
number of tooth cavities,
ounces of sugary drinks consumed,
weight in pounds.
Which of these three variables might be a confounding variable? What spurious conclusion may
it cause? Motivate your answer.
Answer:
Although the correlation between the number of tooth cavities and weight in
pounds may be positive, that does not mean that higher weight causes more
cavities or that more cavities cause higher weight. Ounces of sugary drinks
consumed is a confounding variable: drinking more sugary drinks leads, on
average, to both higher weight and a greater number of cavities.
The correlation between the two remaining variables (number of cavities and
weight) is therefore most likely spurious: it reflects their shared cause,
not a causal link between them.
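A small simulation can make the confounding argument concrete. The sketch below assumes a toy model in which sugary drink consumption drives both cavities and weight, while cavities and weight have no direct effect on each other. All coefficients and noise levels are invented for illustration. Cavities and weight end up strongly correlated overall, but the correlation nearly vanishes once drink consumption is held roughly constant.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed, for reproducibility only

# Assumed toy model: drinks influence both outcomes; neither outcome
# influences the other.
n = 5000
drinks = [rng.uniform(0, 60) for _ in range(n)]          # ounces consumed
cavities = [0.1 * d + rng.gauss(0, 1) for d in drinks]   # cavity score
weight = [140 + 1.0 * d + rng.gauss(0, 10) for d in drinks]  # pounds

# Correlation across everyone: positive, despite no direct causal link.
overall = pearson_r(cavities, weight)

# Hold the confounder roughly fixed: only people drinking 28 to 32 ounces.
band = [i for i, d in enumerate(drinks) if 28 <= d <= 32]
within = pearson_r([cavities[i] for i in band], [weight[i] for i in band])

print("overall correlation:", round(overall, 2))
print("correlation at fixed drink consumption:", round(within, 2))
```

Restricting to a narrow band of the confounder is a crude form of stratification; it is used here only because it needs no extra libraries.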