lab01_solutions
October 4, 2022
1 Lab 1: Basics of Testing
Welcome to the first Data 102 lab!
The goals of this lab are to get familiar with concepts in decision theory. We will learn more about
testing, p-values and FDR control.
The code you need to write is commented out with a message “TODO: fill…”. There is additional
documentation for each part as you go along.
1.1 Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the labs, we ask that
you write your solutions individually. If you do discuss the assignments with others, please
include their names in the cell below.
1.2 Submission
To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run
all), then print it as a pdf (File > download as > pdf) and submit it to Gradescope.
For full credit, this assignment should be completed and submitted before Friday,
September 9, 2022 at 11:59 PM PST.
1.3 Collaborators
Write the names of your collaborators in this cell.
<Collaborator Name> <Collaborator e-mail>
2 Setup
Let’s begin by importing the libraries we will use. You can find the documentation for the libraries here:
* matplotlib: https://matplotlib.org/3.1.1/contents.html
* numpy: https://docs.scipy.org/doc/
* pandas: https://pandas.pydata.org/pandas-docs/stable/
* seaborn: https://seaborn.pydata.org/
[1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats
from scipy.stats import norm
import hashlib
%matplotlib inline
sns.set(style="dark")
plt.style.use("ggplot")

def get_hash(num):  # <- helper function for assessing correctness
    return hashlib.md5(str(num).encode()).hexdigest()
3 Question 1: Hypothesis testing, LRT, decision rules, P-values.
The first question looks at the basics of testing. You will have to put yourself in the shoes of a
detective who is trying to use ‘evidence’ to find the ‘truth’. Given a piece of evidence $X$, your job
will be to decide between two hypotheses. The two hypotheses you consider are:
The null hypothesis: $H_0: X \sim \mathcal{N}(0, 1)$
The alternative hypothesis: $H_1: X \sim \mathcal{N}(2, 1)$
Granted, you don’t know the truth, but you have to make a decision that makes the True
Positive Probability as large as possible while keeping the False Positive Probability small.
In this exercise you will look at:
- The intuitive relationship between the Likelihood Ratio Test and decisions based on thresholding $X$.
- The performance of a level-$\alpha$ test.
- The distribution of p-values for samples from the null distribution as well as samples from the alternative.
Let’s start by plotting the distributions of the null and alternative hypothesis.
[2]:
# NOTE: you just need to run this cell to plot the pdf; don't change this code.
def null_pdf(x):
    return norm.pdf(x, 0, 1)

def alt_pdf(x):
    return norm.pdf(x, 2, 1)

# Plot the distribution under the null and alternative
x_axis = np.arange(-4, 6, 0.001)
plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
plt.fill_between(x_axis, null_pdf(x_axis), alpha=0.3)
plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')  # <- likelihood alternative
plt.fill_between(x_axis, alt_pdf(x_axis), alpha=0.3)
plt.xlabel("X")
plt.ylabel("Likelihood")
plt.title("Comparison of null and alternative likelihoods");
plt.legend()
plt.tight_layout()
plt.show()
By inspecting the image above we can see that if the data lies towards the right, then it seems
more plausible that the alternative is true. For example, an observation $X \geq 1.64$ is much less
likely under the null pdf than under the alternative pdf.
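To put numbers on this visual impression, the tail probability of the event $X \geq 1.64$ can be computed under each hypothesis. This is a minimal sketch rather than part of the lab; the variable names are illustrative, and it uses `scipy.stats.norm.sf`, the survival function $P(X \geq x)$.

```python
from scipy.stats import norm

threshold = 1.64  # the example cutoff used in the text

# Survival function sf(x) = P(X >= x) for a continuous distribution.
p_tail_null = norm.sf(threshold, loc=0, scale=1)  # P(X >= 1.64 | H_0)
p_tail_alt = norm.sf(threshold, loc=2, scale=1)   # P(X >= 1.64 | H_1)

print(f"P(X >= {threshold} | H_0) = {p_tail_null:.4f}")
print(f"P(X >= {threshold} | H_1) = {p_tail_alt:.4f}")
```

The null assigns roughly 5% probability to this region, while the alternative assigns roughly 64%, which is why an observation there favors the alternative.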
3.0.1 Likelihood Ratio Test
In class we said that the optimal test is the Likelihood Ratio Test (LRT), which is the result of the
celebrated Neyman-Pearson Lemma. It says that the optimal level-$\alpha$ test is the one that rejects
the null (aka makes a discovery, favors the alternative) whenever:
$$LR(x) := \frac{f_1(x)}{f_0(x)} \geq \eta$$
where $f_0$ and $f_1$ are the densities of $X$ under $H_0$ and $H_1$, and $\eta$ is chosen such that
the false positive rate is equal to $\alpha$.
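One way to see what "the false positive rate of a threshold $\eta$" means is to simulate it: draw many samples from the null and count how often the likelihood ratio reaches $\eta$. The sketch below is an illustration only; the value of `eta`, the sample size, the seed, and the helper name `likelihood_ratio` are all assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

def likelihood_ratio(x):
    # f_1(x) / f_0(x) for H_1: N(2, 1) and H_0: N(0, 1)
    return norm.pdf(x, 2, 1) / norm.pdf(x, 0, 1)

eta = 3.6            # illustrative LRT threshold (an assumption for this sketch)
n_samples = 200_000  # number of simulated draws under the null

null_samples = rng.normal(loc=0, scale=1, size=n_samples)
# The fraction of null draws that the LRT rejects estimates the FPR.
fpr_estimate = np.mean(likelihood_ratio(null_samples) >= eta)
print(f"Estimated FPR for eta = {eta}: {fpr_estimate:.4f}")
```

Raising `eta` shrinks the estimated FPR and lowering it grows the FPR, so $\eta$ can be tuned until the FPR matches the desired level $\alpha$.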
3.0.2 But how does this result fit with the intuition that we should set a decision threshold based on the value of $X$ directly?
This exercise will formalize that intuition:
Let’s start by computing the ratio of the likelihoods. The likelihood of $X \sim \mathcal{N}(\mu, \sigma^2)$,
with mean $\mu$ and standard deviation $\sigma$, is:
$$f_{\sigma,\mu}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Luckily scipy has a nifty function to compute the likelihood of Gaussians: scipy.stats.norm.pdf(x, mu, sigma)
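As a quick sanity check, the density formula can be transcribed directly and compared against `scipy.stats.norm.pdf`. This is a small sketch; `gaussian_pdf` is a hypothetical helper written for this comparison and is not part of the lab code.

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # Direct transcription of the density formula; sigma is the standard deviation.
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-4, 6, 101)
manual = gaussian_pdf(x, mu=2.0, sigma=1.0)
library = norm.pdf(x, 2.0, 1.0)  # positional arguments are loc (mean) and scale (std)

print(np.allclose(manual, library))
```

Note that the third argument of `norm.pdf` is the standard deviation, not the variance, which matters whenever $\sigma \neq 1$.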
3.1 Part 1.a: Calculate likelihood ratios
Complete the function below that computes the likelihood ratio for any value x.
[3]:
# TODO: fill in the missing expression for the likelihood ratio in the function below
def calculate_likelihood_ratio(x):
    """
    Computes the likelihood ratio between the alternative and null hypothesis.
    Inputs:
        x: value for which to compute the likelihood ratio
    Outputs:
        lr : the likelihood ratio at point x
    """
    L0 = null_pdf(x)
    L1 = alt_pdf(x)
    LR = L1 / L0  # TODO: fill the likelihood ratio
    return LR
[4]:
# Compute the likelihood ratio for X=1.64
X = 1.64
LR = calculate_likelihood_ratio(X)
print(LR)
assert(get_hash(LR) == 'f9983e1a6585502f3006cb6d1c1edec3')
print("Test passed!")
3.59663972556928
Test passed!
Let’s plot the likelihood ratios for different values of $X$:
[5]:
# The code below plots the LR for different values of X
# Once you've filled in `calculate_likelihood_ratio` run this cell and inspect the plot
x_axis = np.arange(-1, 3, 0.001)
plt.plot(x_axis, calculate_likelihood_ratio(x_axis))
plt.vlines(X, 0, LR, linestyle="dotted", color='k')
plt.hlines(LR, -1, X, linestyle="dotted", color='k')
plt.scatter(X, LR, 30, color='k')
plt.xlabel("X")
plt.ylabel("Likelihood Ratio")
plt.title("Comparison of null and alternative likelihoods");
plt.tight_layout()
plt.show()
The plot above illustrates that deciding based on the LRT with $\eta = 3.6$ (the dotted horizontal line)
is equivalent to deciding in favor of the alternative whenever $X \geq 1.64$ (the dotted vertical
line), because the likelihood ratio is strictly increasing in $X$. The set $[1.64, +\infty)$ is called the
rejection region of the test, because for all $X$ values in the rejection region the test rejects the
null in favor of the alternative. This illustrates that our intuition was correct.
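The equivalence can also be checked by hand. The short derivation below is a sketch that plugs the two densities from this question, $\mathcal{N}(0, 1)$ and $\mathcal{N}(2, 1)$, into the likelihood ratio; the normalizing constants cancel because both have unit variance.

```latex
\begin{align*}
LR(x) = \frac{f_1(x)}{f_0(x)}
      &= \frac{\exp\!\left(-\tfrac{(x-2)^2}{2}\right)}{\exp\!\left(-\tfrac{x^2}{2}\right)}
       = \exp\!\left(\frac{x^2 - (x-2)^2}{2}\right)
       = \exp(2x - 2), \\
LR(x) \geq \eta
      &\iff 2x - 2 \geq \ln \eta
       \iff x \geq 1 + \tfrac{1}{2}\ln \eta .
\end{align*}
```

For $\eta = 3.6$ this gives $x \geq 1 + \tfrac{1}{2}\ln 3.6 \approx 1.64$, matching the dotted lines in the plot.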
When thinking in terms of likelihood ratios it seems very tricky to compute the False Positive Rate
(FPR); however, in this case we can bypass that by testing based on the value of $X$.
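Working with a threshold on $X$ makes the error rates one-line computations. The sketch below goes in the reverse direction: it starts from a target level, finds the cutoff on $X$, and reports the matching likelihood ratio threshold. The value `alpha = 0.05` and the variable names are assumptions for illustration; the step from cutoff to $\eta$ relies on the likelihood ratio being increasing in $X$ for this pair of hypotheses.

```python
from scipy.stats import norm

alpha = 0.05  # illustrative target false positive rate (an assumption)

# Under H_0: N(0, 1), P(X >= x_threshold) = alpha gives the cutoff.
x_threshold = norm.isf(alpha, loc=0, scale=1)

# Likelihood ratio at the cutoff: the eta that yields the same test.
eta = norm.pdf(x_threshold, 2, 1) / norm.pdf(x_threshold, 0, 1)

fpr = norm.sf(x_threshold, loc=0, scale=1)  # equals alpha by construction
tpr = norm.sf(x_threshold, loc=2, scale=1)  # power against H_1: N(2, 1)

print(f"x_threshold = {x_threshold:.4f}, eta = {eta:.4f}")
print(f"FPR = {fpr:.4f}, TPR = {tpr:.4f}")
```

The cutoff lands close to the 1.64 used in the text, which is why that threshold corresponds to a test at roughly the 5% level.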
The figure below illustrates pictorially the FPR when testing based on the threshold $X \geq 1.64$:
[6]:
x_axis = np.arange(-4, 5, 0.001)
plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')  # <- likelihood alternative
rejection_region = np.arange(X, 5, 0.001)  # <- truncate the true rejection region for plotting purposes
plt.fill_between(rejection_region, null_pdf(rejection_region), alpha=0.3, label="FPR")
plt.xlabel("X")
plt.ylabel("Likelihood")
plt.title("Comparison of null and alternative likelihoods");
plt.legend()
plt.tight_layout()
plt.show()