lab01_solutions
October 4, 2022

1 Lab 1: Basics of Testing

Welcome to the first Data 102 lab! The goals of this lab are to get familiar with concepts in decision theory. We will learn more about testing, p-values, and FDR control.

The code you need to write is commented out with a message "TODO: fill…". There is additional documentation for each part as you go along.

1.1 Collaboration Policy

Data science is a collaborative activity. While you may talk with others about the labs, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the cell below.

1.2 Submission

To submit this assignment, rerun the notebook from scratch (by selecting Kernel > Restart & Run All), then print it as a PDF (File > Download as > PDF) and submit it to Gradescope. For full credit, this assignment should be completed and submitted before Friday, September 9, 2022 at 11:59 PM PST.

1.3 Collaborators

Write the names of your collaborators in this cell.

<Collaborator Name> <Collaborator e-mail>

2 Setup

Let's begin by importing the libraries we will use. You can find the documentation for the libraries here:
* matplotlib: https://matplotlib.org/3.1.1/contents.html
* numpy: https://docs.scipy.org/doc/
* pandas: https://pandas.pydata.org/pandas-docs/stable/
* seaborn: https://seaborn.pydata.org/

[1]: import matplotlib.pyplot as plt
     import numpy as np
     import pandas as pd
     import seaborn as sns
     import scipy.stats
     from scipy.stats import norm
     import hashlib

     %matplotlib inline
     sns.set(style="dark")
     plt.style.use("ggplot")

     def get_hash(num):  # <- helper function for assessing correctness
         return hashlib.md5(str(num).encode()).hexdigest()

3 Question 1: Hypothesis testing, LRT, decision rules, P-values

The first question looks at the basics of testing. You will have to put yourself in the shoes of a detective who is trying to use 'evidence' to find the 'truth'. Given a piece of evidence X, your job will be to decide between two hypotheses. The two hypotheses you consider are:

The null hypothesis: H_0 : X ∼ N(0, 1)
The alternative hypothesis: H_1 : X ∼ N(2, 1)

Granted, you don't know the truth, but you have to make a decision that maximizes the True Positive Probability and minimizes the False Positive Probability.

In this exercise you will look at:
- The intuitive relationship between the Likelihood Ratio Test and decisions based on thresholding X.
- The performance of a level-α test.
- The distribution of p-values for samples from the null distribution as well as samples from the alternative.

Let's start by plotting the distributions of the null and alternative hypotheses.

[2]: # NOTE: you just need to run this cell to plot the pdfs; don't change this code.
     def null_pdf(x):
         return norm.pdf(x, 0, 1)

     def alt_pdf(x):
         return norm.pdf(x, 2, 1)

     # Plot the distribution under the null and alternative
     x_axis = np.arange(-4, 6, 0.001)

     plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
     plt.fill_between(x_axis, null_pdf(x_axis), alpha=0.3)
     plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')   # <- likelihood under the alternative
     plt.fill_between(x_axis, alt_pdf(x_axis), alpha=0.3)
     plt.xlabel("X")
     plt.ylabel("Likelihood")
     plt.title("Comparison of null and alternative likelihoods")
     plt.legend()
     plt.tight_layout()
     plt.show()

By inspecting the image above we can see that if the data lies towards the right, then it seems more plausible that the alternative is true. For example, X ≥ 1.64 seems much less likely to belong to the null pdf than the alternative pdf.

3.0.1 Likelihood Ratio Test

In class we said that the optimal test is the Likelihood Ratio Test (LRT), which is the result of the celebrated Neyman-Pearson Lemma. It says that the optimal level-α test is the one that rejects the null (aka makes a discovery, favors the alternative) whenever

$$LR(x) := \frac{f_1(x)}{f_0(x)} \geq \eta,$$

where η is chosen such that the false positive rate is equal to α.
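The lab does not ask you to compute η explicitly, but here is a minimal sketch (not part of the lab code) of how it could be chosen for a target level; α = 0.05 is just an assumed example value, and the snippet reuses the null_pdf and alt_pdf helpers from the setup cell:

     alpha = 0.05                                  # assumed example level, not specified by the lab
     x_thresh = norm.ppf(1 - alpha, 0, 1)          # X-threshold whose FPR under H_0 is alpha
     eta = alt_pdf(x_thresh) / null_pdf(x_thresh)  # likelihood ratio evaluated at that threshold
     print(x_thresh, eta)                          # roughly 1.645 and 3.63

Because larger values of X only make the alternative more likely here, rejecting whenever LR(X) ≥ η is the same as rejecting whenever X ≥ x_thresh, which is exactly the connection the next part formalizes.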
3.0.2 But how does this result fit with the intuition that we should set a decision threshold based on the value of X directly?

This exercise will formalize that intuition.

Let's start by computing the ratio of the likelihoods. The likelihood of X ∼ N(μ, σ) is

$$f_{\mu,\sigma}(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

Luckily, scipy has a nifty function to compute the likelihood of Gaussians: scipy.stats.norm.pdf(x, mu, sigma).

3.1 Part 1.a: Calculate likelihood ratios

Complete the function below that computes the likelihood ratio for any value x.

[3]: # TODO: fill in the missing expression for the likelihood ratio in the function below
     def calculate_likelihood_ratio(x):
         """
         Computes the likelihood ratio between the alternative and null hypothesis.

         Inputs:
             x: value for which to compute the likelihood ratio
         Outputs:
             LR: the likelihood ratio at point x
         """
         L0 = null_pdf(x)
         L1 = alt_pdf(x)
         LR = L1 / L0  # TODO: fill in the likelihood ratio
         return LR

[4]: # Compute the likelihood ratio for X = 1.64
     X = 1.64
     LR = calculate_likelihood_ratio(X)
     print(LR)

     assert get_hash(LR) == 'f9983e1a6585502f3006cb6d1c1edec3'
     print("Test passed!")

3.59663972556928
Test passed!
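As a sanity check on this value (a short derivation that is not in the lab handout), the likelihood ratio for these two unit-variance Gaussians has a simple closed form, since the normalizing constants cancel:

$$LR(x) = \frac{f_1(x)}{f_0(x)} = \frac{e^{-(x-2)^2/2}}{e^{-x^2/2}} = e^{2x - 2},$$

so LR(1.64) = e^{1.28} ≈ 3.5966, matching the printed output. Since e^{2x-2} is strictly increasing in x, any threshold on the likelihood ratio corresponds to a threshold on X itself.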
Let's plot the likelihood ratios for different values of X:

[5]: # The code below plots the LR for different values of X.
     # Once you've filled in `calculate_likelihood_ratio`, run this cell and inspect the plot.
     x_axis = np.arange(-1, 3, 0.001)

     plt.plot(x_axis, calculate_likelihood_ratio(x_axis))
     plt.vlines(X, 0, LR, linestyle="dotted", color='k')
     plt.hlines(LR, -1, X, linestyle="dotted", color='k')
     plt.scatter(X, LR, 30, color='k')

     plt.xlabel("X")
     plt.ylabel("Likelihood Ratio")
     plt.title("Comparison of null and alternative likelihoods")
     plt.tight_layout()
     plt.show()

The plot above illustrates that deciding based on the LRT with η = 3.6 (the dotted horizontal line) is equivalent to deciding in favor of the alternative whenever X ≥ 1.64 (the dotted vertical line).

The set [1.64, +∞) is called the rejection region of the test, because for all X values in the rejection region the test rejects the null in favor of the alternative. This illustrates that our intuition was correct.

When thinking in terms of likelihood ratios it seems very tricky to compute the False Positive Rate (FPR); however, in this case we can bypass that by testing based on the value of X.
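As a quick numerical check (not part of the lab) that the LRT with η ≈ 3.6 and the rule X ≥ 1.64 make identical decisions, the two rules can be compared point by point; the grid below is an arbitrary choice, offset slightly so no point lands exactly on the boundary:

     # Compare the LRT rule and the X-threshold rule on a grid of values.
     eta = calculate_likelihood_ratio(1.64)
     grid = np.arange(-4.005, 6, 0.01)              # offset start so no grid point sits exactly at 1.64
     reject_lrt = calculate_likelihood_ratio(grid) >= eta
     reject_threshold = grid >= 1.64
     print(np.all(reject_lrt == reject_threshold))  # expected: True

Both rules flag exactly the same points, which is another way of seeing that the rejection region [1.64, +∞) describes the LRT with this η.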
The figure below illustrates pictorially the FPR when testing based on the threshold X ≥ 1.64.

[6]: x_axis = np.arange(-4, 5, 0.001)
     plt.plot(x_axis, null_pdf(x_axis), label='$H_0$')  # <- likelihood under the null
     plt.plot(x_axis, alt_pdf(x_axis), label='$H_1$')   # <- likelihood under the alternative

     # <- truncate the true rejection region for plotting purposes
     rejection_region = np.arange(X, 5, 0.001)
     plt.fill_between(rejection_region, null_pdf(rejection_region), alpha=0.3, label="FPR")

     plt.xlabel("X")
     plt.ylabel("Likelihood")
     plt.title("Comparison of null and alternative likelihoods")
     plt.legend()
     plt.tight_layout()
     plt.show()
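To attach numbers to the shaded area (a small addition, not part of the lab), the FPR of the X ≥ 1.64 rule is the null tail probability beyond 1.64, and the TPR is the same tail probability under the alternative; scipy's survival function norm.sf (i.e., 1 - CDF) gives both directly:

     fpr = norm.sf(1.64, 0, 1)  # P(X >= 1.64 | H_0), roughly 0.0505
     tpr = norm.sf(1.64, 2, 1)  # P(X >= 1.64 | H_1), roughly 0.64
     print(fpr, tpr)

So this threshold gives a test at (approximately) level 0.05.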