Assignment 12 - Problems

School: Northeastern University
Course: 6400
Subject: Industrial Engineering
Date: Feb 20, 2024
Uploaded by CaptainSparrowMaster1000

IE6400 Foundations for Data Analytics Engineering
Fall 2023
Assignment 12
Module 2: Probability

Question 1: Monty Hall Problem Simulation and Analysis

Background: The Monty Hall problem is a famous probability puzzle named after the host of the television game show "Let's Make a Deal." The problem goes as follows: A contestant is presented with three doors. Behind one of them is a car (which the contestant wants), and behind the other two are goats. The contestant selects one of the doors, say Door A. The host, Monty Hall, who knows what's behind each door, opens another door, say Door B, revealing a goat. Monty now asks the contestant if they want to stick with their initial choice (Door A) or switch to the remaining door (Door C). The contestant makes a decision: stick or switch. The question is, is it in the contestant's best interest to stick with their initial choice, switch, or does it not matter?

Task: Your goal is to simulate the Monty Hall problem using Python and determine the empirical probabilities of winning the car for both strategies: sticking with the initial choice and switching after Monty reveals a goat.

Dataset: 'monty_hall_trials.csv'

In [ ]:

```python
import pandas as pd
import numpy as np

np.random.seed(999)

def simulate_monty_hall(num_trials):
    doors = ['A', 'B', 'C']
    results = []

    for i in range(num_trials):
        car_location = np.random.choice(doors)
        initial_choice = np.random.choice(doors)

        remaining_doors = [door for door in doors if door != initial_choice and door != car_location]
        monty_reveal = np.random.choice(remaining_doors)

        # Simulating equal probability of sticking or switching
        final_decision = np.random.choice(['Stick', 'Switch'])

        if final_decision == 'Stick':
            win = 1 if initial_choice == car_location else 0
        else:
            switch_to = [door for door in doors if door != initial_choice and door != monty_reveal][0]
            win = 1 if switch_to == car_location else 0

        results.append([i+1, initial_choice, monty_reveal, car_location, final_decision, win])

    return pd.DataFrame(results, columns=['trial', 'initial_choice', 'monty_reveal', 'actual_car_location', 'final_decision', 'win'])

df = simulate_monty_hall(1000)
df.to_csv('monty_hall_trials.csv', index=False)
```

The dataset contains six columns:
- trial: The trial number.
- initial_choice: The initial door chosen by the contestant.
- monty_reveal: The door Monty reveals to have a goat.
- actual_car_location: The door behind which the car is actually located.
- final_decision: The contestant's final decision, either "Stick" or "Switch".
- win: Whether the contestant won the car (1 for win, 0 for lose).

Requirements:
1. Data Loading and Preprocessing:
   - Load the dataset monty_hall_trials.csv into a Pandas DataFrame.
   - Check for any missing or inconsistent data entries and handle them.
   - Display a summary of the dataset.
2. Simulation Analysis:
   - Calculate the empirical probability of winning the car for both strategies: sticking and switching.
3. Visualization:
   - Plot a bar chart comparing the winning probabilities for both strategies.
   - Ensure the graph is appropriately labeled with a relevant title and annotations.
4. Interpretation:
   - Discuss the empirical results in the context of the theoretical probabilities.
   - Offer insights into the optimal strategy for a contestant based on the simulation results.

Evaluation Criteria:
- Correctness and efficiency of the Python code.
- Proper handling and preprocessing of the dataset.
- Accurate calculation and interpretation of empirical probabilities.
- Quality and clarity of visualizations.
- Insightful interpretations and conclusions regarding the Monty Hall problem.

Question 2: Poisson Process Analysis of Website Hits

Background: A Poisson process is a mathematical model for events that happen at random points in time and space, where the average rate of occurrence is constant and known.
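
In symbols, under the standard definition of a homogeneous Poisson process, the number of events in an interval follows a Poisson distribution. The symbols below are introduced here for illustration and do not appear in the assignment: $\lambda$ is the constant rate per unit time, $t$ is the interval length, and $N(t)$ is the number of events in that interval.

```latex
P\bigl(N(t) = k\bigr) = \frac{e^{-\lambda t}\,(\lambda t)^{k}}{k!},
\qquad k = 0, 1, 2, \dots,
\qquad \mathbb{E}[N(t)] = \operatorname{Var}[N(t)] = \lambda t .
```

With hourly intervals ($t = 1$), the expected count per hour is simply $\lambda$.
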
A common application of this process is in modeling the number of times a website is accessed over a given time interval.

Scenario: You are a data analyst at a tech company. The company's main website has been receiving hits, and you suspect that the hits can be modeled as a Poisson process. Your task is to analyze the website hits data and verify if it indeed follows the Poisson distribution.

Dataset: 'website_hits.csv'

In [ ]:

```python
import pandas as pd
import numpy as np

np.random.seed(12345)

# Generating hits using Poisson distribution
# Assuming mean hits per hour is 6
hits_per_hour = np.random.poisson(lam=6, size=24)
time_intervals = [f"{i}-{i+1}" for i in range(24)]

df = pd.DataFrame({
    'time_interval': time_intervals,
    'hits': hits_per_hour
})

df.to_csv('website_hits.csv', index=False)
```

The dataset contains two columns:
- time_interval: Represents hourly intervals over a 24-hour period (e.g., "0-1" represents the interval from midnight to 1 AM).
- hits: The number of website hits recorded during the corresponding time interval.

Requirements:
1. Data Loading and Preprocessing:
   - Load the dataset website_hits.csv into a Pandas DataFrame.
   - Check for any missing or inconsistent data entries and handle them.
   - Display the basic statistics of the dataset.
2. Poisson Distribution Fitting:
   - Calculate the mean hit rate from the data.
   - Using the calculated mean, generate the expected hit frequencies for each hour if the process follows a Poisson distribution.
3. Visualization:
   - Plot a bar chart showing both the observed and expected hits for each hourly interval.
   - The bars for observed and expected hits should be side-by-side for comparison.
   - Ensure the graph is properly labeled with a relevant title, legend, and annotations.
4. Hypothesis Testing:
   - Conduct a goodness-of-fit test (e.g., chi-squared test) to determine if the observed hits significantly differ from a Poisson distribution with the calculated mean rate.
5. Interpretation:
   - Discuss the results of the visualization and hypothesis test.
   - Provide insights and recommendations to the company based on your findings.

Evaluation Criteria:
- Correctness and efficiency of the Python code.
- Proper handling and preprocessing of the dataset.
- Accurate fitting of the Poisson distribution and calculation of expected frequencies.
- Quality and clarity of visualizations.
- Thoroughness of hypothesis testing.
- Insightful interpretations and conclusions drawn from the analysis.
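
The fitting and testing steps in Question 2 can be read in more than one way. The sketch below takes one possible reading: it estimates the rate as the sample mean, groups the hours by how many hits they received, and compares those observed frequencies with the frequencies a Poisson distribution would predict. The hit counts, the bin edges, and all variable names are hypothetical and chosen only for illustration. They are not the contents of website_hits.csv. Only the standard library is used, so the block runs without NumPy or pandas.

```python
import math

# Hypothetical hourly hit counts (24 values), NOT the contents of website_hits.csv.
hits = [6, 4, 7, 5, 9, 6, 3, 8, 6, 5, 7, 4,
        6, 10, 5, 6, 7, 4, 8, 6, 5, 7, 4, 6]

def poisson_pmf(k, lam):
    """P(K = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """P(K <= k); an empty sum gives 0 when k < 0."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# The sample mean is the maximum-likelihood estimate of the Poisson rate.
lam_hat = sum(hits) / len(hits)

# Bins over "hits in one hour": (low, high) inclusive, None = open-ended.
# Assumed bin edges, picked so each expected count is at least 5.
bins = [(0, 4), (5, 6), (7, None)]

observed, expected = [], []
for low, high in bins:
    if high is None:
        prob = 1.0 - poisson_cdf(low - 1, lam_hat)
        obs = sum(1 for h in hits if h >= low)
    else:
        prob = poisson_cdf(high, lam_hat) - poisson_cdf(low - 1, lam_hat)
        obs = sum(1 for h in hits if low <= h <= high)
    observed.append(obs)
    expected.append(len(hits) * prob)  # expected number of hours in this bin

# Pearson chi-squared statistic: sum of (O - E)^2 / E over the bins.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Degrees of freedom: bins - 1, minus 1 more for the estimated mean.
dof = len(bins) - 1 - 1

print("estimated rate:", lam_hat)
print("observed:", observed)
print("expected:", [round(e, 2) for e in expected])
print("chi-squared statistic:", round(chi2_stat, 3), "with", dof, "degree(s) of freedom")
```

The p-value would then come from the upper tail of a chi-squared distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(chi2_stat, dof)` if SciPy is available. With only 24 hourly observations the bins have to be coarse, so the test has limited power to detect departures from the Poisson model.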

Question 3: Bayesian Analysis of Product Review Sentiments

Background: Bayesian statistics is a branch of probability theory that applies probability to