School

Temple University

Course

1013

Subject

Computer Science

Date

Dec 6, 2023

Uploaded by samzahroun

Lab 6: Simulation

Elements of Data Science

Welcome to Module 2 and lab 6! This week, we will go over conditionals and iteration, and introduce the concept of randomness. All of this material is covered in Chapter 9 and Chapter 10 of the online Inferential Thinking textbook.

First, set up the tests and imports by running the cell below.

In [343]:
name = "Sam"

In [344]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import os
user = os.getenv('JUPYTERHUB_USER')
from gofer.ok import check

1. Sampling

1.1 Dungeons and Dragons and Sampling

In the game Dungeons & Dragons, each player plays the role of a fantasy character. A player performs actions by rolling a 20-sided die, adding a "modifier" number to the roll, and comparing the total to a threshold for success. The modifier depends on her character's competence in performing the action.

For example, suppose Alice's character, a barbarian warrior named Roga, is trying to knock down a heavy door. She rolls a 20-sided die, adds a modifier of 11 to the result (because her character is good at knocking down doors), and succeeds if the total is greater than 15.

A Medium posting discusses probability in the context of Dungeons and Dragons: https://towardsdatascience.com/understanding-probability-theory-with-dungeons-and-dragons-a36bc69aec88

Question 1.1 Write code that simulates that procedure. Compute three values: the result of Alice's roll (roll_result), the result of her roll plus Roga's modifier (modified_result), and a boolean value indicating whether the action succeeded (action_succeeded). Do not fill in any of the results manually; the entire simulation should happen in code.

Hint: A roll of a 20-sided die is a number chosen uniformly from the array make_array(1, 2, 3, 4, ..., 20). So a roll of a 20-sided die plus 11 is a number chosen uniformly from that array, plus 11.
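As a side check on the procedure described above, the sketch below enumerates all 20 faces of the die to get the exact chance of success, then compares it with a large simulated sample. It is only an illustration: the names `threshold`, `exact_success_chance`, and `simulated_success_chance`, and the use of a seeded `np.random.default_rng`, are assumptions of this sketch and not part of the lab.

```python
import numpy as np

possible_rolls = np.arange(1, 21)   # faces of a 20-sided die
modifier = 11                       # Roga's modifier from the example
threshold = 15                      # success requires total > 15

# Exact chance: fraction of faces whose modified total beats the threshold.
successes = (possible_rolls + modifier) > threshold
exact_success_chance = successes.mean()   # 16 of the 20 faces succeed, i.e. 0.8

# Simulated chance: many independent rolls drawn uniformly from the faces.
rng = np.random.default_rng(0)            # seeded only for repeatability (assumption)
rolls = rng.choice(possible_rolls, size=100_000)
simulated_success_chance = np.mean(rolls + modifier > threshold)

print(exact_success_chance, simulated_success_chance)
```

The simulated fraction should land close to the enumerated one, which is the kind of agreement a 7-run manual estimate can only approximate roughly.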
In [345]:
possible_rolls = np.arange(1,21,1)
roll_result = np.random.choice(possible_rolls)
modified_result = roll_result + 11
action_succeeded = modified_result > 15
# The next line just prints out your results in a nice way
# once you're done. You can delete it if you want.
print(f"On a modified roll of {modified_result}, Alice's action {'succeeded' if action_succeeded else 'failed'}")

On a modified roll of 24, Alice's action succeeded

In [346]:
check('tests/q1.1.py')

Out[346]:
All tests passed!

Question 1.2 Run your cell 7 times to manually estimate the chance that Alice succeeds at this action. (Don't use math or an extended simulation.) Your answer should be a fraction.

In [347]:
rough_success_chance = 7/7

In [348]:
check('tests/q1.2.py')

Out[348]:
All tests passed!

Suppose we don't know that Roga has a modifier of 11 for this action. Instead, we observe the modified roll (that is, the die roll plus the modifier of 11) from each of 7 of her attempts to knock down doors. We would like to estimate her modifier from these 7 numbers.

Question 1.3 Write a Python function called simulate_observations. It should take two arguments, the modifier and num_observations, and it should return an array of num_observations numbers. Each of the numbers should be the modified roll from one simulation. Then, call your function once to compute an array of 7 simulated modified rolls. Name that array observations.

In [349]:
modifier = 11
num_observations = 7
def simulate_observations(modifier, num_observations):
    """Produces an array of num_observations simulated modified die rolls"""
    possible_rolls = np.arange(1,21)
    obs = make_array()
    # Append one modified roll per iteration.
    for num in np.arange(num_observations):
        obs = np.append(obs, (np.random.choice(possible_rolls) + modifier))
    return (obs)
observations = simulate_observations(modifier, num_observations)
observations

Out[349]:
array([ 14., 20., 30., 27., 31., 29., 18.])

In [350]:
check('tests/q1.3.py')

Out[350]:
All tests passed!

Question 1.4 Draw a histogram to display the probability distribution of the modified rolls we might see. Check with a neighbor or a CA to make sure you have the right histogram. Carry this out again using 100 rolls.

In [351]:
num_observations = 100
def roll_sim(mod, num_observations):
    """Produces an array of num_observations simulated modified rolls, used to plot their distribution."""
    possible_rolls = np.arange(1,21)
    modrolls = np.random.choice(possible_rolls, num_observations) + mod
    return modrolls

In [352]:
# We suggest using these bins.
roll_bins = np.arange(1, modifier+2+20, 1)
roll_bins
plt.hist(roll_sim(11,100))

Out[352]:
(array([ 12., 14., 10., 9., 9., 12., 6., 7., 13., 8.]),
 array([ 12. , 13.9, 15.8, 17.7, 19.6, 21.5, 23.4, 25.3, 27.2, 29.1, 31. ]),
 <BarContainer object of 10 artists>)

Estimate the modifier

Now let's imagine we don't know the modifier and try to estimate it from observations. One straightforward (but clearly suboptimal) way to do that is to find the smallest total roll: the smallest roll on a 20-sided die is 1, which is roughly 0, so the smallest total is roughly the modifier. Use a random number for modifier to start and keep this value through the next questions. We will also generate 100 rolls based on the unknown modifier below.

Question 1.5 Using that method, estimate modifier from observations. Name your
estimate min_estimate.

In [353]:
modifier = np.random.randint(1,20) # Generates a random integer modifier from 1 to 19 (the upper bound, 20, is excluded)
observations = simulate_observations(modifier, num_observations)
min_roll = min(observations)
min_estimate = min_roll - 1
min_estimate

Out[353]:
1.0

In [354]:
check('tests/q1.5.py')

Out[354]:
All tests passed!

Estimate the modifier based on the mean of observations.

Question 1.6 Figure out a good estimate based on that quantity. Then, write a function named mean_based_estimator that computes your estimate. It should take an array of modified rolls (like the array observations) as its argument and return an estimate (a single number) of the modifier based on the numbers contained in the array.

In [355]:
def mean_based_estimator(obs):
    """Estimate the roll modifier based on observed modified rolls in the array obs."""
    # A fair 20-sided die averages (1 + 20) / 2 = 10.5, so subtract that from the mean.
    return int(round(np.mean(obs) - 10.5))
# Here is an example call to your function. It computes an estimate
# of the modifier from our observations.
mean_based_estimate = mean_based_estimator(observations)
print(mean_based_estimate)

1

In [356]:
check('tests/q1.6.py')

Out[356]:
All tests passed!

Question 1.7 Construct a histogram and compare it to the estimates above. Are they consistent? What is your best estimate of the random modifier based on the above, without examining the value?

In [357]:
plt.hist(observations, bins = roll_bins) # Use to plot histogram of an array of 100 modified rolls
estimated_modifier = mean_based_estimator(observations)
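To see how the two estimation ideas in this section behave over many repetitions, here is a minimal sketch that applies a min-based and a mean-based estimate to repeated samples of 7 modified rolls. It is an illustration under stated assumptions: the helper names, the fixed `true_modifier` of 11, the seeded generator, and subtracting 10.5 (the average of a fair 20-sided die) in the mean-based version are choices of this sketch, not the lab's graded solution.

```python
import numpy as np

def simulate_rolls(modifier, num_observations, rng):
    """Return num_observations modified d20 rolls (die roll + modifier)."""
    return rng.integers(1, 21, size=num_observations) + modifier

def min_based_estimate(obs):
    # The smallest possible die roll is 1, so subtract 1 from the smallest total.
    return obs.min() - 1

def mean_based_estimate(obs):
    # A fair d20 averages 10.5, so subtract that from the sample mean.
    return obs.mean() - 10.5

rng = np.random.default_rng(0)   # seeded only for repeatability (assumption)
true_modifier = 11               # hypothetical "unknown" modifier
num_trials = 2000

min_estimates = np.empty(num_trials)
mean_estimates = np.empty(num_trials)
for i in range(num_trials):
    obs = simulate_rolls(true_modifier, 7, rng)
    min_estimates[i] = min_based_estimate(obs)
    mean_estimates[i] = mean_based_estimate(obs)

print("average min-based estimate: ", min_estimates.mean())
print("average mean-based estimate:", mean_estimates.mean())
```

The min-based estimate can never fall below the true modifier, so on average it overshoots, while the mean-based estimate should center near the true value at the cost of more spread from sample to sample.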

In [358]:
check('tests/q1.7.py')

Out[358]:
All tests passed!

2. Sampling and GC content of DNA sequence

DNA within a cell contains codes or sequences for the ultimate synthesis of proteins. DNA is made up of four types of nucleotides, guanine (G), cytosine (C), adenine (A), and thymine (T), connected in an ordered sequence. These nucleotides on a single strand pair with complementary nucleotides on a second strand: G pairs with C and A with T. Regions of DNA code for RNA, which ultimately directs protein synthesis. These segments are known as genes, and they often have higher GC content.

Here we will sample 10-nucleotide segments of a DNA sequence and determine the GC content of these DNA segments. See DNA sequence basics and GC Content details. Our goal is to sample portions (10 nucleotides) of the sequence and determine the relative content of guanine (G) and cytosine (C) to adenine (A) and thymine (T).

In [359]:
# DNA sequence we will examine, a string
seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG \
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG \
CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA \
AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA \
ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT \
AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA \
GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC \
AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT \
TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT \
GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT \
GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC" # LCBO-Prolactin precursor-Bovine
seq

Out[359]:
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC'

Question 2.1A Run the first two code cells below to see how substrings are extracted and how a character can be counted within a substring. Use the same strategy to determine the GC content, as a fraction of the total, in the first 10 nucleotides of the larger sequence above, seq.

In [360]:
# Example A
samplesize = 4
# Use this short sequence in this example
seq0 = 'GTGAAAGATT'
# How to get a substring
seq0[0:samplesize]

Out[360]:
'GTGA'

In [361]:
# Example B
# How to count the number of times 'A' appears in sequence
seq0[0:samplesize].count('A')

Out[361]:
1

In [362]:
# Count G and C in the first 10 nucleotides, then divide by the total count.
GCcount = seq[0:10].count('G') + seq[0:10].count('C')
GCfraction = (GCcount)/(seq[0:10].count('A') + seq[0:10].count('T') + GCcount)
GCfraction

Out[362]:
0.5

Lists

Below we assemble a list and append an additional entry, 0.7. This is a useful strategy when creating your function.

In [363]:
gc = []
gc.append(0.8)
gc.append(0.7)
gc

Out[363]:
[0.8, 0.7]

Fill a list with 30 random G, C, T, A nucleotides

Use iteration and np.random.choice.

In [364]:
my_sim_seq = []
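The remainder of this cell is cut off in the source, so the following is only a rough sketch of the stated idea (iteration plus `np.random.choice`), not the lab's actual solution. The names `sim_nucleotides`, `sim_seq`, `gc_fraction`, and `segment_gc`, and the fixed seed, are hypothetical choices made for illustration.

```python
import numpy as np

np.random.seed(0)  # fixed seed only so the sketch is repeatable (assumption)

nucleotides = np.array(["G", "C", "T", "A"])

# Build a list of 30 random nucleotides, one per loop iteration.
sim_nucleotides = []
for _ in range(30):
    sim_nucleotides.append(str(np.random.choice(nucleotides)))
sim_seq = "".join(sim_nucleotides)

def gc_fraction(segment):
    """Fraction of characters in segment that are G or C."""
    return (segment.count("G") + segment.count("C")) / len(segment)

# GC fraction of each consecutive 10-nucleotide segment, collected in a list.
segment_gc = []
for start in range(0, len(sim_seq), 10):
    segment_gc.append(gc_fraction(sim_seq[start:start + 10]))

print(sim_seq)
print(segment_gc)
```

Applied to the first 10 nucleotides of seq, `gc_fraction("CGTAACAAGG")` gives 0.5, which agrees with the GCfraction value computed earlier.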
