problem14.pdf

School

Georgia Institute Of Technology

Course

CS6040

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

15

11/28/23, 8:13 PM problem14 file:///Users/dannie/Downloads/pmt1-sample-solutions-su21/problem14-sample-solutions.html 1/15

Problem 14: Scraping data from "FiveThirtyEight"

There are a ton of fun interactive visualizations at the website FiveThirtyEight (http://fivethirtyeight.com). For example, consider the one that tracks the US President's approval ratings: https://projects.fivethirtyeight.com/trump-approval-ratings/

[Screenshot: the interactive approval-ratings graph]

In the graph, you can select each day with a movable cursor and get information about the approval ratings for that day.
As it turns out, this visualization is implemented in JavaScript, and all of the individual data items are embedded within the web page itself. For example, here is a 132-page PDF file, which is the source code for the web page taken on September 6, 2018: PDF file (https://cse6040.gatech.edu/datasets/538-djt-pop/2018-09-06.pdf). The raw data being rendered in the visualization starts on page 50.

Of course, that means you can use your Python-fu to try to extract this data for your own purposes! Indeed, that is your task for this problem.

Although the data in this problem comes from an HTML file with embedded JavaScript, you do not need to know anything about HTML or JavaScript to solve this problem. It is purely an exercise of rudimentary Python and computational problem solving.

Reading the raw HTML file

Let's read the raw contents of the FiveThirtyEight approval ratings page (i.e., the same contents as the PDF) into a variable named raw_html. Like the groceries problem in Notebook 2, this cell contains a bunch of code for getting the data file you need, which you can ignore.
In [1]:
    def download(url, local_file, overwrite=False):
        import os, requests
        if not os.path.exists(local_file) or overwrite:
            print("Downloading: {} ...".format(url))
            r = requests.get(url)
            with open(local_file, 'wb') as f:
                f.write(r.content)
            return True
        return False # File existed already

    def get_checksum(local_file):
        import io, hashlib
        with io.open(local_file, 'rb') as f:
            body = f.read()
        body_checksum = hashlib.md5(body).hexdigest()
        return body_checksum

    def download_or_load_locally(file, local_dir="", url_base=None, checksum=None):
        if url_base is None:
            url_base = "https://cse6040.gatech.edu/datasets/"
        local_file = "{}{}".format(local_dir, file)
        remote_url = "{}{}".format(url_base, file)
        download(remote_url, local_file)
        if checksum is not None:
            body_checksum = get_checksum(local_file)
            assert body_checksum == checksum, \
                "Downloaded file '{}' has incorrect checksum: '{}' instead of '{}'".format(local_file, body_checksum, checksum)
        print("'{}' is ready!".format(file))

    def on_vocareum():
        import os
        return os.path.exists('.voc')

    if on_vocareum():
        URL_BASE = None
        DATA_PATH = "./resource/asnlib/publicdata/538-djt-pop/"
    else:
        URL_BASE = "https://cse6040.gatech.edu/datasets/538-djt-pop/"
        DATA_PATH = ""
    datasets = {'2018-09-06.html': '291a7c1cbf15575a48b0be8d77b7a1d6'}

    for filename, checksum in datasets.items():
        download_or_load_locally(filename, url_base=URL_BASE, local_dir=DATA_PATH, checksum=checksum)

    with open('{}{}'.format(DATA_PATH, '2018-09-06.html')) as fp:
        raw_html = fp.read()

    print("\n(All data appears to be ready.)")
'2018-09-06.html' is ready!

(All data appears to be ready.)

File snippets. Run the following code cell. It takes the raw_html string and prints the substring just around the start of the raw data you'll need, i.e., starting at page 50 of the PDF:

In [2]:
    sample_offset, sample_len = 69950, 1500
    print(raw_html[sample_offset:sample_offset+sample_len])

thPrefix="/trump-approval-ratings/"; var subgroup="All polls"; var showMoreCutoff=5; var approval=[{"date":"2017-01-23","future":false,"subgroup":"All polls","approve_estimate":"45.46693","approve_hi":"50.88971","approve_lo":"40.04416","disapprove_estimate":"41.26452","disapprove_hi":"46.68729","disapprove_lo":"35.84175"},{"date":"2017-01-24","future":false,"subgroup":"All polls","approve_estimate":"45.44264","approve_hi":"50.82922","approve_lo":"40.05606","disapprove_estimate":"41.87849","disapprove_hi":"47.26508","disapprove_lo":"36.49191"},{"date":"2017-01-25","future":false,"subgroup":"All polls","approve_estimate":"47.76497","approve_hi":"52.66397","approve_lo":"42.86596","disapprove_estimate":"42.52911","disapprove_hi":"47.42811","disapprove_lo":"37.63011"},{"date":"2017-01-26","future":false,"subgroup":"All polls","approve_estimate":"44.37598","approve_hi":"48.93261","approve_lo":"39.81936","disapprove_estimate":"41.06081","disapprove_hi":"45.61743","disapprove_lo":"36.50418"},{"date":"2017-01-27","future":false,"subgroup":"All polls","approve_estimate":"44.13586","approve_hi":"48.70494","approve_lo":"39.56679","disapprove_estimate":"41.67268","disapprove_hi":"46.24175","disapprove_lo":"37.1036"},{"date":"2017-01-28","future":false,"subgroup":"All polls","approve_estimate":"43.87527","approve_hi":"48.46821","approve_lo":"39.28233","disapprove_estimate":"41.91362","disapprove_hi":"46.50656","disapprove_lo":"37.32067"},{"date":"2017-01-29","future":false,"subgroup":"All

Run the following code cell to see the end of the raw data region.
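The offset 69950 used above was picked by hand. As a minimal sketch of how such a window could be located programmatically, the following searches for a marker with `str.find` and slices around it. The names `show_context` and `toy_html` are hypothetical and not part of the notebook; the toy string only stands in for the real page contents.

```python
def show_context(text, marker, before=10, length=60):
    """Return a window of `text` around the first occurrence of `marker`.

    Returns an empty string if `marker` does not occur in `text`.
    """
    pos = text.find(marker)  # str.find returns -1 when the marker is absent
    if pos < 0:
        return ''
    start = max(0, pos - before)  # do not slice before the start of the string
    return text[start:start + length]

# Hypothetical stand-in for the real page contents (not the actual file).
toy_html = 'var subgroup="All polls"; var approval=[{"date":"2017-01-23"}]; </script>'

print(show_context(toy_html, 'var approval=[', before=5, length=30))
```

On the real data, one would pass `raw_html` instead of `toy_html` and a larger `length`, such as the 1500 used in the cell above.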
In [3]:
    sample_end = 257500
    print(raw_html[sample_end:sample_end+sample_len])

","approve_lo":"29.24131","disapprove_estimate":"51.94407","disapprove_hi":"63.94288","disapprove_lo":"39.94526"},{"date":"2019-05-10","future":true,"subgroup":"All polls","approve_estimate":"41.47093","approve_hi":"53.72246","approve_lo":"29.2194","disapprove_estimate":"51.94225","disapprove_hi":"63.96438","disapprove_lo":"39.92012"},{"date":"2019-05-11","future":true,"subgroup":"All polls","approve_estimate":"41.4719","approve_hi":"53.74633","approve_lo":"29.19748","disapprove_estimate":"51.94044","disapprove_hi":"63.98589","disapprove_lo":"39.895"},{"date":"2019-05-12","future":true,"subgroup":"All polls","approve_estimate":"41.47285","approve_hi":"53.77016","approve_lo":"29.17555","disapprove_estimate":"51.93866","disapprove_hi":"64.0074","disapprove_lo":"39.86993"},{"date":"2019-05-13","future":true,"subgroup":"All polls","approve_estimate":"41.47378","approve_hi":"53.79396","approve_lo":"29.15361","disapprove_estimate":"51.9369","disapprove_hi":"64.02892","disapprove_lo":"39.84487"},{"date":"2019-05-14","future":true,"subgroup":"All polls","approve_estimate":"41.47469","approve_hi":"53.81773","approve_lo":"29.13165","disapprove_estimate":"51.93515","disapprove_hi":"64.05045","disapprove_lo":"39.81984"}]; </script> <div class="container"> <div id="footer"> <div class="notes"> <p> When the dates of tracking polls from the same pollster overlap, only the most recent version is shown. </p> </div> <div class="additional-credits"> <p>

Please make the following observations about the file snippets shown above:

The raw data of approval ratings begins with the text, 'var approval=[' and ends with a closing square bracket, ']'. No other square brackets appear between these two.

Each "data point" or "data record" is encoded in JavaScript Object Notation (JSON), which is essentially the same as a Python dictionary. That is, it is enclosed in curly brackets, {...}, and contains a number of key-value pairs. These include the date ("date":"yyyy-mm-dd"), approval and disapproval rating estimates ("approve_estimate":"45.46693" and "disapprove_estimate":"41.26452"), as well as upper and lower error bounds ("..._hi" and "..._lo"). The estimates correspond to the green (approval) and orange (disapproval) lines, and the error bounds form the shaded regions around those lines.

Each data record includes a key named "future". That's because FiveThirtyEight has projected the ratings into the future, so some records correspond to observed values ("future":false) while others correspond to extrapolated values ("future":true).

In addition, for the exercises below, you may assume the data records are encoded in the same way, e.g., the fields appear in the same order and there are no variations in punctuation or whitespace from what you see in the above snippets.
Your task: Extracting the approval ratings

Exercise 0 (1 point). Recall that the data begins with 'var approval=[...' and ends with a closing square bracket, ']'. Complete the function, extract_approval_raw(html), below. The input variable, html, is a string corresponding to the raw HTML file. Your function should return the substring beginning immediately after the opening square bracket and up to, but excluding, the closing square bracket. It should return exactly that substring from the file, and should not otherwise modify it.

While you don't have to use regular expressions for this problem, if you wish to, observe that the cell below imports the re module.

In [4]:
    import re

    def extract_approval_raw(html):
        assert isinstance(html, str), "`html` is not a string."
        ### BEGIN SOLUTION
        # Capture everything between 'var approval=[' and the closing '];'.
        # [^\]]* matches any run of characters that are not ']'.
        match = re.search(r'var\s+approval\s*=\s*\[([^\]]*)\];', html)
        if match:
            return match.group(1)
        return ''
        ### END SOLUTION

    raw_data = extract_approval_raw(raw_html)
    print("type(raw_data) == {} (should be a string!)\n".format(type(raw_data)))
    print("=== First and last 300 characters ===\n{}\n...\n{}".format(raw_data[:300], raw_data[-300:]))

type(raw_data) == <class 'str'> (should be a string!)

=== First and last 300 characters ===
{"date":"2017-01-23","future":false,"subgroup":"All polls","approve_estimate":"45.46693","approve_hi":"50.88971","approve_lo":"40.04416","disapprove_estimate":"41.26452","disapprove_hi":"46.68729","disapprove_lo":"35.84175"},{"date":"2017-01-24","future":false,"subgroup":"All polls","approve_estimat
...
e_estimate":"51.9369","disapprove_hi":"64.02892","disapprove_lo":"39.84487"},{"date":"2019-05-14","future":true,"subgroup":"All polls","approve_estimate":"41.47469","approve_hi":"53.81773","approve_lo":"29.13165","disapprove_estimate":"51.93515","disapprove_hi":"64.05045","disapprove_lo":"39.81984"}
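Since the exercise does not require regular expressions, here is a sketch of one possible alternative built on `str.find`, followed by a sanity check that parses the extracted text with the standard `json` module. This is not the notebook's solution: the name `extract_approval_raw_find` and the abridged `toy_html` string are hypothetical, and the approach relies on the stated assumption that the marker text appears exactly as 'var approval=[' with no square brackets inside the data.

```python
import json

def extract_approval_raw_find(html):
    """Return the text between 'var approval=[' and the next ']'.

    Returns '' if either delimiter is missing.
    """
    marker = 'var approval=['
    start = html.find(marker)
    if start < 0:
        return ''
    start += len(marker)         # index of the first character after '['
    end = html.find(']', start)  # first ']' at or after `start`
    if end < 0:
        return ''
    return html[start:end]

# Hypothetical, abridged stand-in for the real page (only three keys per record).
toy_html = ('var showMoreCutoff=5; var approval=['
            '{"date":"2017-01-23","future":false,"approve_estimate":"45.46693"},'
            '{"date":"2019-05-14","future":true,"approve_estimate":"41.47469"}'
            ']; </script>')

raw = extract_approval_raw_find(toy_html)

# Restoring the brackets makes the text a JSON array, which json.loads
# converts into a list of Python dictionaries.
records = json.loads('[' + raw + ']')
for rec in records:
    # JSON false/true become Python False/True; the quoted numbers stay
    # strings, so convert them explicitly when a numeric value is needed.
    print(rec['date'], rec['future'], float(rec['approve_estimate']))
```

Unlike the regular expression above, this version does not tolerate extra whitespace around the equals sign and does not check for the trailing semicolon, so it is only appropriate under the assumptions given in the problem.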