Lab3.html
School: Pennsylvania State University
Course: 410 (Computer Science)
Date: Feb 20, 2024
Type: html | Pages: 5
Uploaded by MateWasp883

Instructor: Professor Romit Maulik
TAs: Haomiao Ni and Songtao Liu

Lab 3: Hashtag Counting and Spark-submit in Cluster Mode

The goals of this lab are for you to be able to
- Use the RDD transformations filter and sortBy.
- Compute hashtag counts for an input data file containing tweets.
- Modify PySpark code for local mode into PySpark code for cluster mode.
- Request a cluster from ICDS, and run spark-submit in cluster mode for a big dataset.
- Obtain run-time performance for a choice of the number of output partitions for reduceByKey.
- Apply the above to compute hashtag counts for tweets related to the Boston Marathon Bombing (gathered on April 17, 2013, two days after the domestic terrorist attack).

Total Number of Exercises: 5
- Exercise 1: 5 points
- Exercise 2: 10 points
- Exercise 3: 10 points
- Exercise 4: 15 points
- Exercise 5: 10 points
Total Points: 50 points

Data for Lab 3
- sampled_4_17_tweets.csv: A random sample of a small set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in the local mode.
- BMB_4_17_tweets.csv: The entire set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in the cluster mode.

Like Lab 2, download the data from Canvas into a directory for the lab (e.g., Lab3) under your home directory.

Items to submit for Lab 3
- Completed Jupyter Notebook (HTML format)
- .py file used for cluster mode
- log file for a successful run in the cluster mode
- a screenshot of the ls -al command in the output directory for a successful run in the cluster mode
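The hashtag-counting pipeline named in the goals above can be sketched as follows. This is a minimal illustration, not the lab's official solution; the helper names (is_hashtag, hashtag_counts) and the sample tweets are our own. The logic is written in plain Python so it runs without a cluster, and a comment shows how the same functions would plug into the RDD transformations.

```python
# Sketch of the hashtag-counting logic (assumed helper names, not the lab's
# official solution). With an RDD of tweet lines, the equivalent chain would be:
#   tweets_RDD.flatMap(str.split)
#             .filter(is_hashtag)
#             .map(lambda tag: (tag, 1))
#             .reduceByKey(lambda a, b: a + b)
#             .sortBy(lambda kv: kv[1], ascending=False)

def is_hashtag(token):
    """A token counts as a hashtag if it starts with '#' and has more text."""
    return token.startswith("#") and len(token) > 1

def hashtag_counts(tweets):
    """Plain-Python equivalent of the flatMap/filter/reduceByKey/sortBy chain."""
    counts = {}
    for tweet in tweets:
        for token in tweet.split():          # flatMap: tweets -> tokens
            if is_hashtag(token):            # filter: keep hashtags only
                counts[token] = counts.get(token, 0) + 1   # reduceByKey
    # sortBy count, descending, like the RDD version
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

sample = [
    "prayers for #Boston tonight",
    "#Boston #BostonMarathon stay strong",
    "no hashtags here",
]
print(hashtag_counts(sample))  # [('#Boston', 2), ('#BostonMarathon', 1)]
```

The same per-token predicates carry over unchanged to the RDD version; only the driver loop is replaced by Spark transformations.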
Due: 11:59 PM, Sept 10, 2023. 5 bonus points if you submit by 11:59 PM, Sept 8, 2023.

Like Lab 2, the first thing we need to do in each Jupyter Notebook running PySpark is to import pyspark.

In [1]: import pyspark

In [2]: from pyspark import SparkContext

Like Lab 2, we create a SparkContext object.

Note: We use "local" as the master parameter for SparkContext in this notebook so that we can run and debug it on the ICDS Jupyter Server. However, we need to remove master="local" later when you convert this notebook into a .py file for running it in the cluster mode.

In [3]: sc = SparkContext(master="local", appName="Lab3")
        sc

Out[3]: SparkContext
        Spark UI
        Version: v3.4.1
        Master: local
        AppName: Lab3

Exercise 1 (5 points)
Add your name below.

Answer for Exercise 1
Your Name: Ruthvik Uttarala

Exercise 2 (10 points)
Complete the path and run the code below to read the file "sampled_4_17_tweets.csv" from your Lab3 directory. For cluster-mode execution, change this path to "BMB_4_17_tweets.csv" for a big-data cluster job.

In [4]: tweets_RDD = sc.textFile("/storage/home/rpu5040/Lab3/BMB_4_17_tweets.csv")
        tweets_RDD

Out[4]: /storage/home/rpu5040/Lab3/sampled_4_17_tweets.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0