Lab3.html
School: Pennsylvania State University
Course: 410 (Computer Science)
Date: Feb 20, 2024
Type: html | Pages: 5
Uploaded by MateWasp883

Instructor: Professor Romit Maulik
TAs: Haomiao Ni and Songtao Liu

Lab 3: Hashtag Counting and Spark-submit in Cluster Mode

The goals of this lab are for you to be able to
- Use the RDD transformations filter and sortBy.
- Compute hashtag counts for an input data file containing tweets.
- Modify PySpark code for local mode into PySpark code for cluster mode.
- Request a cluster from ICDS, and run spark-submit in cluster mode for a big dataset.
- Obtain run-time performance for a choice of the number of output partitions for reduceByKey.
- Apply the above to compute hashtag counts for tweets related to the Boston Marathon Bombing (gathered on April 17, 2013, two days after the domestic terrorist attack).

Total Number of Exercises: 5
- Exercise 1: 5 points
- Exercise 2: 10 points
- Exercise 3: 10 points
- Exercise 4: 15 points
- Exercise 5: 10 points
Total Points: 50 points

Data for Lab 3
- sampled_4_17_tweets.csv: A random sample of a small set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in the local mode.
- BMB_4_17_tweets.csv: The entire set of tweets regarding the Boston Marathon Bombing on April 17, 2013. This data is used in the cluster mode.

Like Lab 2, download the data from Canvas into a directory for the lab (e.g., Lab3) under your home directory.

Items to submit for Lab 3
- Completed Jupyter Notebook (HTML format)
- .py file used for cluster mode
- log file for a successful run in the cluster mode
- a screenshot of the ls -al command in the output directory for a successful run in the cluster mode
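The hashtag-counting pipeline named in the goals above can be sketched as follows. This is a minimal illustration, not the lab's official solution; the helper names (is_hashtag, hashtag_counts) and the sample tweets are our own. The logic is written in plain Python so it runs without a cluster, and a comment shows how the same functions would plug into the RDD transformations.

```python
# Sketch of the hashtag-counting logic (assumed helper names, not the lab's
# official solution). With an RDD of tweet lines, the equivalent chain would be:
#   tweets_RDD.flatMap(str.split)
#             .filter(is_hashtag)
#             .map(lambda tag: (tag, 1))
#             .reduceByKey(lambda a, b: a + b)
#             .sortBy(lambda kv: kv[1], ascending=False)

def is_hashtag(token):
    """A token counts as a hashtag if it starts with '#' and has more text."""
    return token.startswith("#") and len(token) > 1

def hashtag_counts(tweets):
    """Plain-Python equivalent of the flatMap/filter/reduceByKey/sortBy chain."""
    counts = {}
    for tweet in tweets:
        for token in tweet.split():          # flatMap: tweets -> tokens
            if is_hashtag(token):            # filter: keep hashtags only
                counts[token] = counts.get(token, 0) + 1   # reduceByKey
    # sortBy count, descending, like the RDD version
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

sample = [
    "prayers for #Boston tonight",
    "#Boston #BostonMarathon stay strong",
    "no hashtags here",
]
print(hashtag_counts(sample))  # [('#Boston', 2), ('#BostonMarathon', 1)]
```

The same per-token predicates carry over unchanged to the RDD version; only the driver loop is replaced by Spark transformations.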
Due: 11:59 PM, Sept 10, 2023. 5 bonus points if you submit by 11:59 PM, Sept 8, 2023.

Like Lab 2, the first thing we need to do in each Jupyter Notebook running PySpark is to import pyspark.

In [1]: import pyspark

In [2]: from pyspark import SparkContext

Like Lab 2, we create a SparkContext object.

Note: We use "local" as the master parameter for SparkContext in this notebook so that we can run and debug it on the ICDS Jupyter Server. However, we need to remove master="local" later when you convert this notebook into a .py file for running it in the cluster mode.

In [3]: sc = SparkContext(master="local", appName="Lab3")
        sc

Out[3]: SparkContext
        Spark UI
        Version: v3.4.1
        Master: local
        AppName: Lab3

Exercise 1 (5 points)
Add your name below.

Answer for Exercise 1
Your Name: Ruthvik Uttarala

Exercise 2 (10 points)
Complete the path and run the code below to read the file "sampled_4_17_tweets.csv" from your Lab3 directory. For cluster-mode execution, change this path to "BMB_4_17_tweets.csv" for a big-data cluster job.

In [4]: tweets_RDD = sc.textFile("/storage/home/rpu5040/Lab3/BMB_4_17_tweets.csv")
        tweets_RDD

Out[4]: /storage/home/rpu5040/Lab3/sampled_4_17_tweets.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0