hw4_440.docx
School
Purdue University
Course
440
Subject
Computer Science
Date
Feb 20, 2024
Type
docx
Pages
11
Uploaded by BailiffKangaroo6144
Name: Ankush Maheshwari
Student ID: 0032646352
Purdue University (Spring 2023)
CS44000: Large-scale Data Analytics
Homework 1
IMPORTANT:
Upload a pdf file with answers to Gradescope.
Please use either the LaTeX template or the Word template to write down your answers and generate a pdf file.
o Latex template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.tex
o Word template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.docx
Problem | Score
1 |
2 |
3 |
4 |
5 |
Total |
Problem 1
1a) We can solve this problem with the Hadoop MapReduce approach. The map function reads each document and emits one key-value pair per word, where the key is the word and the value is the ID of the document it appears in. This is done for every word in every document. The framework then sorts the key-value pairs by word and groups them, so that all document IDs associated with a word arrive together.
The reduce function receives the grouped key-value pairs, combines the unique values (document IDs) for each word key, and emits the final output. We use a set to ensure that the document IDs associated with each word are unique.
1b)
Map(String docID, String content):
    words = content.split()                  # Split document content into words
    for word in words:
        emit(word, docID)                    # Emit (word, documentID) as key-value pair
Reduce(String word, List<String> documentIDs):
    unique_docs = Set()                      # Use a set to store unique document IDs
    for docID in documentIDs:
        unique_docs.add(docID)               # Add document ID to the set
    format_docs = join(unique_docs, ', ')    # Add comma between docs for output format
    emit(word + ': ' + format_docs)          # Emit word and set of document IDs
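The pseudocode above can be tried out locally with a small Python sketch. This is only a single-process simulation, not Hadoop code: the helper names (map_fn, shuffle, reduce_fn, build_inverted_index) and the sample documents are assumptions made for illustration, and shuffle stands in for the sort-and-group phase that the framework would normally perform.

```python
from collections import defaultdict


def map_fn(doc_id, content):
    """Map step: yield one (word, doc_id) pair per word occurrence."""
    for word in content.split():
        yield word, doc_id


def shuffle(pairs):
    """Stand-in for the framework's sort-and-group phase (hypothetical helper)."""
    grouped = defaultdict(list)
    for word, doc_id in pairs:
        grouped[word].append(doc_id)
    return dict(sorted(grouped.items()))


def reduce_fn(word, doc_ids):
    """Reduce step: keep each document ID once per word."""
    return word, sorted(set(doc_ids))


def build_inverted_index(documents):
    """Run map, shuffle, and reduce in sequence on a dict of {doc_id: content}."""
    pairs = [pair
             for doc_id, content in documents.items()
             for pair in map_fn(doc_id, content)]
    return dict(reduce_fn(word, ids) for word, ids in shuffle(pairs).items())


# Made-up sample input: "big" appears twice in d1 but is listed once.
docs = {"d1": "big data big ideas", "d2": "data pipelines"}
index = build_inverted_index(docs)
for word, doc_ids in index.items():
    print(word + ": " + ", ".join(doc_ids))
```

Sorting the document IDs in the reducer is a choice made here so that the output is deterministic; the pseudocode leaves the order of the set unspecified.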
Problem 2
2a)
Databases:
Write-ahead Logging (WAL): Databases employ transaction logs and write-ahead logging. Every change is first written to a log file before it is applied to the actual database. After a crash, the system uses the log to replay (or roll back) operations and restore the database to a consistent state. The log is usually much smaller than the data, and there are two types of log records: redo (contains the new data) and undo (contains the old data). Depending on the DBMS, one of three approaches is used for logging: UNDO only, REDO only, or both UNDO and REDO.
Pros:
- Underpins the atomicity and durability guarantees of ACID (Atomicity, Consistency, Isolation, Durability).
- Ensures data consistency by logging transactions before applying changes.
- Allows for point-in-time recovery.
Cons:
- Overhead of maintaining logs can impact performance.
- Recovery might be slower for large databases due to log replay.
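The logging idea described above can be illustrated with a toy Python sketch of the REDO-only variant. It is not how any particular DBMS implements WAL: the class TinyWalStore and its methods are hypothetical, the log is a Python list standing in for a durable log file, and the data dictionary stands in for database pages that are lost in a crash.

```python
class TinyWalStore:
    """Toy key-value store with a redo-only write-ahead log (illustrative only)."""

    def __init__(self):
        self.log = []    # stands in for the durable log file
        self.data = {}   # stands in for database pages, lost on a crash

    def put(self, txn_id, key, value):
        # Log the new value first; the data itself is not touched yet.
        self.log.append(("SET", txn_id, key, value))

    def commit(self, txn_id):
        # The commit record reaches the log before the changes are applied.
        self.log.append(("COMMIT", txn_id))
        for record in self.log:
            if record[0] == "SET" and record[1] == txn_id:
                self.data[record[2]] = record[3]

    def crash(self):
        # Simulate losing everything that was not in the log.
        self.data = {}

    def recover(self):
        # Redo, in log order, only the writes of committed transactions.
        committed = {r[1] for r in self.log if r[0] == "COMMIT"}
        for record in self.log:
            if record[0] == "SET" and record[1] in committed:
                self.data[record[2]] = record[3]


store = TinyWalStore()
store.put(1, "x", 10)
store.commit(1)
store.put(2, "y", 20)   # transaction 2 never commits
store.crash()
store.recover()
print(store.data)       # only the committed write to "x" is restored
```

Because uncommitted writes never reach the data in this variant, recovery needs no undo records; an UNDO-based scheme would instead log old values and roll uncommitted changes back.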
Hadoop:
Replication and Redundancy: Hadoop handles failure by replicating data across multiple nodes in a cluster. HDFS (Hadoop Distributed File System) replicates data blocks across different nodes, ensuring redundancy. When a node fails, Hadoop can retrieve the data from other replicas.
Pros:
- Fault tolerance through data replication.
- No single point of failure for stored data, since each block has multiple replicas.
- Failed tasks are re-executed on other nodes, so parallel processing continues despite node failures.
Cons:
- High replication can lead to increased storage requirements.
- Less efficient for scenarios with frequent small writes due to replication overhead.
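To make the replication argument concrete, here is a small Python sketch that places block replicas on nodes and reads a block after one node has failed. It is a simplification under stated assumptions: the functions place_replicas and read_block and the node names are hypothetical, and replicas are placed at random, whereas real HDFS uses a rack-aware placement policy. The replication factor of 3 matches the HDFS default.

```python
import random


def place_replicas(block_ids, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (random placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}


def read_block(block, placement, live_nodes):
    """Return a live node that still holds a replica of the block."""
    survivors = placement[block] & set(live_nodes)
    if not survivors:
        raise IOError("all replicas of " + block + " are lost")
    return sorted(survivors)[0]


nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_replicas(["b1", "b2", "b3"], nodes)

# Simulate the failure of one node: every block keeps at least two replicas.
live = [n for n in nodes if n != "n2"]
for block in sorted(placement):
    print(block, "served by", read_block(block, placement, live))
```

The storage cost noted in the cons is visible here as well: three blocks occupy nine block slots across the cluster.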
2b)
Hadoop:
(copied from above as the same points apply here)
Replication and Redundancy: Hadoop handles failure by replicating data across multiple nodes in a cluster. HDFS (Hadoop Distributed File System) replicates data blocks across different nodes, ensuring redundancy. When a node fails, Hadoop can retrieve the data from other replicas.
Pros:
- Fault tolerance through data replication.
- No single point of failure due to data redundancy.
- Parallel processing allows for continued execution despite node failures.
Cons:
- High replication can lead to increased storage requirements.
- Less efficient for scenarios with frequent small writes due to replication overhead.
Spark:
RDD Lineage and Resilient Distributed Datasets (RDDs): Spark tracks RDD lineage, which is a directed acyclic graph (DAG) of the operations used to build each RDD. Spark stores information about how to recreate RDDs from the original data using transformations. In case of failure, Spark uses this lineage information to recompute the lost RDD partitions, which restores the data to a consistent state.
Pros:
- Provides fault tolerance by tracking the sequence of operations to rebuild RDDs.
- Allows for in-memory computation with efficient recovery.
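The lineage mechanism can be sketched in plain Python without Spark. The class LineageDataset below is a hypothetical stand-in, not Spark's RDD API: each dataset remembers only how to compute a partition from its parent, caches computed partitions in memory, and recomputes a partition from that recipe when the cached copy is lost.

```python
class LineageDataset:
    """Toy dataset that rebuilds lost partitions from its lineage (not Spark)."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self.compute = compute   # function: partition index -> list of records
        self.cache = {}          # in-memory partitions, may be lost

    def map(self, fn):
        # The child stores a recipe that refers back to this parent dataset.
        parent = self
        return LineageDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in parent.partition(i)],
        )

    def partition(self, i):
        # Recompute from lineage only if the partition is not in memory.
        if i not in self.cache:
            self.cache[i] = self.compute(i)
        return self.cache[i]

    def lose_partition(self, i):
        # Simulate a node failure that drops one in-memory partition.
        self.cache.pop(i, None)


def from_source(chunks):
    """Create a base dataset whose partitions are read from the original data."""
    return LineageDataset(len(chunks), lambda i: list(chunks[i]))


source = from_source([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)

before = squared.partition(1)
squared.lose_partition(1)        # the partition is gone from memory
after = squared.partition(1)     # rebuilt by replaying the map over the parent
print(before, after)
```

Only the lost partition is recomputed; partition 0 is never touched, which is why lineage-based recovery avoids the storage overhead of keeping full replicas.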