
Name: Ankush Maheshwari
Student ID: 0032646352
Purdue University (Spring 2023)
CS44000: Large-scale Data Analytics
Homework 1

IMPORTANT: Upload a pdf file with answers to Gradescope. Please use either the LaTeX template or the Word template to write down your answers and generate a pdf file.
o LaTeX template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.tex
o Word template: https://www.cs.purdue.edu/homes/csjgwang/CS440/template.docx

Problem:  1  2  3  4  5  Total
Score:
Problem 1

1a) We can solve this problem using the Hadoop MapReduce approach. The map function reads each document and emits key-value pairs, where the key is a word and the value is the ID of the document it belongs to; this is done for every word in every document. The framework then sorts the key-value pairs by word and groups them together, so that all document IDs associated with a word arrive at the same reducer. The reduce function receives the grouped key-value pairs, collects the unique document IDs for each word key, and emits the final output. We use a set to ensure that the document IDs associated with each word are unique.

1b)
Map(String docID, String content):
    words = content.split()                # Split document content into words
    for word in words:
        emit(word, docID)                  # Emit (word, documentID) as a key-value pair

Reduce(String word, List<documentIDs>):
    unique_docs = Set()                    # Use a set to store unique document IDs
    for docID in documentIDs:
        unique_docs.add(docID)             # Add document ID to the set
    format_docs = join(unique_docs, ', ')  # Comma-separate the IDs for the output format
    emit(word + ': ' + format_docs)        # Emit the word and its unique document IDs
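For concreteness, here is a minimal, runnable Python sketch of the same inverted-index logic, simulating the map, shuffle/sort, and reduce phases in memory. The function names and sample documents are illustrative assumptions, not part of the assignment or of Hadoop itself.

from collections import defaultdict

def map_phase(doc_id, content):
    # Emit (word, doc_id) for every word in the document.
    for word in content.split():
        yield (word, doc_id)

def shuffle(pairs):
    # Group values by key, as the framework would between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, doc_ids):
    # Deduplicate document IDs with a set and emit the formatted line.
    unique_docs = sorted(set(doc_ids))
    return word + ': ' + ', '.join(unique_docs)

# Illustrative input: doc_id -> content.
docs = {'d1': 'the cat sat', 'd2': 'the dog sat'}
pairs = [p for doc_id, content in docs.items() for p in map_phase(doc_id, content)]
for word, doc_ids in sorted(shuffle(pairs).items()):
    print(reduce_phase(word, doc_ids))
# Output: cat: d1 / dog: d2 / sat: d1, d2 / the: d1, d2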
Problem 2

2a) Databases: Write-ahead logging (WAL): Databases employ transaction logs and a write-ahead logging mechanism. When changes occur, they are first written to a log file before being applied to the actual database. In case of a crash, the system can use the log to replay the operations and restore the database to a consistent state. The log is usually much smaller than the data, and there are two kinds of log records: redo (containing the new data) and undo (containing the old data). Depending on the DBMS, logging can follow one of three approaches: UNDO only, REDO only, or both UNDO and REDO.

Pros:
- Provides the ACID (Atomicity, Consistency, Isolation, Durability) properties.
- Ensures data consistency by logging transactions before applying changes.
- Allows for point-in-time recovery.

Cons:
- The overhead of maintaining logs can impact performance.
- Recovery might be slow for large databases due to log replay.
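To make the mechanism concrete, below is a minimal Python sketch of redo-only write-ahead logging over an in-memory key-value store. The log format and the names (wal.log, put, recover) are illustrative assumptions, not a real DBMS implementation.

import os

LOG_PATH = 'wal.log'  # Hypothetical log file name.

def put(store, key, value):
    # Write-ahead rule: append the redo record and flush it to disk
    # BEFORE mutating the actual data store.
    with open(LOG_PATH, 'a') as log:
        log.write(f'SET {key} {value}\n')
        log.flush()
        os.fsync(log.fileno())
    store[key] = value

def recover():
    # After a crash, replay the redo log to rebuild a consistent store.
    store = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                _, key, value = line.split(' ', 2)
                store[key] = value.rstrip('\n')
    return store

store = {}
put(store, 'x', '1')
put(store, 'x', '2')
print(recover())  # {'x': '2'} -- log replay restores the latest committed state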
Hadoop: Replication and redundancy: Hadoop handles failure by replicating data across multiple nodes in a cluster. HDFS (Hadoop Distributed File System) replicates data blocks across different nodes, ensuring redundancy. When a node fails, Hadoop can retrieve the data from other replicas.

Pros:
- Fault tolerance through data replication.
- No single point of failure, thanks to data redundancy.
- Parallel processing allows execution to continue despite node failures.

Cons:
- A high replication factor leads to increased storage requirements.
- Less efficient for workloads with frequent small writes, due to replication overhead.

2b) Hadoop: Replication and redundancy, as described in 2a: HDFS replicates data blocks across different nodes, and when a node fails, Hadoop retrieves the data from other replicas. The same pros (fault tolerance, no single point of failure, continued parallel execution) and cons (increased storage requirements, overhead for frequent small writes) apply here.

Spark: RDD lineage and Resilient Distributed Datasets (RDDs): Spark employs RDD lineage, a directed acyclic graph (DAG) of operations. Spark stores information about how to recreate each RDD from the original data using transformations. In case of failure, Spark uses this lineage information to recompute the lost RDD partitions, restoring the data to a consistent state.

Pros:
- Provides fault tolerance by tracking the sequence of operations needed to rebuild RDDs.
- Allows for in-memory computation with efficient recovery.
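As a concrete illustration, the following PySpark sketch builds a small RDD lineage and prints it; if a partition of squared were lost, Spark would recompute it from nums via the recorded transformations. The local master and sample data are assumptions for illustration, not part of the homework.

from pyspark import SparkContext

sc = SparkContext('local[2]', 'lineage-demo')  # Local mode, illustrative app name.

nums = sc.parallelize(range(10), numSlices=2)   # Base RDD.
evens = nums.filter(lambda x: x % 2 == 0)       # Transformation 1 (recorded in lineage).
squared = evens.map(lambda x: x * x)            # Transformation 2 (recorded in lineage).

# toDebugString() shows the lineage (DAG) Spark would use to recompute
# lost partitions: squared <- map <- filter <- parallelize.
print(squared.toDebugString().decode())
print(squared.collect())  # [0, 4, 16, 36, 64]

sc.stop()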