Sample Final Exam
CS 217, Computer Science
University of California, Riverside
1. Matrix Multiplication
Assume a tiled matrix multiplication kernel that handles boundary conditions, using 32 × 32 tiles to process square 1,000 × 1,000 matrices.

a. How many thread blocks are launched?
ceil(1000/32) = 32 blocks in each dimension, so 32 × 32 = 1,024 thread blocks.

b. How many warps, in the entire kernel, will have control divergence due to handling boundary conditions when writing to the output matrix?
1,000 warps.

c. What are the benefits of tiling compared to a simple matrix multiplication? Please keep your answer short and concise.
Tiling uses shared memory to reduce the number of global memory accesses, resulting in better performance.
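For reference, here is a minimal sketch of the kind of tiled kernel these questions assume (TILE_WIDTH, matrixMulTiled, and the row-major layout are illustrative assumptions, not the exam's own code):

#define TILE_WIDTH 32

// Tiled matrix multiplication P = M * N for Width x Width square matrices.
// Boundary checks handle a Width (e.g., 1000) that is not a multiple of the tile size.
__global__ void matrixMulTiled(const float *M, const float *N, float *P, int Width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;   // ceil(Width / TILE_WIDTH)
    for (int t = 0; t < numTiles; ++t) {
        // Each thread loads one element of the M tile and one of the N tile into
        // shared memory, guarding against reads past the matrix edge.
        int mCol = t * TILE_WIDTH + threadIdx.x;
        int nRow = t * TILE_WIDTH + threadIdx.y;
        Ms[threadIdx.y][threadIdx.x] = (row < Width && mCol < Width) ? M[row * Width + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] = (nRow < Width && col < Width) ? N[nRow * Width + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }

    // The boundary check on this store is where the divergence counted in part (b)
    // occurs: only warps straddling the right edge of the output mix writing and
    // non-writing threads.
    if (row < Width && col < Width)
        P[row * Width + col] = acc;
}

Launched as matrixMulTiled<<<dim3(32, 32), dim3(32, 32)>>>(M, N, P, 1000), this gives the 32 × 32 = 1,024 thread blocks from part (a).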
1. Matrix Multiplication [20 points total]

a. [5 points] For an untiled matrix-matrix multiplication kernel, assume we have an M and N matrix of size 32 × 32. How many total global memory loads are issued to matrix M?
32 × 32 × 32 = 32,768. Every element of M is accessed by 32 threads.

b. [4 points] For the output matrix P, how many stores to global memory are issued?
32 × 32 = 1,024

c. [5 points] For our tiled matrix-matrix multiplication kernel, if we use a 32 × 32 tile, how many global memory loads are issued to matrix M per tile?
32 × 32 = 1,024

d. [6 points] For the tiled single-precision (32-bit) matrix multiplication kernel, assume that the tile size is 32 × 32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one M-matrix tile by a thread block? Assume that the tile being loaded is completely within the range of the M-matrix.
(32 × 32 × 4 bytes) / 128 bytes = 32 DRAM bursts
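For comparison, a sketch of the untiled kernel these counts describe (kernel and variable names are illustrative):

// Untiled (naive) matrix multiplication: one thread per output element.
__global__ void matrixMulNaive(const float *M, const float *N, float *P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float acc = 0.0f;
        for (int k = 0; k < Width; ++k)
            acc += M[row * Width + k] * N[k * Width + col]; // one global load from M and one from N per iteration
        P[row * Width + col] = acc;                         // one global store per thread
    }
}

With Width = 32, each of the 32 × 32 threads issues 32 loads from M, giving the 32 × 32 × 32 = 32,768 loads in part (a), and one store to P, giving the 1,024 stores in part (b).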
2. [20 points] For each of the following, explain how it could harm performance and describe possible ways the program can be modified to reduce this effect. Please be specific.

a. [10 points] The application needs to access global memory to get one value for every operation.
How this harms performance: Global memory accesses are slow; they can stall computation and make memory bandwidth the bottleneck.
Technique/change that could reduce this effect: Tiling, i.e., using shared memory to cache data on chip and maximize reuse.
2. Atomics [8 points]
Atomic operations perform a read-modify-write sequence as a single, indivisible operation. In the following atomic instruction: atomicAdd(&Sum, 1)

What is the instruction:
Reading? The current value of Sum.
Modifying? Computing Sum + 1.
Writing? Writing Sum + 1 back to Sum.

Why do we need atomic operations? (Keep answer to 1-2 sentences.)
To avoid race conditions when multiple threads need to read and update the same memory location.
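A minimal sketch of that instruction in context (the kernel and counter names are illustrative):

// Every thread adds 1 to the shared counter *Sum. atomicAdd performs the read,
// the add, and the write-back as one indivisible operation, so no update is
// lost even when many threads hit the same address at once.
__global__ void countThreads(int *Sum) {
    atomicAdd(Sum, 1);
}

Launched with 1,024 threads, *Sum ends up exactly 1,024; with a plain *Sum = *Sum + 1 the final value could be smaller because concurrent read-modify-write sequences overwrite each other.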
3. Race Conditions [10 points]
Assume I have two threads running the following instructions with Mem[x] = 2 initially:

thread1:               thread2:
Old <- Mem[x]          Old <- Mem[x]
New <- Old + 5         New <- Old + 2
Mem[x] <- New          Mem[x] <- New

a. What are the possible values of Old?
2, 4, 7

b. What are the possible values of Mem[x]?
2, 4, 7, 9

c. Assume the value to be incremented (5 and 2) is stored in a variable, value. What is the corresponding atomicAdd call to ensure this example works?
atomicAdd(&(Mem[x]), value);
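In CUDA terms, the fix replaces each thread's separate read / add / write with one atomic call. A minimal sketch, assuming Mem is a global int array and value holds each thread's increment (names are illustrative):

__global__ void safeIncrement(int *Mem, int x, int value) {
    // Without the atomic, two threads could both read the old Mem[x] and one
    // of the two updates would be lost, as in the interleavings above.
    atomicAdd(&Mem[x], value);
}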
Related Questions
Given an array A[0..n-1], write the following CUDA program WITHOUT USING SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
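As a rough illustration of the "compare and exchange" pattern these questions ask about, here is a hedged sketch of one odd-even exchange pass over global memory (assuming an odd-even transposition sort is intended; the kernel name and parameters are illustrative):

// One pass: thread i compares the pair starting at index 2*i + phase (phase is 0 or 1)
// and swaps the two items if they are out of order. Only global memory is used.
__global__ void oddEvenPass(int *A, int n, int phase) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = 2 * i + phase;
    if (idx + 1 < n && A[idx] > A[idx + 1]) {
        int tmp = A[idx];
        A[idx] = A[idx + 1];
        A[idx + 1] = tmp;
    }
}

For part (a), a single block of n/2 threads can cover every pair; for part (b), the grid is sized so all blocks together cover n/2 pairs, and the host relaunches the kernel for n passes, alternating phase between 0 and 1, since blocks cannot synchronize globally inside one kernel launch.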
Given an array A[0..n-1], write the following CUDA program using SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
Given an array A[0..n-1], write the following CUDA program:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
Question 3: [9 marks]
Suppose we have an array of size n which stores random numbers. Also, we have n threads, each of which should sum the numbers stored in the array positions 0 to m-1, where m is the number of that thread, and then store the result in position m-1.

Value: 215, 25, 17, 1001, …, 5, -150
Index:   0,   1,  2,    3, …, n-2, n-1

Apparently the threads can run independently, but when the array is inspected after execution the results are incorrect.
a) Why were the results incorrect? What is this problem called? (4 marks)
b) How can such a case be handled in order to always guarantee correct results? Suggest the most appropriate solution for this case and justify your choice. (5 marks)
Note: No code or mathematical examples or solutions are required in your answer.
Please help me with this operating systems principles homework
Topic: OpenMP #pragma omp parallel for and #pragma omp master (Distributed and Parallel Computing Lab)
The master construct denotes a block that is only executed by the master thread. Note that there is no synchronization (implicit barrier) for the master construct. The other threads will skip over this block and continue processing without waiting for the master thread. Write a program that computes the average of a large array using a parallel for construct. While it is running with the #pragma omp parallel for construct, also use a master construct (outside the for loop) to keep track of how many iterations have been executed and print out a progress report.
Q. The following code is what I have written so far, but the average value at the end comes out as zero, and the number of iterations was only one, which I don't think reflects what this program is supposed to do. Please modify my current code to meet the criteria explained above.
#include <omp.h>#include…
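Not the asker's code (which is cut off above), but a minimal sketch of the parallel-for / master combination being described, with illustrative names; the master block here reports once rather than continuously, and the iteration count is kept in a shared atomic counter that a progress report could poll:

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000000;
    static double a[1000000];
    for (int i = 0; i < N; ++i) a[i] = i;    // sample data

    double sum = 0.0;
    int done = 0;                            // iterations completed so far

    #pragma omp parallel
    {
        #pragma omp master
        std::printf("progress is tracked by the master thread (%d threads total)\n",
                    omp_get_num_threads());

        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; ++i) {
            sum += a[i];
            #pragma omp atomic
            done++;                          // shared progress counter
        }
    }
    std::printf("average = %f after %d iterations\n", sum / N, done);
    return 0;
}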
To sort a vector of 1000 random integers:
b) Parallelize it using OpenMP for 2, 4, 8, 16, and 32 threads. Report the speedup over the serial implementation.
Given an array A[0..n-1], write the following CUDA program without using shared memory:
Each thread compares and exchanges two items in each iteration, but use shared memory and multiple blocks.
Given an array A[0..n-1], write the following CUDA program using SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but use shared memory and multiple blocks.
The heart of the recent hit game SimAquarium is a tight loop that calculates the average position of 256 algae. You are evaluating its cache performance on a machine with a 1,024-byte direct-mapped data cache with 16-byte blocks (B = 16). You are given the following definitions:

struct algae_position {
    int x;
    int y;
};

struct algae_position grid[16][16];
int total_x = 0, total_y = 0;
int i, j;

You should also assume the following:
- sizeof(int) = 4.
- grid begins at memory address 0.
- The cache is initially empty.
- The only memory accesses are to the entries of the array grid. Variables i, j, total_x, and total_y are stored in registers.

Determine the cache performance for the following code:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
    }
}

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_y += grid[i][j].y;
    }
}

A. What is the total number of reads?
B. What is the total number of reads that miss in the cache?
C. What is the miss…
Please answer the following Operating Systems question and its two parts correctly and completely. *If you answer the question's two parts correctly, I will give you a thumbs up. Thanks.
Assume a system using paged virtual memory with a page size of 512. Physical memory in this system contains fewer than 512 frames.
See the following two code examples:

Example 1:
int arr[512][512]; // Each row occupies exactly 1 page
for (i = 0; i < 512; i++) {
    for (j = 0; j < 512; j++) {
        cout << arr[i][j];
    }
}

Example 2:
int arr[512][512]; // Each row occupies exactly 1 page
for (i = 0; i < 512; i++) {
    for (j = 0; j < 512; j++) {
        cout << arr[j][i];
    }
}

For each of these examples, give the number of page faults that will occur. Explain the reasoning for your answers.
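One common line of reasoning, offered here as a hedged sketch (assuming demand paging, that only accesses to arr cause faults, and that with fewer than 512 frames a page is evicted before the traversal returns to it):

$$ \text{Example 1 (row-major): } 512 \text{ faults, one per row/page}; \qquad \text{Example 2 (column-major): } 512 \times 512 = 262{,}144 \text{ faults, one per access}. $$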
Please give me the correct solution.
Given an array A[0..n-1], write the following CUDA program using shared memory:
Each thread splits and merges two subarrays of size n/p in each iteration. Use shared memory and multiple blocks.
Two threads (A and B) are concurrently running on a dual-core processor that implements a sequentially consistent memory model. Assume that the value at address (R10) is initialized to 0. The instruction st immediate, (R10) writes an immediate number into the memory address stored in R10.

Thread A (core 1)        Thread B (core 2)
1: st 0x1, (R10)         1: st 0x3, (R10)
2: ld R1, (R10)          2: ld R3, (R10)
3: st 0x2, (R10)         3: st 0x4, (R10)
4: ld R2, (R10)          4: ld R4, (R10)

After both threads have finished executing, you find that (R1, R2, R3, R4) = (1, 2, 3, 4). How many different instruction interleavings of the two threads produce this result (please show all possible interleavings)?
Suppose that when run using a single thread, a particular program spends 20 seconds executing code that cannot be parallelized, and 280 seconds executing code that is highly parallelizable. Using Amdahl’s law, give an upper bound on the speedup (ratio of old to new execution time) that might be achieved for this program by exploiting parallel execution on a highly-parallel computer with thousands of processors/cores.
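A hedged worked bound using Amdahl's law (serial portion 20 s, parallelizable portion 280 s; with thousands of cores the parallel part's time approaches zero):

$$ S(p) = \frac{20 + 280}{20 + 280/p} \;\le\; \frac{300}{20} = 15 $$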
Question 8:
A CPU generates 32-bit virtual addresses. The page size is 4 KB. The processor has a translation look-aside buffer (TLB) which can hold a total of 128 page table entries and is 4-way set associative. The minimum size of the TLB tag is:
1. 11 bits
2. 13 bits
3. 15 bits
4. 20 bits
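A hedged sketch of the usual calculation (assuming the tag is whatever remains of the virtual page number after the set index):

$$ \text{offset} = \log_2 4096 = 12 \text{ bits}, \quad \text{VPN} = 32 - 12 = 20 \text{ bits}, \quad \text{sets} = \tfrac{128}{4} = 32 \Rightarrow \text{index} = 5 \text{ bits}, \quad \text{tag} = 20 - 5 = 15 \text{ bits}. $$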
This is a High-Performance Computing question:
Assume you have the code of a naïve parallel version of Matrix-Matrix multiplication using CUDA and C++ in this way: (A naïve parallel version of Matrix-Matrix multiplication using CUDA and C++. Note that the kernel should do the multiplication. Use square matrices. Use 1D execution configurations so that a thread loads a whole. The code has 4 different sizes of matrices over 1000; use 2 different block sizes.)
Provide the code using CUDA and C++ for an OPTIMIZED parallel version of a Matrix-Matrix multiplication with CUDA and by "Varying the size of your computational grid: change number of CUDA threads and blocks".
Requirements:
- Compare that results are correct by comparing the results with cuBLAS's.
- Use double precision for the whole program.
- Use square matrices.
- Calculate the time that it took for the kernel to do the multiplication.
- Calculate the time that it took since transferring the matrices from host to device up to retrieving the results…
A 32-bit computer has a data cache memory of 8 KB with lines of 64 bytes. Calculate the hit ratio of this program:

double a[1024], b[1024], c[1024], d[1024];
// A double occupies 8 bytes;
// the arrays are stored consecutively in memory.
for (int i = 0; i < 1024; i++)
    a[i] = b[i] + c[i] + d[i];

The cache uses a direct mapped function and write-back policy.
The cache uses a fully associative cache with LRU as replacement algorithm.
The cache uses a 2-way associative cache and a 4-way associative cache with LRU as replacement algorithm.
Suresh and Ramesh started creating an application which requires some mathematical operations. They are going to deploy it on a server. They want simultaneous execution of multiple parts of the program to utilize CPU time. They decided to create 2 separate threads for the operations num**2 and sqrt(num), with a sleep of 100 ms each.
Write a program containing two threads, where each thread has a sleep of 100 ms. One thread calculates the square of the elements in the array and the other calculates the square roots. Trace the output in the threads only. Use exception handling to handle and trace interruptions if they occur.
Input: An array. Can use static data. Preferably try to use run() and start().
Output:
"Thread 1" + square(num)
"Thread 2" + square root(num)
likewise, to distinguish the thread status.
Write a C code to perform vector arithmetic: Define 3 vectors A[100], B[100], C[100].
Get n as a command line argument, for example if n = 10, then (./vector 10), and create n processes. (n will be one of the divisors of 100.)
Get the operation from the user: add, sub.
Each process will create a number of threads. Number of threads per process = 100 / (10 * number of processes). Perform the operation on a chunk of the vector; for example, if n = 10, each process will create (100/(10*10) = 1) 1 thread to add/sub 10 elements. Use execl to run the add or sub programs.
The parent should print A, B, C in a file (yourname.txt).
For example, n = 5, operation sub. Partition the work equally to each process:
P0 creates (100/(10*5) = 2) 2 threads →
  Thread00 executes A[0:9] = B[0:9] - C[0:9]
  Thread01 executes A[10:19] = B[10:19] - C[10:19]
P1 creates (100/(10*5) = 2) 2 threads →
  Thread10 executes A[20:29] = B[20:29] - C[20:29]
  Thread11 executes A[30:39] = B[30:39] - C[30:39]
and so on.
no…
Consider the following processes and their associated threads running on a multiprocessor system:

Process  Thread  Arrival Time  CPU Burst  I/O Burst  Total CPU Time
X        X1      0             4          2          11
         X2      3             2          4          5
         X3      3             2          4          5
Y        Y1      5             6          2          14
         Y2      5             3          5          7
Z        Z1      7             1          1          6
         Z2      7             3          2          10
Create a scheduling simulation for these threads on a system with two (2) processors using the First Come First Served (FCFS) algorithm. A thread may migrate from one processor to another if it returns from blocked waiting time and 1) the processor it was running on previously is occupied with other work; and 2) and the other processor is available to execute the thread. If both processors are available when a thread becomes unblocked, it will remain on the processor it was most recently running on.
In this simulation, you will be managing a single ready queue that schedules all processes. When two processes arrive in the ready queue at…
Q: A digital computer has a memory unit of 64K × 16 and a cache memory of 1K words. The cache uses direct mapping with a block size of 4 words.
i) How many bits are there in the tag, index, block, and word fields of the address format?
ii) How many bits are there in each word of the cache?
iii) How many blocks can the cache accommodate?
Given an array A[0..n-1], write the following versions of CUDA programs with and without using shared memory.
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
Experiment to get the best performance.
Show all working, explaining each step in detail.
HELP WITH PART D AND E PLEASE! Please bold the answers.
Construct a multi-threaded Java program to search for an element in a randomly initialized input array of 100 elements, where each thread searches an equal partition of the array. For example, with two threads, each thread searches 50 elements.
The following page table is for a system with 16-bit virtual and physical addresses and with 4,096-byte pages. The reference bit is set to 1 when the page has been referenced. Periodically, a thread zeroes out all values of the reference bit. A dash for a page frame indicates the page is not in memory. The page-replacement algorithm is localized LRU, and all numbers are provided in decimal.
Page  Page Frame  Reference Bit
0     7           0
1     15          0
2     10          0
3     13          0
4     14          0
5     --          0
6     5           0
7     0           0
8     --          0
9     9           0
10    1           0
11    11          0
12    2           0
13    --          0
14    3           0
15    8           0
Convert the following virtual addresses (in hexadecimal) to the equivalent physical addresses. You may provide answers in either hexadecimal or decimal.
Show the calculation steps
Also set the reference bit for the appropriate entry in the page table.
0xD551
0x8D17
0x33E2
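A hedged sketch of the translation mechanics for the one address whose page is resident in the table above (4,096-byte pages give a 12-bit offset, so the top 4 bits of a 16-bit address are the page number); the other two addresses fall on pages 13 and 8, which the table marks as not in memory, so they would fault:

$$ \texttt{0x33E2}: \text{page} = \texttt{0x3} \rightarrow \text{frame } 13, \quad \text{physical} = 13 \times 4096 + \texttt{0x3E2} = \texttt{0xD3E2}. $$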