Sample Final Exam
CS 217, Computer Science
University of California, Riverside
1. Matrix Multiplication
Assume a tiled matrix multiplication kernel that handles boundary conditions, using 32 × 32 tiles to process square 1,000 × 1,000 matrices.

a. How many thread blocks are launched?
ceil(1000/32) = 32 blocks in each dimension, so 32 × 32 = 1,024 thread blocks.

b. How many warps, in the entire kernel, will have control divergence due to handling boundary conditions when writing to the output matrix?
1,000 warps.

c. What are the benefits of tiling compared to a simple matrix multiplication? Please keep your answer short and concise.
Tiling uses shared memory to reduce the number of global memory accesses, resulting in better performance.
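For reference, here is a minimal sketch of the kind of tiled kernel these questions assume (TILE_WIDTH, matrixMulTiled, and the row-major layout are illustrative assumptions, not the exam's own code):

#define TILE_WIDTH 32

// Tiled matrix multiplication P = M * N for Width x Width square matrices.
// Boundary checks handle a Width (e.g., 1000) that is not a multiple of the tile size.
__global__ void matrixMulTiled(const float *M, const float *N, float *P, int Width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;   // ceil(Width / TILE_WIDTH)
    for (int t = 0; t < numTiles; ++t) {
        // Each thread loads one element of the M tile and one of the N tile into
        // shared memory, guarding against reads past the matrix edge.
        int mCol = t * TILE_WIDTH + threadIdx.x;
        int nRow = t * TILE_WIDTH + threadIdx.y;
        Ms[threadIdx.y][threadIdx.x] = (row < Width && mCol < Width) ? M[row * Width + mCol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] = (nRow < Width && col < Width) ? N[nRow * Width + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }

    // The boundary check on this store is where the divergence counted in part (b)
    // occurs: only warps straddling the right edge of the output mix writing and
    // non-writing threads.
    if (row < Width && col < Width)
        P[row * Width + col] = acc;
}

Launched as matrixMulTiled<<<dim3(32, 32), dim3(32, 32)>>>(M, N, P, 1000), this gives the 32 × 32 = 1,024 thread blocks from part (a).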
1. Matrix Multiplication [20 points total]

a. [5 points] For an untiled matrix-matrix multiplication kernel, assume we have an M and N matrix of size 32 × 32. How many total global memory loads are issued to matrix M?
32 × 32 × 32 = 32,768. Every element of M is accessed by 32 threads.

b. [4 points] For the output matrix P, how many stores to global memory are issued?
32 × 32 = 1,024

c. [5 points] For our tiled matrix-matrix multiplication kernel, if we use a 32 × 32 tile, how many global memory loads are issued to matrix M per tile?
32 × 32 = 1,024

d. [6 points] For the tiled single-precision (32-bit) matrix multiplication kernel, assume that the tile size is 32 × 32 and the system has a DRAM burst size of 128 bytes. How many DRAM bursts will be delivered to the processor as a result of loading one M-matrix tile by a thread block? Assume that the tile being loaded is completely within the range of the M-matrix.
(32 × 32 × 4 bytes) / 128 bytes = 32 DRAM bursts
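For comparison, a sketch of the untiled kernel these counts describe (kernel and variable names are illustrative):

// Untiled (naive) matrix multiplication: one thread per output element.
__global__ void matrixMulNaive(const float *M, const float *N, float *P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float acc = 0.0f;
        for (int k = 0; k < Width; ++k)
            acc += M[row * Width + k] * N[k * Width + col]; // one global load from M and one from N per iteration
        P[row * Width + col] = acc;                         // one global store per thread
    }
}

With Width = 32, each of the 32 × 32 threads issues 32 loads from M, giving the 32 × 32 × 32 = 32,768 loads in part (a), and one store to P, giving the 1,024 stores in part (b).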
2. [20 points] For each of the following, explain how it could harm performance and describe possible ways the program can be modified to reduce this effect. Please be specific.

a. [10 points] The application needs to access global memory to get one value for every operation.
How this harms performance: Global memory accesses are slow; they can stall computation and make memory bandwidth the bottleneck.
Technique/change that could reduce this effect: Tiling, i.e., using shared memory to cache data on chip and maximize reuse.
2. Atomics [8 points]
Atomic operations perform a read-modify-write sequence as a single, indivisible operation. In the following atomic instruction: atomicAdd(&Sum, 1)

What is the instruction:
Reading? The current value of Sum.
Modifying? Computing Sum + 1.
Writing? Writing Sum + 1 back to Sum.

Why do we need atomic operations? (Keep answer to 1-2 sentences.)
To avoid race conditions when multiple threads need to read and update the same memory location.
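A minimal sketch of that instruction in context (the kernel and counter names are illustrative):

// Every thread adds 1 to the shared counter *Sum. atomicAdd performs the read,
// the add, and the write-back as one indivisible operation, so no update is
// lost even when many threads hit the same address at once.
__global__ void countThreads(int *Sum) {
    atomicAdd(Sum, 1);
}

Launched with 1,024 threads, *Sum ends up exactly 1,024; with a plain *Sum = *Sum + 1 the final value could be smaller because concurrent read-modify-write sequences overwrite each other.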
3. Race Conditions [10 points]
Assume I have two threads running the following instructions with Mem[x] = 2 initially:

thread1:               thread2:
Old <- Mem[x]          Old <- Mem[x]
New <- Old + 5         New <- Old + 2
Mem[x] <- New          Mem[x] <- New

a. What are the possible values of Old?
2, 4, 7

b. What are the possible values of Mem[x]?
2, 4, 7, 9

c. Assume the value to be incremented (5 and 2) is stored in a variable, value. What is the corresponding atomicAdd call to ensure this example works?
atomicAdd(&(Mem[x]), value);
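In CUDA terms, the fix replaces each thread's separate read / add / write with one atomic call. A minimal sketch, assuming Mem is a global int array and value holds each thread's increment (names are illustrative):

__global__ void safeIncrement(int *Mem, int x, int value) {
    // Without the atomic, two threads could both read the old Mem[x] and one
    // of the two updates would be lost, as in the interleavings above.
    atomicAdd(&Mem[x], value);
}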
Related Questions
Given an array A[0..n-1], write the following CUDA program WITHOUT USING SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
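As a rough illustration of the "compare and exchange" pattern these questions ask about, here is a hedged sketch of one odd-even exchange pass over global memory (assuming an odd-even transposition sort is intended; the kernel name and parameters are illustrative):

// One pass: thread i compares the pair starting at index 2*i + phase (phase is 0 or 1)
// and swaps the two items if they are out of order. Only global memory is used.
__global__ void oddEvenPass(int *A, int n, int phase) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = 2 * i + phase;
    if (idx + 1 < n && A[idx] > A[idx + 1]) {
        int tmp = A[idx];
        A[idx] = A[idx + 1];
        A[idx + 1] = tmp;
    }
}

For part (a), a single block of n/2 threads can cover every pair; for part (b), the grid is sized so all blocks together cover n/2 pairs, and the host relaunches the kernel for n passes, alternating phase between 0 and 1, since blocks cannot synchronize globally inside one kernel launch.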
Given an array A[0..n-1], write the following CUDA program using SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
Given an array A[0..n-1], write the following CUDA program:
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
Question 3: [9 marks]
Suppose we have an array of size n which stores random numbers. Also, we have n threads, each of which should sum the numbers stored in the array positions 0 to m-1, where m is the number of that thread, and then store the result in position m-1.

Value: 215, 25, 17, 1001, …, 5, -150
Index:   0,   1,  2,    3, …, n-2, n-1

Apparently the threads can run independently, but when the array is inspected after execution the results are incorrect.
a) Why were the results incorrect? What is this problem called? (4 marks)
b) How can such a case be handled in order to always guarantee correct results? Suggest the most appropriate solution for this case and justify your choice. (5 marks)
Note: No code or mathematical examples or solutions are required in your answer.
Please help me with this operating systems principles homework
Topic: OpenMP #pragma omp parallel for and #pragma omp master (Distributed and Parallel Computing Lab)
The master construct denotes a block that is only executed by the master thread. Note that there is no synchronization (implicit barrier) for the master construct. The other threads will skip over this block and continue processing without waiting for the master thread. Write a program that computes the average of a large array using a parallel for construct. While it is running with the #pragma omp parallel for construct, also use a master construct (outside the for loop) to keep track of how many iterations have been executed and print out a progress report.
Q. The following code is what I have written so far, but the average value at the end comes out as zero, and the number of iterations was only one, which I don't think reflects what this program is supposed to do. Please modify my current code to meet the criteria explained above.
#include <omp.h>#include…
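Not the asker's code (which is cut off above), but a minimal sketch of the parallel-for / master combination being described, with illustrative names; the master block here reports once rather than continuously, and the iteration count is kept in a shared atomic counter that a progress report could poll:

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000000;
    static double a[1000000];
    for (int i = 0; i < N; ++i) a[i] = i;    // sample data

    double sum = 0.0;
    int done = 0;                            // iterations completed so far

    #pragma omp parallel
    {
        #pragma omp master
        std::printf("progress is tracked by the master thread (%d threads total)\n",
                    omp_get_num_threads());

        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; ++i) {
            sum += a[i];
            #pragma omp atomic
            done++;                          // shared progress counter
        }
    }
    std::printf("average = %f after %d iterations\n", sum / N, done);
    return 0;
}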
To sort a vector of 1000 random integers:
b) Parallelize it using OpenMP for 2, 4, 8, 16, and 32 threads. Report the speedup over the serial implementation.
Given an array A[0..n-1], write the following CUDA program without using shared memory:
Each thread compares and exchanges two items in each iteration, but use shared memory and multiple blocks.
Given an array A[0..n-1], write the following CUDA program using SHARED MEMORY:
Each thread compares and exchanges two items in each iteration, but use shared memory and multiple blocks.
The heart of the recent hit game SimAquarium is a tight loop that calculates the average position of 256 algae. You are evaluating its cache performance on a machine with a 1,024-byte direct-mapped data cache with 16-byte blocks (B = 16). You are given the following definitions:

struct algae_position {
    int x;
    int y;
};

struct algae_position grid[16][16];
int total_x = 0, total_y = 0;
int i, j;

You should also assume the following:
- sizeof(int) = 4.
- grid begins at memory address 0.
- The cache is initially empty.
- The only memory accesses are to the entries of the array grid. Variables i, j, total_x, and total_y are stored in registers.

Determine the cache performance for the following code:

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_x += grid[i][j].x;
    }
}

for (i = 0; i < 16; i++) {
    for (j = 0; j < 16; j++) {
        total_y += grid[i][j].y;
    }
}

A. What is the total number of reads?
B. What is the total number of reads that miss in the cache?
C. What is the miss…
Please answer the following Operating Systems question and its two parts correctly and completely. *If you answer the question's two parts correctly, I will give you a thumbs up. Thanks.
Assume a system using paged virtual memory with a page size of 512. Physical memory in this system contains fewer than 512 frames.
See the following two code examples:

Example 1:
int arr[512][512]; // Each row occupies exactly 1 page
for (i = 0; i < 512; i++) {
    for (j = 0; j < 512; j++) {
        cout << arr[i][j];
    }
}

Example 2:
int arr[512][512]; // Each row occupies exactly 1 page
for (i = 0; i < 512; i++) {
    for (j = 0; j < 512; j++) {
        cout << arr[j][i];
    }
}

For each of these examples, give the number of page faults that will occur. Explain the reasoning for your answers.
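One common line of reasoning, offered here as a hedged sketch (assuming demand paging, that only accesses to arr cause faults, and that with fewer than 512 frames a page is evicted before the traversal returns to it):

$$ \text{Example 1 (row-major): } 512 \text{ faults, one per row/page}; \qquad \text{Example 2 (column-major): } 512 \times 512 = 262{,}144 \text{ faults, one per access}. $$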
Please give me the correct solution.
Given an array A[0..n-1], write the following CUDA program using shared memory:
Each thread splits and merges two subarrays of size n/p in each iteration. Use shared memory and multiple blocks.
Two threads (A and B) are concurrently running on a dual-core processor that implements a sequentially consistent memory model. Assume that the value at address (R10) is initialized to 0. The instruction st immediate, (R10) writes an immediate number into the memory address stored in R10.

Thread A (core 1)        Thread B (core 2)
1: st 0x1, (R10)         1: st 0x3, (R10)
2: ld R1, (R10)          2: ld R3, (R10)
3: st 0x2, (R10)         3: st 0x4, (R10)
4: ld R2, (R10)          4: ld R4, (R10)

After both threads have finished executing, you find that (R1, R2, R3, R4) = (1, 2, 3, 4). How many different instruction interleavings of the two threads produce this result (please show all possible interleavings)?
Suppose that when run using a single thread, a particular program spends 20 seconds executing code that cannot be parallelized, and 280 seconds executing code that is highly parallelizable. Using Amdahl’s law, give an upper bound on the speedup (ratio of old to new execution time) that might be achieved for this program by exploiting parallel execution on a highly-parallel computer with thousands of processors/cores.
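A hedged worked bound using Amdahl's law (serial portion 20 s, parallelizable portion 280 s; with thousands of cores the parallel part's time approaches zero):

$$ S(p) = \frac{20 + 280}{20 + 280/p} \;\le\; \frac{300}{20} = 15 $$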
Question 8:
A CPU generates 32-bit virtual addresses. The page size is 4 KB. The processor has a translation look-aside buffer (TLB) which can hold a total of 128 page table entries and is 4-way set associative. The minimum size of the TLB tag is:
1. 11 bits
2. 13 bits
3. 15 bits
4. 20 bits
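A hedged sketch of the usual calculation (assuming the tag is whatever remains of the virtual page number after the set index):

$$ \text{offset} = \log_2 4096 = 12 \text{ bits}, \quad \text{VPN} = 32 - 12 = 20 \text{ bits}, \quad \text{sets} = \tfrac{128}{4} = 32 \Rightarrow \text{index} = 5 \text{ bits}, \quad \text{tag} = 20 - 5 = 15 \text{ bits}. $$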
This is a High-Performance Computing question:
Assume you have the code of a naïve parallel version of Matrix-Matrix multiplication using CUDA and C++ in this way: (A naïve parallel version of Matrix-Matrix multiplication using CUDA and C++. Note that the kernel should do the multiplication. Use square matrices. Use 1D execution configurations so that a thread loads a whole. The code has 4 different sizes of matrices over 1000; use 2 different block sizes.)
Provide the code using CUDA and C++ for an OPTIMIZED parallel version of a Matrix-Matrix multiplication with CUDA and by "Varying the size of your computational grid: change number of CUDA threads and blocks".
Requirements:
- Compare that results are correct by comparing the results with cuBLAS's.
- Use double precision for the whole program.
- Use square matrices.
- Calculate the time that it took for the kernel to do the multiplication.
- Calculate the time that it took since transferring the matrices from host to device up to retrieving the results…
A 32-bit computer has a data cache memory of 8 KB with lines of 64 bytes. Calculate the hit ratio of this program:

double a[1024], b[1024], c[1024], d[1024];
// A double occupies 8 bytes;
// the arrays are stored consecutively in memory.
for (int i = 0; i < 1024; i++)
    a[i] = b[i] + c[i] + d[i];

The cache uses a direct mapped function and write-back policy.
The cache uses a fully associative cache with LRU as replacement algorithm.
The cache uses a 2-way associative cache and a 4-way associative cache with LRU as replacement algorithm.
Suresh and Ramesh started creating an application which requires some mathematical operations. They are going to deploy it on a server. They want simultaneous execution of multiple parts of the program to utilize CPU time. They decided to create 2 separate threads for the operations num**2 and sqrt(num), with a sleep of 100 ms each.
Write a program containing two threads, where each thread has a sleep of 100 ms. One thread calculates the square of the elements in the array and the other calculates the square roots. Trace the output in the threads only. Use exception handling to handle and trace interruptions if they occur.
Input: An array. Can use static data. Preferably try to use run() and start().
Output:
"Thread 1" + square(num)
"Thread 2" + square root(num)
likewise, to distinguish the thread status.
Write a C code to perform vector arithmetic: Define 3 vectors A[100], B[100], C[100].
Get n as a command line argument, for example if n = 10, then (./vector 10), and create n processes. (n will be one of the divisors of 100.)
Get the operation from the user: add, sub.
Each process will create a number of threads. Number of threads per process = 100 / (10 * number of processes). Perform the operation on a chunk of the vector; for example, if n = 10, each process will create (100/(10*10) = 1) 1 thread to add/sub 10 elements. Use execl to run the add or sub programs.
The parent should print A, B, C in a file (yourname.txt).
For example, n = 5, operation sub. Partition the work equally to each process:
P0 creates (100/(10*5) = 2) 2 threads →
  Thread00 executes A[0:9] = B[0:9] - C[0:9]
  Thread01 executes A[10:19] = B[10:19] - C[10:19]
P1 creates (100/(10*5) = 2) 2 threads →
  Thread10 executes A[20:29] = B[20:29] - C[20:29]
  Thread11 executes A[30:39] = B[30:39] - C[30:39]
and so on.
no…
Consider the following processes and their associated threads running on a multiprocessor system:

Process  Thread  Arrival Time  CPU Burst  I/O Burst  Total CPU Time
X        X1      0             4          2          11
         X2      3             2          4          5
         X3      3             2          4          5
Y        Y1      5             6          2          14
         Y2      5             3          5          7
Z        Z1      7             1          1          6
         Z2      7             3          2          10
Create a scheduling simulation for these threads on a system with two (2) processors using the First Come First Served (FCFS) algorithm. A thread may migrate from one processor to another if it returns from blocked waiting time and 1) the processor it was running on previously is occupied with other work; and 2) and the other processor is available to execute the thread. If both processors are available when a thread becomes unblocked, it will remain on the processor it was most recently running on.
In this simulation, you will be managing a single ready queue that schedules all processes. When two processes arrive in the ready queue at…
Q: A digital computer has a memory unit of 64K × 16 and a cache memory of 1K words. The cache uses direct mapping with a block size of 4 words.
i) How many bits are there in the tag, index, block, and word fields of the address format?
ii) How many bits are there in each word of the cache?
iii) How many blocks can the cache accommodate?
Given an array A[0..n-1], write the following versions of CUDA programs with and without using shared memory.
Each thread compares and exchanges two items in each iteration, but using only global memory.
(a) Use only one block of threads.
(b) Use multiple blocks.
Experiment to get the best performance.
Show all working, explaining each step in detail.
HELP WITH PART D AND E PLEASE! Please bold the answers.
Construct a multi-threaded Java program to search for an element in a randomly initialized input array of 100 elements, where each thread searches an equal partition of the array. For example, with two threads, each thread searches 50 elements.
The following page table is for a system with 16-bit virtual and physical addresses and with 4,096-byte pages. The reference bit is set to 1 when the page has been referenced. Periodically, a thread zeroes out all values of the reference bit. A dash for a page frame indicates the page is not in memory. The page-replacement algorithm is localized LRU, and all numbers are provided in decimal.
Page  Page Frame  Reference Bit
0     7           0
1     15          0
2     10          0
3     13          0
4     14          0
5     --          0
6     5           0
7     0           0
8     --          0
9     9           0
10    1           0
11    11          0
12    2           0
13    --          0
14    3           0
15    8           0
Convert the following virtual addresses (in hexadecimal) to the equivalent physical addresses. You may provide answers in either hexadecimal or decimal.
Show the calculation steps
Also set the reference bit for the appropriate entry in the page table.
0xD551
0x8D17
0x33E2
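A hedged sketch of the translation mechanics for the one address whose page is resident in the table above (4,096-byte pages give a 12-bit offset, so the top 4 bits of a 16-bit address are the page number); the other two addresses fall on pages 13 and 8, which the table marks as not in memory, so they would fault:

$$ \texttt{0x33E2}: \text{page} = \texttt{0x3} \rightarrow \text{frame } 13, \quad \text{physical} = 13 \times 4096 + \texttt{0x3E2} = \texttt{0xD3E2}. $$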