CUDA编程代写 | COMP 3231 2020-2021

COMP 3231 2020-2021

Question 1 (CUDA Programming). (30’)
In this question, you are going to write a CUDA program that finds the largest single-precision float value in a large
array. You are given a sample code, which contains the basic code and supplementary functions to compute
the max value in a float array. However, the current code only works for array of length < 256. Your code must be able to run on the CS GPU FARM. a). Fix the host code so that it can handle array of arbitrary length. (20’) 1. Change *only* the host (CPU) code so that it iteratively launches kernels to find the maximum value of arbitrary length vectors. This should involve some sort of loop wrapping the kernel switch and case statements. 2. Make sure it can run for very large (1,000,000) vectors. 3. HINT: if every block calculates a local maximum, what is the new problem size/vector length for the next kernel launch? Do not worry about the inefficiencies of multiple memory transfers. 4. All reductions (i.e., comparison) should be done in GPU. You cannot generate an intermediate array and find the max using CPU. b). Change the kernel code to leverage shared memory. (10’) 1. The latency of an access to shared memory (which resides inside an SMM/SMX) is orders of magnitude less than an access to main memory. 2. Re-write the kernel so that all threads in a thread block first load their values to a buffer in shared memory. Once all of these values are loaded, have our threadIdx.x == 0 thread (I’ll call this the “lead” thread) find the maximum from this structure. 3. Remember that relaxed consistency means that there is no guarantee as to when the lead thread will see other thread’s writes to the shared memory. Place fences/barriers accordingly. 4. You should *not* change the CPU code in this part; i.e., you should use the CPU code you wrote for part (a). 5. Part b will only be evaluated based on correctness. Question 2-3 (CPU Pipeline). In these two questions, you are going to implement a simple superscalar processor, with instruction pipelining, out-oforder execution (Tomasulo algorithm), and branch prediction. Assumptions: For simplicity, you do not have to model issue width, retire width, number of result buses and PRF (Physical Register File) ports. Assume these are unlimited and do not stall the processor pipeline. Some other assumptions that have been made have been described where relevant. Code template: We have provided you with the following files: 1. Makefile: to compile your code 2. Driver.cpp: contains the main() method to run the simulator: 3. Processor.hpp: Used for defining structures and method declarations: you may edit this file to declare or define structures and methods 4. Processor.cpp: All your methods are written here 5. Traces: contains the traces to pass to the simulator (more details in the later section How to run: The generated binary (processor) should be run from the root directory as: cat tracefile |./processor –f F –k K –l L -m M -r R The command line parameters are as follows: • F – Dispatch rate (instructions per cycle) • K – Number of k0 function units # ALU Units (Used for BRANCH as well) • L – Number of k1 function units # MUL Units • M – Number of k2 function units # LOAD / STORE instruction • R – Number of reservation stations per function unit type. That means there are a total of R x (K + L + M) reservation stations • – Path name to the trace file