CS 475/575 Project #5 CUDA: Monte Carlo Simulation



Monte Carlo simulation is used to determine the range of outcomes for a series of parameters, each of which has a probability distribution showing how likely each option is to happen. In this project, you will take the Project #1 scenario and develop a CUDA-based Monte Carlo simulation of it, determining how likely a particular output is to happen. You will then take timing results and compare them with what you got with OpenMP in Project #1.
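As a rough illustration of the idea (and not the Project #1 scenario), the sketch below estimates a probability by drawing random inputs and counting how often an outcome occurs. The toy problem, the function name `EstimateProbability`, and the seed parameter are all assumptions made up for this example.

```cpp
#include <cstddef>
#include <random>

// Hypothetical toy problem: estimate the probability that the sum of two
// independent uniform [0,1) draws exceeds 1.5. The exact answer is 0.125,
// which makes it easy to see whether the estimate is reasonable.
double EstimateProbability(std::size_t numTrials, unsigned int seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    std::size_t numHits = 0;
    for (std::size_t i = 0; i < numTrials; i++)
    {
        double a = dist(gen);   // first random input
        double b = dist(gen);   // second random input
        if (a + b > 1.5)        // did the outcome of interest happen?
            numHits++;
    }

    // fraction of trials in which the outcome happened
    return static_cast<double>(numHits) / static_cast<double>(numTrials);
}
```

The estimate tightens as the number of trials grows, which is why the larger NUMTRIALS values are worth running.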

The Scenario

Use the same scenario from Project #1.

Using Linux for this Project

On both rabbit and the DGX system, here is a working Makefile:
CUDA_PATH = /usr/local/apps/cuda/cuda-10.1
CUDA_NVCC = $(CUDA_PATH)/bin/nvcc
montecarlo: montecarlo.cu
	$(CUDA_NVCC) -o montecarlo montecarlo.cu
On both rabbit and the DGX system, here is a working bash script:
for t in 1024 4096 16384 65536 262144 1048576 2097152 4194304; do
    for b in 8 32 128; do
        /usr/local/apps/cuda/cuda-10.1/bin/nvcc -DNUMTRIALS=$t -DBLOCKSIZE=$b -o montecarlo montecarlo.cu
        ./montecarlo
    done
done
You can (and should!) write scripts to run the benchmark combinations. If you want to pass in benchmark parameters, the -DNUMTRIALS=$t notation works fine in nvcc.
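For the -D notation to work, the source file has to treat NUMTRIALS and BLOCKSIZE as preprocessor symbols. A minimal sketch of one way to do that is shown here. The default values and the name `NUMBLOCKS` are assumptions for illustration, not something this handout specifies.

```cpp
// Fallback values, used only when the compile line does not pass
// -DNUMTRIALS=... or -DBLOCKSIZE=... (these defaults are arbitrary).
#ifndef NUMTRIALS
#define NUMTRIALS   1048576
#endif

#ifndef BLOCKSIZE
#define BLOCKSIZE   32
#endif

// Reject trial counts that are not a multiple of 1024.
#if (NUMTRIALS % 1024) != 0
#error "NUMTRIALS must be a multiple of 1024"
#endif

// Hypothetical name: the number of thread blocks needed so that
// NUMBLOCKS * BLOCKSIZE threads cover one trial each. The division is
// exact as long as BLOCKSIZE divides NUMTRIALS, which holds for
// block sizes of 8, 32, and 128 when NUMTRIALS is a multiple of 1024.
const int NUMBLOCKS = NUMTRIALS / BLOCKSIZE;
```

Compiling with -DNUMTRIALS=4096 -DBLOCKSIZE=8, for example, overrides both defaults and gives 512 blocks.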
Before you use the DGX, do your development on the rabbit system. It is a lot friendlier because you don’t have to run your program through a batch submission. You can take your final performance numbers on rabbit, but you will enjoy the numbers you get on the DGX more!
You can also take your benchmark numbers on your own machine. If you are trying to run CUDA on your own Visual Studio system, make sure your machine has the CUDA toolkit installed. It is available here: https://developer.nvidia.com/cuda-downloads

Running CUDA in Visual Studio

This requires a special setup so that Visual Studio knows to run nvcc in the right place. See our CUDA noteset for instructions.


1. Use these as the ranges of the input parameters when you choose random parameter values:

Variable  Description                                   Minimum  Maximum
tx        Truck X starting location in feet             -10.     10.
txv       Truck X velocity in feet/second               15.      35.
ty        Truck Y location in feet                      40.      50.
sv        Snowball overall velocity in feet/second      5.       30.
theta     Snowball horizontal launch angle in degrees   10.      70.
halflen   Truck half-length in feet                     15.      30.
Note: these are not the same numbers as we used before!

2. Run this for at least three BLOCKSIZEs (i.e., the number of threads per block) of 8, 32, and 128, combined with NUMTRIALS sizes of at least 1024, 4096, 16384, 65536, 262144, 1048576, 2097152, and 4194304. You can use more if you want.
3. Be sure each NUMTRIALS is a multiple of 1024. All of the ones above already are.
4. Record timing for each combination. For performance, use some appropriate units like MegaTrials/Second.
5. For this one, use CUDA timing, not OpenMP timing.
6. Do a table and two graphs:

7. Like Project #1 before, fill the arrays ahead of time with random values. Send them to the GPU where they can be used as look-up tables.
8. You will also need these .h files:
Just keep them in your project folder.
9. Your commentary PDF should:
    1. Tell what machine you ran this on
    2. Show the table and the two graphs
    3. What patterns are you seeing in the performance curves?
    4. Why do you think the patterns look this way?
    5. Why is a BLOCKSIZE of 8 so much worse than the others?
    6. How do these performance results compare with what you got in Project #1? Why?
    7. What does this mean for the proper use of GPU parallel computing?
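The host-side parts of requirements 1, 4, and 7 can be sketched in plain C++ that also compiles inside a .cu file. The sketch fills one look-up array per input parameter using the ranges from the table, and converts an elapsed time into MegaTrials/Second. The names `Ranf`, `TrialInputs`, `FillInputs`, and `MegaTrialsPerSecond` are assumptions for illustration. Device allocation, the host-to-device copies, the kernel, and the CUDA event timing are deliberately left out.

```cpp
#include <cstdlib>
#include <vector>

// Hypothetical helper: uniform random float in [low, high], built on rand().
float Ranf(float low, float high)
{
    float r = static_cast<float>(std::rand());     // 0. to RAND_MAX
    float t = r / static_cast<float>(RAND_MAX);    // 0. to 1.
    return low + t * (high - low);
}

// One look-up array per input parameter; element n belongs to trial n.
struct TrialInputs
{
    std::vector<float> tx, txv, ty, sv, theta, halflen;
};

// Fill every array ahead of time, using the ranges from the table above.
TrialInputs FillInputs(int numTrials, unsigned int seed)
{
    std::srand(seed);

    TrialInputs in;
    in.tx.resize(numTrials);
    in.txv.resize(numTrials);
    in.ty.resize(numTrials);
    in.sv.resize(numTrials);
    in.theta.resize(numTrials);
    in.halflen.resize(numTrials);

    for (int n = 0; n < numTrials; n++)
    {
        in.tx[n]      = Ranf(-10.f, 10.f);   // truck X start, feet
        in.txv[n]     = Ranf( 15.f, 35.f);   // truck X velocity, feet/second
        in.ty[n]      = Ranf( 40.f, 50.f);   // truck Y location, feet
        in.sv[n]      = Ranf(  5.f, 30.f);   // snowball velocity, feet/second
        in.theta[n]   = Ranf( 10.f, 70.f);   // launch angle, degrees
        in.halflen[n] = Ranf( 15.f, 30.f);   // truck half-length, feet
    }
    return in;
}

// Convert a trial count and an elapsed time in milliseconds into
// MegaTrials/Second.
double MegaTrialsPerSecond(int numTrials, double elapsedMilliseconds)
{
    double seconds = elapsedMilliseconds / 1000.0;
    return static_cast<double>(numTrials) / seconds / 1000000.0;
}
```

The time is taken in milliseconds because that is the unit cudaEventElapsedTime reports. As a sanity check on the units, 1048576 trials finishing in 1 millisecond works out to 1048.576 MegaTrials/Second.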