C语言代写 | W4112 Database Systems Implementation
In this project, you are asked to implement in C algorithms for various versions of binary search and to use them to implement an in-memory band-join algorithm. Grading will be based on both your code, and your analysis of the performance results you generate on Google cloud machines. You will work in groups of two or three. For three person groups, there is an additional programming requirement.
By default we’ll be using the same groups as for project 1. If your partner from project 1 dropped the class, or if you wish to work with diﬀerent people this time, the TAs will facilitate the matching of students. The TAs will post details of the matching process on Ed. If for some reason (e.g., a partner drops the class well into the semester) a student ends up working alone, the student is expected to complete the entire project expectations for 2-person groups.
The project consists of several parts, each with its own submission deadline. You can download the code templates from courseworks under the “Files” tab. You will be running the code on Google cloud machines. The TAs will provide each of you with a $50 coupon for the use of the Google cloud. The TAs will separately provide documentation for how to use the Google cloud for this project. In particular, it will be important for you to select the right class of machines that support the full AVX512 SIMD instruction set.
Details of exactly how to submit the code and pdf ﬁles will be provided later.
In this part of the project, you will explore the conversion of control dependencies to data dependencies to improve performance. You are supplied with a fully functional C function lower bound that performs a binary search where you want the the index of the ﬁrst key that is bigger than or equal to the search key. You are initially expected to write two C functions lower bound nb arithmetic and lower bound nb mask that replace the if test in lower bound with code that does not use either the if statement or the C idiom condition ? x : y. As implied by the names of the functions, lower bound nb arithmetic should use arithmetic operations to avoid if statements, while lower bound nb mask should use bit manipulation operations (e.g., &,|,~) and masking. Hint: you may use one subtraction in the lower bound nb mask function, and the 64-bit representation of -1 is 0xFFFFFFFFFFFFFFFF.
You should now write a function lower bound nb mask 8x that tries to do 8 binary searches in parallel, hoping that the underlying system will overlap the cache miss latencies for concurrent searches. Inside this routine, there should be an outer while loop to check a termination condition, and an inner for loop to iterate through each of the 8 searches. The inner loop should use the same mask-based computation as your lower bound nb mask algorithm. These is a subtlety in this parallelization in that some of the 8 searches may need more iterations than others. In such a case your loop should continue until all searches are complete and should make sure that “extra” loops of a completed search do not change the search result.
Finally, you should write a function lower bound nb mask 8x AVX512 that uses AVX512 SIMD instruc- tions to do 8 binary searches in parallel. In AVX512, 512-bit registers can hold eight 64-bit values and operate on the component values in parallel. You can ﬁnd a set of functions called “intrinsics” that will be useful for the project here. For our purposes, you will primarily need functions that have a mm512 preﬁx and a suﬃx epi64 meaning that they operate on 64-bit signed integers. Within these instructions, you’ll probably want to use some combination of the following types of instructions:
setl These instructions allow you to load all elements of a vector with a scalar value.
cmp.. These are various comparison instructions such as cmpge and cmplt that compare two registers and
set a scalar bitmap as the result.
srli Shift right. Handy for dividing by 2.
mask blend Combine data from two diﬀerent registers into a third register depending on the mask bitmap.
i64gather Load 8 values from 8 diﬀerent oﬀsets from a base address.
add,sub,… There are various arithmetic functions available.
You will proﬁle the code using the provided bulk binary search and bulk binary search 8x functions. Simply uncomment the variant of the code that you want to measure within bulk binary search and bulk binary search 8x. You will run the code by invoking
./db4112 N X Y Z R
For this part of the project you can ignore X, Y and Z and supply dummy values. N is the number of elements in the array being searched, which is also the same as the number of probes. For this test, the provided code searches the array for each of its elements in a random order. R is the number of repeats of the experiment within the timing loop, which you should set to an appropriate number to deliver stable results. (If this number is too low, then various overheads of setting up the search may dominate the measured time; if this number is too high, your experiment will take too long.) For example, when I run
./db4112 10000 1 1 1 1000
to proﬁle lower bound on a local machine, the output is
Time in bulk_binary_search loop is 1016150 microseconds or 0.101615 microseconds per search Time in bulk_binary_search_8x loop is …
For this part of the project you will submit your code for the lower bound nb arithmetic, lower bound nb mask, lower bound nb mask 8x and lower bound nb mask 8x AVX512 functions. You will also submit a pdf doc-ument containing a graph showing the measured performance of the ﬁve routines as you varied N from 10 to 107 . So N should be on the x-axis in log-scale, time-per-search should be on the y-axis in linear-scale, and your graph should have ﬁve labeled curves, one per function. Make sure to measure enough N values to generate reasonable resolution in your results. In the pdf document, you should: