数据库代写|W4112 Database Systems Implementation Homework 2

这是一个美国的数据库作业代写

1. In class we saw that hash partitioning of a table T larger than RAM could be useful for various
relational algebra operations. Suppose our RAM is M blocks big. Suppose also that (unlike what we
covered in class) we separately count the number s of sequential I/O requests and the number r of
random I/O requests. The overall cost would then be s∗Cs +r ∗Cr where Cs is the cost of a sequential
I/O and Cr is the cost of a random I/O. For some devices, Cr is much bigger than Cs.

In this question, we’ll try to convert some of the random I/Os into sequential I/Os. In particular,
instead of bringing in one block from the input, we’ll bring in k contiguous blocks at a time. Similarly,
for each hash partition, we’ll buffer k blocks and only write out data to the disk in units of k contiguous
blocks.

(a) Calculate r and s for a single partitioning pass through the input, in terms of k and jT j (the
number of blocks in T).

(b) Write an inequality that represents the conditions on k and jT j that ensure the partitions will
each fit in RAM after one partitioning pass. For this question you may assume that the hash
function perfectly partitions the data into equal-sized pieces.

(c) How would your answer to the previous question change if you also implemented double buffering
to overlap CPU and I/O latencies?

(d) Given your answers above, how would you choose k in practice?

2. Table R has 500,000,000 records, a clustering B-tree index on R:A, a secondary hash index on R:B, and
a secondary B-tree index on (R:C; R:D). A disk page holds 100 records from R on average, and tree
nodes hold 500 records on average in a disk page. According to the catalog, A is a key for the table,
there are 1,000,000 distinct B values, 15,000 distinct C values, and 100 distinct D values. For each
of the SQL or relational algebra queries below, give the best access plan for the query and estimate
its cost. You can use Figure 15.3 for cost formulas, and assume tS = 5 milliseconds and tT = 0:1
milliseconds.

(a) σA=5(R).
(b) σB=5(R).
(c) σC=5(R).
(d) σB=5 and C=5(R).
(e) σC=5 and D=5(R).
(f) σB=5 or C=5(R).
(g) σC=5 or D=5(R).
(h) Select C, sum(D) From R Group by C
(i) Select D, sum(C) From R Group by D

3. We saw in class that multidimensional histograms can help determine selectivities for correlated
columns. There are several ways to choose the bucket boundaries for such histograms. Take a look at

http://www.vldb.org/conf/1997/P486.PDF and answer the following:

(a) Based on Section 5 of that paper, explain how the MHIST-2 method chooses bucket boundaries
for 2-dimensional histograms. (Two paragraphs is about right.)

(b) Based on Section 7, outline how the authors determine that MHIST-2 performs well. (Two
paragraphs is plenty.)

4. Take a look at https://www.ibm.com/docs/en/db2/11.1?topic=methods-scan-sharing and answer
the following:

(a) Explain the concept of scan sharing. What performance advantages does it offer?