# 数据库代写｜W4112 Database Systems Implementation Homework 2

这是一个美国的数据库作业代写

1. In class we saw that hash partitioning of a table T larger than RAM could be useful for various

relational algebra operations. Suppose our RAM is M blocks big. Suppose also that (unlike what we

covered in class) we separately count the number s of sequential I/O requests and the number r of

random I/O requests. The overall cost would then be s∗Cs +r ∗Cr where Cs is the cost of a sequential

I/O and Cr is the cost of a random I/O. For some devices, Cr is much bigger than Cs.

In this question, we’ll try to convert some of the random I/Os into sequential I/Os. In particular,

instead of bringing in one block from the input, we’ll bring in k contiguous blocks at a time. Similarly,

for each hash partition, we’ll buffer k blocks and only write out data to the disk in units of k contiguous

blocks.

(a) Calculate r and s for a single partitioning pass through the input, in terms of k and jT j (the

number of blocks in T).

(b) Write an inequality that represents the conditions on k and jT j that ensure the partitions will

each fit in RAM after one partitioning pass. For this question you may assume that the hash

function perfectly partitions the data into equal-sized pieces.

(c) How would your answer to the previous question change if you also implemented double buffering

to overlap CPU and I/O latencies?

(d) Given your answers above, how would you choose k in practice?

2. Table R has 500,000,000 records, a clustering B-tree index on R:A, a secondary hash index on R:B, and

a secondary B-tree index on (R:C; R:D). A disk page holds 100 records from R on average, and tree

nodes hold 500 records on average in a disk page. According to the catalog, A is a key for the table,

there are 1,000,000 distinct B values, 15,000 distinct C values, and 100 distinct D values. For each

of the SQL or relational algebra queries below, give the best access plan for the query and estimate

its cost. You can use Figure 15.3 for cost formulas, and assume tS = 5 milliseconds and tT = 0:1

milliseconds.

(a) σA=5(R).

(b) σB=5(R).

(c) σC=5(R).

(d) σB=5 and C=5(R).

(e) σC=5 and D=5(R).

(f) σB=5 or C=5(R).

(g) σC=5 or D=5(R).

(h) Select C, sum(D) From R Group by C

(i) Select D, sum(C) From R Group by D

3. We saw in class that multidimensional histograms can help determine selectivities for correlated

columns. There are several ways to choose the bucket boundaries for such histograms. Take a look at

http://www.vldb.org/conf/1997/P486.PDF and answer the following:

(a) Based on Section 5 of that paper, explain how the MHIST-2 method chooses bucket boundaries

for 2-dimensional histograms. (Two paragraphs is about right.)

(b) Based on Section 7, outline how the authors determine that MHIST-2 performs well. (Two

paragraphs is plenty.)

4. Take a look at https://www.ibm.com/docs/en/db2/11.1?topic=methods-scan-sharing and answer

the following:

(a) Explain the concept of scan sharing. What performance advantages does it offer?