COM2034: Information Retrieval

This is a timed assessment on Information Retrieval from a UK university.

Section A (25 marks)

A1. (Section A only includes A1)

The figure below shows the architecture of the Langville & Meyer search system that we covered in the module.

[Figure: architecture of the Langville & Meyer search system, with components labelled A to C]

Answer the following questions, showing any necessary working.

(a) Which component has most of the unstructured data in the entire system? [1 mark]

(b) State the names of components/parts A to C in the above figure. [3 marks]

(c) Describe in your own words component C and how it relates to the “Ranking Module”. [2 marks]

(d) Name a component that significantly compresses data, and name one way in which it can achieve this. [3 marks]

(e) Briefly describe the process of tokenization, and explain why it is used. [2 marks]
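
As a point of reference for (e), here is a minimal sketch of whitespace-and-punctuation tokenization in Python. Real tokenizers also deal with apostrophes, hyphenation, numbers and Unicode, all of which are omitted here.

```python
import re

def tokenize(text):
    """Minimal tokenizer: lower-case the text and split on runs of
    non-alphanumeric characters, dropping empty strings."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(tokenize("The Ranking Module re-ranks documents."))
# ['the', 'ranking', 'module', 're', 'ranks', 'documents']
```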

(f) Let us assume that we have a very large data repository. The data comprises audio, video, text and many other types of datasets. Name an additional criterion that is important to consider in order to judge whether the repository is well suited for Information Retrieval. Briefly explain your answer. Hint: “Big data”. [2 marks]

(g) Let us assume that a repository X has a vocabulary size of 200,000 words and comprises 2,000,000 documents, each of which has a vocabulary of approx. 500 words. What type of index would you not recommend, and why? [3 marks]

(h) The table below shows a binary term-document incidence matrix for a sample data collection:

Answer the following questions, clearly showing your working:

(i) Determine, using the above matrix, the documents that are relevant to the query “Spring” AND “Flower”. Write down the steps to arrive at your answer. [3 marks]

(ii) Determine, using the above matrix, the documents that are relevant to the query “Gel” AND NOT “Window”. Write down the steps to arrive at your answer. [2 marks]

(iii) Briefly explain why Cosine similarity is not well suited to identifying the most relevant documents based on binary term-document incidence matrices. [2 marks]

(iv) Would the number of zero elements in the table increase or decrease if synonyms were accounted for? Justify your answer. [2 marks]
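
The incidence matrix itself is not reproduced in this post, so the sketch below uses a small hypothetical matrix purely to illustrate how Boolean queries such as those in (h)(i) and (h)(ii) are evaluated: each term's row is treated as a bit vector and the query is answered with bitwise AND / AND NOT.

```python
# Hypothetical binary term-document incidence matrix (NOT the one from the
# exam table): rows are terms, columns are documents D1..D4.
docs = ["D1", "D2", "D3", "D4"]
incidence = {
    "spring": [1, 0, 1, 1],
    "flower": [0, 0, 1, 1],
    "gel":    [1, 1, 0, 1],
    "window": [0, 1, 0, 1],
}

def vec_and(a, b):
    return [x & y for x, y in zip(a, b)]

def vec_and_not(a, b):
    return [x & (1 - y) for x, y in zip(a, b)]

def matching_docs(vector):
    return [d for d, bit in zip(docs, vector) if bit]

# "Spring" AND "Flower": intersect the two term vectors bitwise.
print(matching_docs(vec_and(incidence["spring"], incidence["flower"])))   # ['D3', 'D4']

# "Gel" AND NOT "Window": documents containing "gel" but not "window".
print(matching_docs(vec_and_not(incidence["gel"], incidence["window"])))  # ['D1']
```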

Section B (40 marks)

B1. Heaps’ law estimates vocabulary size as a function of collection size:

M = k · T^b

where k and b are parameters, M indicates the number of terms (the vocabulary size) and T the number of tokens.
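
A minimal sketch of how the law is applied numerically: given k, a single observed (T, M) pair fixes b via b = log(M/k) / log(T), and the fitted parameters can then project the vocabulary of a smaller or larger sample, as asked for in (a) and (b) below.

```python
import math

def heaps_vocab(k, b, tokens):
    """Heaps' law: expected vocabulary size M = k * T**b."""
    return k * tokens ** b

def fit_b(k, tokens, vocab):
    """Solve M = k * T**b for b, given one observed (T, M) pair and k."""
    return math.log(vocab / k) / math.log(tokens)

# Values from B1(a): collection X, k = 40, T = 2,000,000 tokens, M = 27,386 terms.
b = fit_b(k=40, tokens=2_000_000, vocab=27_386)
print(f"b = {b:.4f}")

# Projected vocabulary for a random 1,000,000-token sample of X.
print(f"M(1M tokens) = {heaps_vocab(40, b, 1_000_000):.0f}")
```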

(a) Let us assume a collection X with 2,000,000 tokens has 27,386 terms. Given k = 40, what is b? And assuming we randomly select 1 million tokens, how many terms do we expect to obtain? Provide your calculations. [4 marks]

(b) Another collection Y with 1,675,023 tokens has the same value for k, but b = 0.55. How many terms do you estimate Y has? [1 mark]

(c) Words that occur twice in collection X make up 120 of all the word-types (= the vocabulary) in X. Approximately how many word-types do you estimate come from words that occur only once, and why? [3 marks]

(d) Collection Y has 20 documents. The word “human” occurs in 6 of these. It occurs 7 times in document A and 25 times in document B. A has 250 tokens, while B has 375 tokens. What are the tf-idf scores of “human” for these two documents, using the logarithm with base 2? [4 marks]
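
The module's exact tf-idf variant is not stated in this extract, so the following sketch assumes tf is the relative term frequency (term count divided by document length) and idf = log2(N / df); other weightings (e.g. raw or log-scaled tf) are possible and would change the numbers.

```python
import math

def tf_idf(term_count, doc_tokens, num_docs, doc_freq):
    """tf-idf with relative term frequency and a base-2 idf.
    Assumption: tf = term_count / doc_tokens, idf = log2(num_docs / doc_freq)."""
    tf = term_count / doc_tokens
    idf = math.log2(num_docs / doc_freq)
    return tf * idf

# Values from B1(d): "human" occurs in 6 of Y's 20 documents;
# 7 times in A (250 tokens) and 25 times in B (375 tokens).
print(f"A: {tf_idf(7, 250, 20, 6):.4f}")
print(f"B: {tf_idf(25, 375, 20, 6):.4f}")
```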

B2. Below is a table summarising the occurrences of different words in the documents of collection X, as well as the number of tokens in each document.

[Table omitted: word counts per document in collection X; each column is headed “Document (# of tokens)”]

(a) Randomly choose two documents and compute their cosine similarity using absolute frequencies. State the result for the pair to 4 decimal places (4 d.p.). [2 marks]

(b) Use Manhattan Distance to calculate similarity values for the two pairs of documents (B, C) and (D, F) with relative frequency values. State the distance for each pair to 4 d.p. [4 marks]

(c) Use Euclidean Distance to calculate similarity values for the two pairs of documents (C, F) and (D, E) with relative frequency values. State the distance for each pair to 4 d.p. [4 marks]
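
Since the table of counts is not reproduced here, the sketch below works on hypothetical count vectors; it shows the three computations asked for in (a)-(c): cosine similarity on absolute frequencies, and Manhattan and Euclidean distance on relative frequencies (counts divided by each document's token total).

```python
import math

def relative(counts, total_tokens):
    """Convert absolute word counts to relative frequencies."""
    return [c / total_tokens for c in counts]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical counts for two documents over the same word list
# (NOT the values from the exam table), plus their token totals.
doc_b_counts, doc_b_tokens = [3, 0, 5, 2], 120
doc_c_counts, doc_c_tokens = [1, 4, 2, 0], 90

print(f"cosine (absolute):    {cosine_similarity(doc_b_counts, doc_c_counts):.4f}")
rb = relative(doc_b_counts, doc_b_tokens)
rc = relative(doc_c_counts, doc_c_tokens)
print(f"Manhattan (relative): {manhattan(rb, rc):.4f}")
print(f"Euclidean (relative): {euclidean(rb, rc):.4f}")
```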

(d) Explain “synonymy” and “polysemy” and why they are problematic for Boolean search engines. [3 marks]

(e) In your own words, explain “semantic distance” and how it relates to “Query expansion”. [2 marks]

(f) A query in a collection X of 850 documents returns 16 documents. 3 documents are relevant but were not returned. 2 of the returned documents are not relevant. State the number of True Negatives (TN), False Negatives (FN), False Positives (FP) and True Positives (TP) for this query. [4 marks]

(g) What are the Recall and Precision values? [2 marks]

(h) What are the F(0.5), F(1) and F(2) measures (to 2 d.p.)? [3 marks]
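
A minimal sketch of the evaluation arithmetic behind (f)-(h): the confusion-matrix counts are derived from the collection size, the number of documents returned, the relevant-but-missed documents and the returned-but-irrelevant documents, and Recall, Precision and the F-beta measure follow from those counts.

```python
def confusion_counts(collection_size, returned, relevant_not_returned, returned_not_relevant):
    """Derive TP, FP, FN, TN for one query from the quantities stated in (f)."""
    fp = returned_not_relevant
    tp = returned - fp
    fn = relevant_not_returned
    tn = collection_size - tp - fp - fn
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta):
    """General F-beta measure; beta = 1 gives the harmonic mean of P and R."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Scenario from (f): 850 documents, 16 returned, 3 relevant missed, 2 returned but irrelevant.
tp, fp, fn, tn = confusion_counts(850, 16, 3, 2)
p, r = precision(tp, fp), recall(tp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision={p:.4f} Recall={r:.4f}")
for beta in (0.5, 1, 2):
    print(f"F({beta}) = {f_measure(p, r, beta):.2f}")
```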

(i) Briefly explain how using a more aggressive stemming algorithm that prunes more letters is likely to affect Recall and Precision in Information Retrieval systems. [2 marks]

(j) Accuracy is defined as (TP+TN)/(TP+TN+FP+FN).

Describe a possible query scenario where the accuracy will be high but TN will be low. [2 marks]