Python代写 | User-centric Systems for Data Science Assignment 1


The data we will use for this assignment consist of two CSV files: Friends and Ratings. The former
contains friendship relationships as tuples of the form UserID1 UserID2, denoting two users who are
also friends. A user can have one or more friends and the friendship relationship is symmetric: if A is a
friend of B, then B is also a friend of A and both tuples (A B and B A) are present in the file. Ratings
contains user ratings as tuples of the form UserID MovieID Rating. For example, tuple 12 3 4
means that “the user with ID 12 gave 4 stars to the movie with ID 3”.

Hint #1: You can use Python’s CSV reader to parse input files.

Hint #2: Consider encapsulating your Python tuples in ATuple objects (see code skeleton).

The first task is to implement a query that predicts how much a user will like a particular movie. The
prediction in this case is based on friends’ ratings: the ‘likeness’ of a movie M for a user A is equal to the
average rating for M as given by A’s friends. In SQL, this query can be expressed as follows:

FROM Friends as F, Ratings as R
AND F.UID1 = ‘A’ AND R.MID = ‘M’

Friends and Ratings are both relations (tables): Friends stores friendship relationships whereas
Ratings stores user ratings (cf. Section 1 “Data Schema”).

The above query requires implementing five operators:

1. Scan: An operator that scans an input file (table) and returns its contents (tuples).
2. Project: An operator that implements a SQL projection.
3. Filter: An operator that filters individual tuples based on a user-defined predicate. This is
equivalent to a selection predicate in SQL.
4. Join: An binary operator that applies a relational equality join.
5. AVG: An aggregation function that returns the average of the input values (e.g. ratings).

Hint #1: You might find useful drawing your query plan in the form of a tree, as discussed in Lecture 1.
A node in the plan is an operator that corresponds to a Python class in the code skeleton we provide.

Hint #2: We highly recommend that you employ some of the standard optimization techniques we
discussed in Lecture 1, e.g., pushing the selection (filter) down to the leaves of the query plan. Your
solution will be graded solely based on correctness, however, we should be able to run tests with your
code in a reasonable amount of time.

The second task is to implement a query that recommends a movie to a user: the movie with the higher
‘likeness’ value as computed by the query of TASK I. In SQL, this query can be expressed as follows:

FROM ( SELECT R.MID, AVG(R.Rating) as score
FROM Friends as F, Ratings as R
AND F.UID1 = ‘A’

The recommendation query requires implementing three additional operators:

1. GroupBy: An operator that groups tuples based on a ‘key’ attribute (e.g. movie id).

2. OrderBy: An operator that sorts tuples based on a value (e.g. sort candidate movies in descending
order of their average rating as given by A’s friends).

3. Limit: An operator that returns the first k tuples in its input (k=1 in the query above). An OrderBy
followed by a limit is also known as a top-k operator.
Hint: You will need a custom comparator to sort tuples. Have a look at cmp_to_key() in