CSE 158/258 Fall 2021: Homework 1


Please submit your solution by the beginning of the week 3 lecture (Oct 11). Submissions should be made on Gradescope. Please complete the homework individually.

This specification includes questions from both the undergraduate (CSE158) and graduate (CSE258) classes.
You are welcome to attempt questions from both classes but will only be graded on those for the class in which you are enrolled.

You will need the following files:

GoodReads Fantasy Reviews :
Beer Reviews :

The above are JSON-formatted datasets. Data can be read using the json.loads function in Python, or by using eval.
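
As a minimal sketch of the json.loads approach, assuming the file stores one JSON object per line (the layout is not stated here, so this is an assumption), the records could be read as below. The helper name `parse_json_lines`, the field names, and the sample records are hypothetical.

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of strings, each holding one JSON object."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        records.append(json.loads(line))
    return records

# Hypothetical sample records; with a real file, pass an open file
# handle instead (e.g. gzip.open(path, "rt") if the file is gzipped).
sample = [
    '{"rating": 5, "review_text": "Great"}',
    '',
    '{"rating": 3, "review_text": "Fine"}',
]
dataset = parse_json_lines(sample)
print(len(dataset), dataset[0]["rating"])
```

If a file uses Python literal syntax (for example single-quoted strings), json.loads will reject it; in that case `ast.literal_eval` from the standard library is a safer alternative to a bare eval.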

Code examples : (regression) and http:// (classification)

Executing the code requires a working install of Python 2.7 or Python 3 with the scipy packages installed.

Please include the code of (the important parts of) your solutions.

First, using the book review data, let’s see whether ratings can be predicted as a function of review length, or by using temporal features associated with a review.

1. (CSE158 only) What is the distribution of ratings and review lengths in the dataset? Report the number of 1-, 2-, 3-star (etc.) ratings, and show the relationship with length (e.g. via a scatterplot) (1 mark).
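
One possible way to tabulate the ratings and collect (length, rating) pairs for a scatterplot is sketched below. The field names `rating` and `review_text` are assumptions about the dataset, and the toy records are invented for illustration.

```python
from collections import Counter

def rating_counts(data):
    """Count how many reviews carry each star rating."""
    return Counter(d["rating"] for d in data)

def length_rating_pairs(data):
    """Return (review length in characters, rating) pairs for plotting."""
    return [(len(d["review_text"]), d["rating"]) for d in data]

# Hypothetical toy records standing in for the real dataset.
toy = [
    {"rating": 5, "review_text": "Loved it"},
    {"rating": 5, "review_text": "Brilliant world-building"},
    {"rating": 2, "review_text": "Slow"},
]
counts = rating_counts(toy)
pairs = length_rating_pairs(toy)
for stars in sorted(counts):
    print(stars, counts[stars])
# The pairs can be handed to a plotting library, e.g. matplotlib's
# plt.scatter(lengths, ratings), to visualise the relationship.
```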

2. Train a simple predictor that estimates rating from review length, i.e.,
$\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}]$
Report the values $\theta_0$ and $\theta_1$, and the Mean Squared Error of your predictor (on the entire dataset) (1 mark).
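
A least-squares fit of this kind can be sketched with NumPy as follows. This is one reasonable approach, not necessarily the one used in the course's code examples; the function name `fit_length_model` and the toy data are hypothetical.

```python
import numpy as np

def fit_length_model(lengths, ratings):
    """Fit rating ~ offset + slope * length by least squares.

    Returns the coefficient vector [offset, slope] and the MSE
    measured on the same data used for fitting.
    """
    X = np.column_stack([np.ones(len(lengths)), lengths])
    y = np.asarray(ratings, dtype=float)
    theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((X @ theta - y) ** 2))
    return theta, mse

# Hypothetical toy data in which rating = 0.1 * length exactly.
lengths = [10, 20, 30, 40]
ratings = [1.0, 2.0, 3.0, 4.0]
theta, mse = fit_length_model(lengths, ratings)
print(theta, mse)
```

The column of ones is what produces the offset term; without it the fitted line would be forced through the origin.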

3. Extend your model to include (in addition to the length) features based on the time of the review. You can parse the time data as follows:

import dateutil.parser
t = dateutil.parser.parse(d['date_added'])
t.weekday(), t.year  # etc.
Using a one-hot encoding for the weekday and year, write down feature vectors for the first two examples (1 mark).
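
A one-hot feature vector might be built as in the sketch below. It assumes one common convention, namely dropping the first category of each variable so that it becomes the all-zeros baseline absorbed by the offset term; other conventions are possible. The helper names and the list of years are hypothetical, and the weekday and year are passed in as integers (as returned by `t.weekday()` and `t.year`).

```python
def one_hot(value, categories):
    """Indicator vector over categories[1:].

    The first category is the baseline and maps to all zeros.
    """
    return [1 if value == c else 0 for c in categories[1:]]

def feature_vector(length, weekday, year, years):
    """Offset term, review length, one-hot weekday, one-hot year."""
    return [1, length] + one_hot(weekday, list(range(7))) + one_hot(year, years)

# Hypothetical set of years; in practice, use the years seen in the data.
years = [2015, 2016, 2017]
x = feature_vector(120, 2, 2016, years)
print(x)  # [1, 120, 0, 1, 0, 0, 0, 0, 1, 0]
```

With this convention the vector has 2 + 6 + (number of years - 1) entries.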

4. Train models that

• use the weekday and year values directly as features, i.e., $\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}] + \theta_2 \times [\text{t.weekday()}] + \theta_3 \times [\text{t.year}]$

• use the one-hot encoding from Question 3.
Report the MSE of each (1 mark).
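
The two feature representations can be compared with a sketch like the one below, run here on synthetic data rather than the real reviews. All values are invented for illustration, and the one-hot encoding drops the first weekday and first year as a baseline (an assumed convention).

```python
import numpy as np

def mse_of_fit(X, y):
    """Least-squares fit of y on X; returns the MSE on the same data."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ theta - y) ** 2))

# Synthetic data: ratings get a bump on weekends (weekday 5 or 6).
rng = np.random.default_rng(0)
n = 200
length = rng.integers(10, 1000, size=n)
weekday = rng.integers(0, 7, size=n)
year = rng.integers(2010, 2018, size=n)  # years 2010..2017
y = 3.5 + 0.5 * (weekday >= 5) + rng.normal(0, 0.1, size=n)

ones = np.ones(n)
# Model A: weekday and year used directly as numbers.
X_direct = np.column_stack([ones, length, weekday, year])
# Model B: one-hot weekday and year, first category as baseline.
wd_onehot = (weekday[:, None] == np.arange(1, 7)).astype(float)
yr_onehot = (year[:, None] == np.arange(2011, 2018)).astype(float)
X_onehot = np.column_stack([ones, length, wd_onehot, yr_onehot])

mse_direct = mse_of_fit(X_direct, y)
mse_onehot = mse_of_fit(X_onehot, y)
print(mse_direct, mse_onehot)
```

On the data used for fitting, the one-hot model can never have a higher MSE than the direct model, because the raw weekday and year values are themselves linear combinations of the offset and the indicator columns.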

5. Repeat the above question, but this time split the data into a training and test set. You should split the data randomly into 50%/50% train/test fractions. Report the MSE of each model separately on the training and test sets.
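
A random 50%/50% split could be sketched as follows. The function name `train_test_split_half` and the fixed seed are illustrative choices, not part of the assignment.

```python
import random

def train_test_split_half(data, seed=0):
    """Shuffle a copy of the data and split it into two halves."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical stand-in for the list of reviews.
records = list(range(10))
train, test = train_test_split_half(records)
print(len(train), len(test))
```

The model coefficients would then be fitted on `train` only, and the MSE computed separately on `train` and on `test` using those same coefficients.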

6. (CSE258 only) Show that for a trivial predictor, i.e., $y = \theta_0$, the best possible value of $\theta_0$ in terms of the Mean Absolute Error is the median of the label $y$. Hint: compute the derivative of the model’s MAE and solve for $\theta_0$.
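
Before attempting the derivation, the claim can be sanity-checked numerically. The sketch below is not a proof; it only evaluates the MAE of a constant predictor on a grid of candidate values for a small, made-up label vector and compares the minimiser with the median.

```python
import numpy as np

# Hypothetical labels (odd count, so the median is unique).
y = np.array([1.0, 2.0, 2.0, 5.0, 9.0])

# MAE of the constant predictor for each candidate value.
candidates = np.linspace(0, 10, 1001)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)

best = candidates[np.argmin(mae)]
print(best, np.median(y))
```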