数据挖掘代写 | DSCI553 Foundations and Applications of Data Mining

这个作业是用Spark完成数据挖掘相关的习题

DSCI553 Foundations and Applications of Data Mining

4. Tasks
4.1 Task1: Jaccard based LSH (2 points)
In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using
yelp_train.csv.
In this task, we focus on the “0 or 1” ratings rather than the actual ratings/stars from the users.
Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the
user hasn’t rated the business, the contribution is 0. You need to identify similar businesses whose
similarity >= 0.5.
You can define any collection of hash functions that you think would result in a consistent permutation
of the row entries of the characteristic matrix. Some potential hash functions are:
f(x)= (ax + b) % m or f(x) = ((ax + b) % p) % m
where p is any prime number and m is the number of bins. Please carefully design your hash functions.
After you have defined all the hashing functions, you will build the signature matrix. Then you will divide
the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should
carefully select a good combination of b and r in your implementation (b>1 and r>1). Remember that
two items are a candidate pair if their signatures are identical in at least one band.
Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write
the final results into a CSV file according to the output format below.
Example of Jaccard Similarity:
Input format: (we will use the following command to execute your code)
./spark-submit task1.py <input_file_name> <output_file_name>
Param: input_file_name: the name of the input file (yelp_train.csv), including the file path.
Param: output_file_name: the name of the output CSV file, including the file path.
Output format:
IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We
will not regrade because of formatting issues.
a. The output file is a CSV file, containing all business pairs you have found. The header is
“business_id_1, business_id_2, similarity”. Each pair itself must be in the alphabetical order. The entire
file also needs to be in the alphabetical order. There is no requirement for the number of decimals for
the similarity value. Please refer to the format in Figure 2.
Figure 2: a CSV output example for task1
Grading:
We will compare your output file against the ground truth file using precision and recall metrics.
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
The ground truth file has been provided in the Google drive, named as “pure_jaccard_similarity.csv”.
You can use this file to compare your results to the ground truth as well.
The ground truth dataset only contains the business pairs (from the yelp_train.csv) whose Jaccard
similarity >=0.5. The business pair itself is sorted in the alphabetical order, so each pair only appears
once in the file (i.e., if pair (a, b) is in the dataset, (b, a) will not be there).
In order to get full credit for this task you should have precision >= 0.99 and recall >= 0.97. If not, then
you will get only partial credit based on the formula:
(Precision / 0.99) * 0.4 + (Recall / 0.97) * 0.4
Your runtime should be less than 100 seconds. If your runtime is more than or equal to 100 seconds,
you will not receive any point for this task.
4.2 Task2: Recommendation System (5 points)
In task 2, you are going to build different types of recommendation systems using the yelp_train.csv to
predict the ratings/stars for given user ids and business ids. You can make any improvement to your
recommendation system in terms of the speed and accuracy. You can use the validation dataset
(yelp_val.csv) to evaluate the accuracy of your recommendation systems, but please don’t include it as
your training data.
There are two options to evaluate your recommendation systems. You can compare your results to the
corresponding ground truth and compute the absolute differences. You can divide the absolute
differences into 5 levels and count the number for each level as following: