Project Statement for Milestone 3

In this milestone, you need to finish all required functions for the candidate project you chose
except the visualization part. You need to submit the printed version of your Jupyter notebook.
This notebook should show the code of each required function and the output of the code. This
milestone is worth 10% of your overall grade.

To save your notebook as a PDF, right-click the notebook and choose “Print” -> “Save as PDF”. If
you cannot save it as a PDF, you can take screenshots and paste them into a Word document.

Project 1: YouTube Analyzer (10%)

Video search (6%). You must use PySpark for the following functions; Pandas and other
Python libraries are not allowed.

– (2%) Categorized statistics: frequency histogram of videos partitioned by a search
condition such as category, video size, or view count. For example, the count of videos
per category.

– (2%) Top-k queries: (1) find the top k categories with the most uploaded videos;
(2) find the top k most-viewed videos.

– (2%) Range queries: (1) find all videos in category X with duration in the range [t1,
t2]; (2) find all videos with size in the range [x, y].

Graph analytics (4%). You must use Spark GraphX or GraphFrames.

Hint: you can use the PySpark “explode” function to create a proper edge DataFrame for
GraphFrames:
https://stackoverflow.com/questions/40099706/splitting-a-row-in-a-pyspark-dataframe-into-multiple-rows

– (2%) Network aggregation: report the following statistics of the YouTube video network:
(1) the in-degree and out-degree of each video; (2) the average, maximum, and minimum
degree across the video dataset.

– (2%) Top-k influence analysis: use the PageRank algorithm over the YouTube network to
compute the scores efficiently. Intuitively, a video with a high PageRank score is
related to many videos in the graph and thus has high influence. Find the top k
most influential videos in the YouTube network. Choose the initial values and number of
iterations as you see fit.