
This assignment implements a big-data PageRank system using Spark.
CS435 Introduction to Big Data, Fall 2020
Programming Assignment 3

Your implementation of this assignment should provide:
A. A sorted list of Wikipedia pages based on their idealized PageRank value, in
descending order. Each row should contain the title of the article and its
PageRank value. This computation should be performed on your own Spark
cluster with 5 machines. You should present results from at least 25 iterations.
B. A sorted list (in descending order) of Wikipedia pages based on their PageRank
value with taxation. Each row should contain the title of the article and its
PageRank value. This computation should be performed on your own Spark
cluster with 5 machines. You should present results from at least 25 iterations.
C. A Wikipedia Bomb that returns the “Rocky Mountain National Park” Wikipedia
page for the search keyword “surfing” (see 3.C). In order to do that, you should
modify the link data file and show that the “Rocky Mountain National Park” page
generates the highest PageRank among all of the pages containing “surfing” as
part of their titles.
There are many algorithms to estimate PageRank. You should use the algorithms
included in this description. Also, you are not allowed to use existing PageRank
implementations.
Demonstration of your software should be on the machines in CSB-120. The
demonstration will include an interview discussing implementation and design
details. Your submission, via Canvas, should include your source code and outputs.
(3) PageRank
The term PageRank comes from Larry Page, a founder of Google. PageRank is a
function that assigns a real number to each page in the Web (in our case, Wikipedia
articles). A page with a higher PageRank value is considered more important.
There are many algorithms to assign PageRank. In this assignment, we will
consider two: (1) idealized PageRank and (2) taxation-based PageRank.
A. Idealized PageRank
Suppose that our Web is a directed graph where pages are the nodes and there is
an edge from page p1 to page p2 whenever page p1 links to page p2. Figure 1
depicts the links between 4 pages (A, B, C, and D) as a graph. Page A has links to
each of the other three pages B, C, and D. Page B has links to A and D. Page C has
a link only to A. Finally, page D has links to B and C.
Suppose a random surfer starts at page A (in Figure 1). The probability that the
random surfer will visit each of the pages B, C, and D in the next step is 1/3. Note
that if a page contains a link to itself, it should be represented in the graph as a
self-loop. A random surfer who has arrived at page B will next visit A with
probability 1/2 (and D with probability 1/2).
This graph can be described by the transition matrix of the Web (here, Wikipedia).
The transition matrix M for the example depicted in Figure 1 is:

M = \begin{pmatrix} 0 & 1/2 & 1 & 0 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 0 & 0 & 1/2 \\ 1/3 & 1/2 & 0 & 0 \end{pmatrix}

In this matrix, the order of the pages is the natural one: A, B, C, and D. The first
column shows that a random surfer at A has a probability of 1/3 of visiting each of
the other pages.
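In Spark, the full transition matrix is far too large to materialize for Wikipedia; a common approach is to store each column sparsely as an adjacency list of (page, out-links) pairs. Below is a minimal sketch of building that structure, assuming a tab-separated "source<TAB>target" edge on each input line; the actual format of the provided link data file may differ, and all names here are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object TransitionStructure {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("TransitionStructure").getOrCreate()
    val sc = spark.sparkContext

    // One "source<TAB>target" edge per line (assumed input format).
    val edges = sc.textFile(args(0))
      .map(_.split("\t"))
      .collect { case Array(src, dst) => (src, dst) }

    // Column of M for each page, stored sparsely: the page and its out-links.
    // Each out-link will later receive rank(page) / outDegree(page).
    val links = edges.groupByKey().cache()

    println(s"pages with out-links: ${links.count()}")
    spark.stop()
  }
}
```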
An idealized PageRank supposes that a random surfer may start from any of the n
pages with equal probability. The initial vector v_0 for our previous example is
(1/4, 1/4, 1/4, 1/4). After one step, the distribution of the surfer will be M v_0;
after two steps it will be M(M v_0) = M^2 v_0, and so on. Multiplying the initial
vector v_0 by M a total of i times gives the distribution of the surfer after i steps.
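As a concrete check, the first step for the four-page example (using the matrix M above) works out to:

\[
v_1 = M v_0 =
\begin{pmatrix}
0 & 1/2 & 1 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{pmatrix}
\begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix}
=
\begin{pmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{pmatrix}
\]

Page A already pulls ahead after one step.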
After a sufficient number of iterations, the vector shows little change from one
round to the next. For this assignment, use the iteration count specified earlier
(at least 25).
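A minimal sketch of the idealized iteration in Spark, reusing the links adjacency-list RDD from the earlier sketch (the function name and defaults are illustrative, not prescribed by the assignment):

```scala
import org.apache.spark.rdd.RDD

// Idealized PageRank: compute v_{i+1} = M v_i sparsely, by having each
// page split its current rank evenly among its out-links.
def idealizedPageRank(links: RDD[(String, Iterable[String])],
                      iterations: Int = 25): RDD[(String, Double)] = {
  val n = links.count()
  var ranks = links.mapValues(_ => 1.0 / n)        // v_0 = (1/n, ..., 1/n)
  for (_ <- 1 to iterations) {
    val contribs = links.join(ranks).values.flatMap {
      case (outs, rank) => outs.map(dst => (dst, rank / outs.size))
    }
    ranks = contribs.reduceByKey(_ + _)            // sum of incoming rank mass
  }
  ranks.sortBy(-_._2)                              // descending PageRank value
}
```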
B. Taxation
Wikipedia contains dead-end articles. A dead-end article is a page that contains no
links to other Wikipedia articles. Since there are no outgoing links, once a random
surfer visits a dead-end article, the surfer never leaves it. As the number of
iterations increases, this often drives the probabilities of many pages to 0.
To deal with dead-end articles, the PageRank algorithm modifies the idealized
process so that random surfers are no longer assumed to always follow links. This
method is referred to as “taxation”. Taxation allows each random surfer a small
probability of teleporting to a random page, rather than following an out-link from
the current page.
With taxation, the new vector for estimating the PageRank will be

v' = \beta M v + (1 - \beta) \frac{e}{n}

where β is a chosen constant, usually in the range of 0.8 to 0.9, specifying the
probability that the random surfer decides to follow an out-link from the present
page; e is a vector of all 1’s with the appropriate number of components (i.e., the
number of Wikipedia pages); and n is the number of nodes in the Web graph. In
this assignment, you should use β = 0.85.
You should perform this processing using Apache Spark. Please ignore any errors
that may result from the data segmentation performed by HDFS.
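Under the same assumptions as the earlier sketches, the taxed update can be written as below. Note that a complete implementation must also handle pages that receive no contributions in an iteration (they still earn the (1 − β)/n teleport share) and the rank mass lost at dead ends; this sketch glosses over both.

```scala
import org.apache.spark.rdd.RDD

// Taxation-based PageRank: v' = beta * M * v + (1 - beta) * e / n,
// with beta = 0.85 as the assignment requires.
def taxedPageRank(links: RDD[(String, Iterable[String])],
                  beta: Double = 0.85,
                  iterations: Int = 25): RDD[(String, Double)] = {
  val n = links.count()
  var ranks = links.mapValues(_ => 1.0 / n)
  for (_ <- 1 to iterations) {
    val contribs = links.join(ranks).values.flatMap {
      case (outs, rank) => outs.map(dst => (dst, rank / outs.size))
    }
    ranks = contribs.reduceByKey(_ + _)
      .mapValues(inMass => beta * inMass + (1 - beta) / n) // teleport share
  }
  ranks.sortBy(-_._2)
}
```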
C. Wikipedia Bomb
For this assignment, the term Wikipedia Bomb is derived from Google bombs. A
Google bomb refers to the practice of causing a web site to rank highly in web
search engine results for unrelated or off-topic search terms by linking to it
heavily.
In this assignment, you should generate a Wikipedia Bomb for “surfing” in
Wikipedia, and the result should be the “Rocky Mountain National Park” page. To
generate this, you are allowed to modify the link data file. You should show the
results and discuss them during the demo.
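The bomb itself is built in the data, not the code: the link file is edited so that the target page accumulates rank (for example, by adding links from “surfing”-related pages to it). A hypothetical check of the outcome, assuming ranks comes from the taxation sketch run on the modified link file, might look like:

```scala
import org.apache.spark.rdd.RDD

// Verify the Wikipedia Bomb: the target page should out-rank every page
// whose title contains "surfing".
def bombSucceeded(ranks: RDD[(String, Double)]): Boolean = {
  val target = ranks.lookup("Rocky Mountain National Park")
    .headOption.getOrElse(0.0)
  val bestSurfing = ranks
    .filter { case (title, _) => title.toLowerCase.contains("surfing") }
    .map(_._2)
    .fold(0.0)((a, b) => math.max(a, b)) // 0.0 if no matching titles
  target >= bestSurfing
}
```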