# Spark代写 | CS435 Introduction to Big Data, Fall 2020

这个作业是用Spark实现一个大数据相关的PageRank系统

CS435 Introduction to Big Data, Fall 2020

Programming Assignment 3

Your implementation of this assignment should provide:

A. A sorted list of Wikipedia pages based on their ideal PageRank value in

descending order. Each row should contain the title of the article and its

PageRank value. This computation should be performed in your own Spark

cluster with 5 machines. You should present results from at least 25 iterations.

B. A sorted list (in descending order) of Wikipedia pages based on their PageRank

value with taxation. Each row should contain the title of the article and its

PageRank value. This computation should be performed on your own Spark

cluster with 5 machines. You should present results from at least 25 iterations.

C. A Wikipedia Bomb that returns the “Rocky Mountain National Park” wikipedia

page for the search key word “surfing” (see 3.C). In order to do that, you should

modify the link data file and show that the “Rocky Mountain National Park” page

generates highest PageRank among all of the pages containing “surfing” as part

of their titles.

There are many algorithms to estimate PageRank. You should use the algorithms

included in this description. Also, you are not allowed to use existing PageRank

implementations.

Demonstration of your software should be on the machines in CSB-120. Demonstration

will include an interview discussing implementation and design details. Your submission

should include your source codes and outputs. Your submission should be via Canvas.

(3) PageRank

The term PageRank comes from Larry Page, a founder of Google. PageRank is a

function that assigns a real number to each page in the Web (in our case Wikipedia

articles). A page with a higher PageRank value will be considered as a more important

page. There are many algorithms to assign PageRank. In this assignment, we will

consider two algorithms

4

: (1) idealized PageRank and (2) taxation based PageRank.

A. Idealized PageRank

Suppose that our Web is a directed graph where pages are the nodes and there is (are)

edge(s) from page p1

to page p2

. The edges between nodes represent links between

the web pages. Figure 1 depicts the links between 4 pages (A, B, C and D) as a graph.

Page A has links to each of the other three pages B, C and D. Page B has links to A and

D. Page C has a link only to A. Finally, page D has links to B and C.

Suppose a random surfer starts at page A (in Figure 1). The probability that the random

surfer will visit each of the pages B, C, and D in his next stop is 1/3. Note that if a page

contains a link to itself, it should be represented in the graph as a loop. The random

surfer that arrived at page B will likely to visit A with probability of ½.

This graph can be defined as the transition matrix of the Web (here, Wikipedia), and the

transition matrix for the previous example depicted in the Figure 1 is:

In the above matrix, the order of the pages is the natural one, A, B, C, and D. The first

column shows that a random surfer at A has probability of 1/3 of visiting each of the

other pages.

An idealized PageRank supposes that a random surfer may start from any of the n pages

with equal probability. The initial vector v0

for our previous example is (¼, ¼, ¼, ¼). After

one step, the distribution of the surfer will be Mvo

, after two steps it will be

M((Mvo

) = M2v0

, and so on. Multiplying the initial vector v0 by M a total of i times will

provide us the distribution of the surfer after i steps.

After enough number of iterations, the vector will show little change at each round. For

this assignment your iteration count is as specified earlier.

B. Taxation

Wikipedia contains dead-end articles

5

. A dead-end article is a page that contains no

links to other Wikipedia articles. Since there is no link to go out, once a random surfer

visits the dead-end article, this random surfer never leaves the article. This often results

in 0 probabilities in many of the pages as the number of iteration increases.

To deal with dead-end articles, the PageRank algorithm modifies the idealized process

by which random surfers are assumed to move without following the links. This method

is referred to as “taxation”. Taxation allows each random surfer to have a small

probability of teleporting to a random page, rather than following an out-link from their

current page.

With taxation, the new vector for estimating the PageRank will be,

where β is a chosen constant specifying the probability that the random surfer decides

to follow an out-link from their present page. Usually in the range of 0.8 to 0.9, e is a

vector of all 1’s with the appropriate number of components (i.e. Wikipedia pages), and

n is the number of nodes in the Web graph. In this assignment, you should use β as

0.85.

You should perform this processing using Apache Spark. Please ignore all of the

possible errors due to the data segmentation by HDFS.

C. Wikipedia Bomb

The term Wikipedia Bomb is derived from the Google bombs for this assignment. The

Google bomb refers to the practice of causing a web site to rank highly in web search

engine results for unrelated or off-topic search terms by linking heavily

6

.

In this assignment, you should generate a Wiki Bomb for “surfing” in Wikipedia and the

result should be the “Rocky Mountain National Park” page in Wikipedia.

To generate this, you are allowed to modify the link data file. You should show the

results and discuss during the demo.