Neo4j代写 | Romeo and Juliet friendship graph
本次作业案例分享主要内容是完成一个好友推荐系统的Neo4j代写
1. BACKGROUND
Computer systems that make recommendations are called recommender systems. Netflix recommends movies you may like to watch, Amazon recommends products you may be interested in purchasing, and Facebook recommends people you may want to contact and become friends with. The actual algorithms used by these companies to make such suggestions are closely guarded trade secrets, and they are not available for public scrutiny. However, in this assignment, you will have an opportunity to learn the basics behind recommender systems. You will use Facebook as a case study for recommender systems. First, you will have to create a database to store Facebook users and such a database will have to be a Neo4j graph database.
2. FRIENDSHIP GRAPHS
Figure 1 displays the relationships (friendships) graph for the main characters in Romeo and Juliet, the tragedy published by William Shakespeare in 1597.
A link between character A and character B in Figure 1 means that A considers B a friend, and also B considers A a friend—these links are undirected, despite the arrows. Note that Figure 1 shows the characters who belong to the House of Montague in orange, the characters who belong to Ruling House of Verona in green, the characters who belong to the House of Capulet in pink, the Franciscans in purple, and the Nurse who works for the House of Capulet in blue. The colours are just to help us interpret the graph—colours are not mandatory in your assignment.
Figure 1. Romeo and Juliet friendship graph
3. FRIEND RECOMMENDATIONS
Formally speaking, recommending a new friend is equivalent to answering the following question: “For user X, list some non-friends of X in order, starting with the best friend recommendation and ending with the worst.” A non-friend is a user who is not X and is not a friend of X. For example, for the user Mercutio in Figure 1, the list of non-friends is:
• Benvolio
• Capulet
• Friar Lawrence
• Juliet
• Montague
• The Nurse
• Tybalt
3.1. Recommend by number of common friends
If non-friend Y is your friend’s friend, then maybe Y should be your friend too. If user Y is the friend of many of your friends, then Y is an even better recommendation. The best friend recommendation is the user with whom you have the largest number of mutual friends. For example, consider Mercutio in Figure 1.
• Mercutio has one friend in common with Benvolio (Romeo).
• Mercutio has two friends in common with Capulet (Paris and Prince Escalus).
• Mercutio has one friend in common with Friar Lawrence (Romeo).
• Mercutio has one friend in common with Juliet (Romeo).
• Mercutio has two friends in common with Montague (Prince Escalus and Romeo).
• Mercutio has no friends in common with the Nurse.
• Mercutio has no friends in common with Tybalt.
Therefore, Capulet and Montague are the best friend recommendations for Mercutio, and the Nurse and Tybalt are the worst recommendations.
Note that the recommendations might not be symmetric: the best friend recommendation for Montague might be Mercutio, but the best friend recommendation for Mercutio might be Capulet, rather than Montague.
3.2. Recommend by influence scoring
We will now study a different algorithm. Consider the following hypothetical situation.
• Two of your friends are J. K. Rowling and James Corden.
• J. K. Rowling has only two friends (you and one other person).
• James Corden has 7,000,000,000 friends.
• J. K. Rowling and James Corden have no friends in common (apart from you).
Since J. K. Rowling is highly selective in terms of friendship, and is a friend of yours, you are likely to have a lot in common with J. K. Rowling’s other friend. On the other hand, James Corden is indiscriminate and there is little reason to believe that you should be friendly with any particular one of James Corden’s other friends.
Let us incorporate the above idea into a friend recommendation algorithm. Here is the concrete way to do so—note that we call the technique “influence scoring”: Suppose that PersonA and PersonB have three friends in common: Friend1, Friend2, and Friend3. When recommending friends by influence scoring, the score for PersonB as a friend of PersonA is 1/numfriends(Friend1)
+ 1/numfriends(Friend2) + 1/numfriends(Friend3), where numfriends(f) is the number of friends that f has. In other words, each friend F of PersonA has a total influence score of 1 to contribute, and divides it equally among all of her friends.
In the example above, J. K. Rowling’s other friend would have a score of 1/2, and each of James Corden’s friends would have a score of 1/7000000000.
3.3. Algorithm comparison
Does the change of recommendation algorithm make any difference? As you should be able to find out, Mercutio gets the same friend recommendations with both recommendation algorithms (number of common friends and influence scoring). In fact, there are 5 characters in the Romeo and Juliet scenario for whom the recommendations are the same, regardless of the algorithm. However, there are 6 characters for whom the recommendations are different.
4. THE DATASET
The assignment employs a Facebook dataset provided as a courtesy by the Max Planck Institute for Software Systems. The data is described in more detail in the following publication:
Viswanath, B., Mislove, A., Cha, M. and Gummadi, K.P., 2009, August. On the Evolution of User Interaction in Facebook. In Proceedings of the 2nd ACM Workshop on Online Social Networks (pp. 37-42). ACM.
The Facebook dataset contains, exactly, 63,731 Facebook users—no duplicates are included— and 817,090 friendship relationships among them—this is the total number of links established via friendships. Do not be alarmed if importing the Facebook dataset into Neo4j takes a while. The number of friendships is large! However, if it takes longer than 10 minutes, it probably means that you are doing something wrong—perhaps, you did not index the right attributes in your database. Also, do not try to draw the whole Facebook graph at once—this may cause your computer to “freeze”. Besides, even if you successfully draw the entire graph, you would not learn much from a tangled mess of 817,090 links.
The Facebook dataset for your coursework can be downloaded from the module DLE website under the Coursework Section—look for a number of files stored in a folder named Facebook Dataset. Although the dataset is spread across different files, you are expected to import every single data item into the database that you will submit—marks will be deducted for failing to include all the data provided.
4. THE PROBLEM
4.1. Recommend by number of common friends
For every Facebook user in the dataset with an ID number that is a multiple of 980, provide a list containing the first 10 friend recommendations, as determined by the number of common friends. If there are fewer than 10 recommendations, provide all the recommendations. For each case, you will have to provide the Neo4j CQL query that generated the friend recommendations, as well as the actual recommendations. Note that the recommendations alone will not receive any marks, and the Neo4j CQL queries must be provided in text format (screenshots will be awarded zero marks). Also, note that your queries will be tested. Hence, queries that do not produce the submitted results will invalidate the answers and zero marks will be awarded.
An example of the format in which your results should be delivered is listed below in Table 1. Please, note that the CQL query and the recommendations listed in Table 1 are shown only to illustrate the format, and they are not actual solutions to the problem in question.
User ID CQL Query Friend recommendation
… … …
13720 Match (friend:FacebookUser {id: ‘13790’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend 17125, 7033, 15462,
33049, 51105, 16424,
23, 7996, 1539,
17420
14700 Match (friend:FacebookUser {id: ‘14775’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend 14473, 14495, 17951,
19611, 22749, 23259,
30002, 3154, 8269,
862
15680 Match (friend:FacebookUser {id: ‘15760’})-[r:FRIENDS_WITH]-(
user:FacebookUser) RETURN friend 28606
… … …
4.2. Recommend by influence scoring
For every Facebook user in the dataset with an ID that is a multiple of 980, provide a list containing the first 10 friend recommendations, as determined by influence scoring. If there are fewer than 10 recommendations, provide all the recommendations.
Once again, the output format should be the same shown in Table 1. You are expected to provide the Neo4j CQL query that generated the friend recommendations and the actual recommendations. Note that the recommendations alone will not receive any marks, and the Neo4j CQL queries must be provided in text format (screenshots will be awarded zero marks). Also, note that your queries will be tested. Hence, queries that do not produce the submitted results will invalidate the answers and zero marks will be awarded.
4.3. Algorithm comparison
Considering only those 65 Facebook users with an ID that is a multiple of 980, identify the Facebook users who have the same first 10 friend recommendations under both recommendation algorithms, and the Facebook users who have different first 10 friend recommendations under the two algorithms. Your answers should appear in the format depicted in Table 2 (see the following page. Please, note that the answers listed in Table 2 are shown only to illustrate the format in which you should submit the solutions; yet, they may not coincide with the actual solutions to the problem in question.
The code used for comparing the output of both algorithms has to be implemented using C# or Python (no other choices will be allowed), and it has to be included as part of your submission. If you do it manually, without using Python or C#, your solution will not be accepted and zero marks will be awarded.
4.4. Report: Which algorithm and database model do you propose to use?
Produce a report (1,000 words maximum) to explain which algorithm you propose to recommend new friends in Facebook, and whether you will implement it using a relational or non-relational database. Your report must be supported by evidence derived from the output you obtained in Section 4.1, Section 4.2 and Section 4.3. For example, if you propose to use influence scoring, you will have to use the results of the previous sections to explain why this is the best option.
IDs of users having the same output under both algorithms:
1960
4900
5880
…
IDs of users having different output under both algorithms:
980
2940
3920
…
Before reaching a conclusion regarding which algorithm you wish to propose, you may want to compare both algorithms using more than the 65 Facebook users whose ID is a multiple of 980, but you will have to refer to the additional results as evidence to support your decision. It may also be the case that you propose a combination of the two algorithms explored above. While this is a suitable alternative, you have to be very clear in terms of explaining how exactly you attempt to combine the algorithms (which one you will apply first, how much weight is assigned to each algorithm, what happens if the algorithms agree or disagree, etc.).
Finally, you may want to combine the two algorithms explored with other ideas of your own (or ideas you have researched separately). You will still have to be very clear in terms of how your proposed algorithm will work and what evidence you present to support this.
5. DELIVERABLES
Submit the following two files via the DLE.
(1) A WORD document containing (documents that are not in WORD format will not be reviewed and zero marks will be awarded):
(a) The solutions for Section 4.1 in the format illustrated by Table 1.
(b) The solutions for Section 4.2. Once again, the solutions should be submitted following the format illustrated by Table 1.
(c) The solutions for Section 4.3. The solutions should be submitted following the format of Table 2.
(d) A report explaining which algorithm you propose to use to make friend recommendations and whether you will implement it using a relational or non- relational database.
(2) The C# or Python code employed to obtain the answers for Section 4.3.
6. ASSESSMENT AND GRADE CRITERIA
Task Marks
1 Recommend by number of common friends
Explanation: See Section 4.1.
Note: This is a core module for the MSc Data Science and Business Analytics; thus, if the submitted Neo4j CQL queries do not execute, marks will not be granted.
Learning Outcomes: ALO2, ALO3. 25
2 Recommend by influence scoring
Explanation: See Section 4.2.
Note: This is a core module for the MSc Data Science and Business Analytics; thus, if the submitted Neo4j CQL queries do not execute, marks will not be granted.
Learning Outcomes: ALO2, ALO3. 25
3 Algorithm comparison (answers) and the source code (C# or Python) employed to compare the results of both algorithms.
Explanation: See Section 4.3.
Note: The code you submit is expected to execute correctly. If the submitted code does not compile and execute, the mark for this part of the assignment will be zero (0).
Learning Outcomes: ALO2, ALO3. 25
4 Report: Which algorithm and database model do you propose to use?
Explanation: See Section 4.4. Learning Outcomes: ALO1, ALO2. 25
7. EXTENUATING CIRCUMSTANCES AND PLAGIARISM
Please note that the University enforces a penalty of zero percent for work submitted after the published deadline without valid extenuating circumstances (ECs).
Also, note that this is an individual assignment and must reflect your own work. Thus, while you may discuss, in general terms, the assignment with your fellow students, the assignment MUST be your own work. Do not share designs or code with anyone, OR submit a program design or code that is, wholly or partially, someone else’s work.
Marks will be awarded according to the following criteria
Mark Grade Criteria
Unprofessional (0-39%) The quality of the work has not met the learning outcomes. Understanding and application of fundamental concepts and techniques is questionable.
Work of this quality would not be acceptable in professional employment.
Poor (40-49%) The quality of work has only met the threshold level but still requires further work to get it to a better standard. Your submission contains logical and analytical errors related to analysis and design techniques. Also, it only demonstrates a basic understanding of the subject competence. Further
improvement is required to demonstrate personal thoroughness, effort and independent learning.
Fair (50-59%) The quality of work submitted suggests that you have demonstrated a fair understanding of the analysis and design techniques. Still, the work you have
submitted contains some errors and incomplete analysis and design. Also, it demonstrates you are able to apply your knowledge but need to improve
understanding of the subject competence and personal thoroughness, effort
and independent learning.
Good (60-69%) The quality of the work submitted suggests that you are able to apply the analysis and design techniques well. The work you have submitted is substantially correct and complete. Also, it demonstrates a good understanding of subject competence and personal thoroughness, effort and
independent learning.
Excellent (70-100%) The quality of work is outstanding with no significant flaws. It demonstrates a high level of subject knowledge and competence; personal thoroughness, effort and independent learning; and possibly significant additional
analytical/critical thought. Well done!