Assignment 4: Multi-Source Personalized PageRank due 2:30pm February 28

For this assignment, you will be working in the same repo as before, except that everything should go into the package namespace ca.uwaterloo.cs451.a4.

Begin by taking the time to understand the PageRank reference implementation in Bespin (particularly RunPageRankBasic). For this assignment, you are going to implement multi-source personalized PageRank. As we discussed in class, personalized PageRank differs from ordinary PageRank in a few respects:

  - There is the notion of a source node, which is what we are computing the personalization with respect to.
  - When initializing PageRank, instead of a uniform distribution across all nodes, the source node gets a mass of one and every other node gets a mass of zero.
  - Whenever the random walker makes a random jump, it always jumps back to the source node.
  - All mass lost at dangling nodes is put back into the source node.

Here are some publications about personalized PageRank if you're interested. They're just provided for background; neither is necessary for completing the assignment.

Your implementation is going to run multiple personalized PageRank computations in parallel, one with respect to each source. The sources will be specified on the command line. This means that each PageRank node object (i.e., Writable) is going to contain an array of PageRank values.
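As a sketch, the node record might carry an array of PageRank values alongside the adjacency list. The class and field names below are hypothetical, not from the reference code; the write/readFields pair mirrors Hadoop's Writable contract but uses only java.io so it can be shown standalone (the real class would implement org.apache.hadoop.io.Writable):

```java
import java.io.*;

// Hypothetical sketch of a multi-source PageRank node record. The write and
// readFields methods mirror Hadoop's Writable contract; the real class would
// implement org.apache.hadoop.io.Writable.
public class MultiSourceNode {
  public int nodeId;
  public float[] pagerank;   // one value per source, in -sources order
  public int[] adjacency;    // outgoing edges

  public void write(DataOutput out) throws IOException {
    out.writeInt(nodeId);
    out.writeInt(pagerank.length);
    for (float p : pagerank) out.writeFloat(p);
    out.writeInt(adjacency.length);
    for (int v : adjacency) out.writeInt(v);
  }

  public void readFields(DataInput in) throws IOException {
    nodeId = in.readInt();
    pagerank = new float[in.readInt()];
    for (int i = 0; i < pagerank.length; i++) pagerank[i] = in.readFloat();
    adjacency = new int[in.readInt()];
    for (int i = 0; i < adjacency.length; i++) adjacency[i] = in.readInt();
  }
}
```

Writing the array length before the values lets records with different numbers of sources (and adjacency lists of different sizes) round-trip through serialization cleanly.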

Start by working with the Gnutella graph (a snapshot of the Gnutella peer-to-peer file sharing network from August 2002), the same as the one used in the Bespin demo:

$ mkdir data
$ curl http://lintool.github.io/bespin-data/p2p-Gnutella08-adj.txt > data/p2p-Gnutella08-adj.txt

Here's how the implementation is going to work: it largely follows the reference implementation of RunPageRankBasic. You must make your implementation work with respect to the command-line invocations specified below.

First, convert the adjacency list into PageRank node records:

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
   -input data/p2p-Gnutella08-adj.txt -output cs451-bigdatateach-a4-Gnutella-PageRankRecords \
   -numNodes 6301 -sources 367,249,145

The -sources option specifies the source nodes for the personalized PageRank computations. In this case, we're running three computations in parallel, with respect to node ids 367, 249, and 145. You can expect the option value to be in the form of a comma-separated list, and that all node ids actually exist in the graph. The list of source nodes may be arbitrarily long, but for practical purposes we won't test your code with more than a few.

Since we're running three personalized PageRank computations in parallel, each PageRank node is going to hold an array of three values, the personalized PageRank values with respect to the first source, second source, and third source. You can expect the array positions to correspond exactly to the position of the node id in the source string.
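For instance, a minimal way to turn the option value into an array of node ids, where each id's position doubles as its index into the per-node score array (class and method names hypothetical):

```java
import java.util.*;

// Sketch: parse the -sources option value into node ids. The position of an
// id in the returned array is the position of its PageRank value in each
// node's array of scores.
public class Sources {
  public static int[] parse(String option) {
    String[] parts = option.split(",");
    int[] ids = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      ids[i] = Integer.parseInt(parts[i].trim());
    }
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(parse("367,249,145")));
  }
}
```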

Next, partition the graph (hash partitioning) and get ready to iterate:

$ hadoop fs -mkdir cs451-bigdatateach-a4-Gnutella-PageRank

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.PartitionGraph \
   -input cs451-bigdatateach-a4-Gnutella-PageRankRecords \
   -output cs451-bigdatateach-a4-Gnutella-PageRank/iter0000 -numPartitions 5 -numNodes 6301
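Hash partitioning here just means assigning each node id to a partition by its hash: mask off the sign bit so the result is non-negative, then take the remainder modulo the number of partitions. This is the same scheme as Hadoop's default HashPartitioner; a standalone sketch:

```java
// Sketch of hash partitioning a node id across numPartitions reducers,
// following the same scheme as Hadoop's default HashPartitioner:
// (hash & Integer.MAX_VALUE) % numPartitions.
public class HashPartitionDemo {
  public static int partition(int nodeId, int numPartitions) {
    return (Integer.hashCode(nodeId) & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    for (int id : new int[]{367, 249, 145}) {
      System.out.println(id + " -> partition " + partition(id, 5));
    }
  }
}
```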

After setting everything up, iterate multi-source personalized PageRank:

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
   -base cs451-bigdatateach-a4-Gnutella-PageRank -numNodes 6301 -start 0 -end 20 -sources 367,249,145

Note that the sources are passed in on the command line again. Here, we're running twenty iterations.
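To make the per-iteration logic concrete, here is a hypothetical in-memory, single-source sketch of one update (not the MapReduce code, and assuming a random-jump probability of 0.15): mass flows along out-links, random jumps always return to the source, and dangling mass is folded back into the source:

```java
// Hypothetical in-memory, single-source sketch of one personalized PageRank
// update (not the MapReduce code). Assumes a random-jump probability of 0.15.
public class TinyPersonalizedPageRank {
  static final float ALPHA = 0.15f;  // random-jump probability (assumed)

  // adj[u] lists the out-neighbors of node u; mass is the current
  // distribution; source is the personalization node.
  public static float[] iterate(int[][] adj, float[] mass, int source) {
    int n = adj.length;
    float[] next = new float[n];
    float dangling = 0f;
    for (int u = 0; u < n; u++) {
      if (adj[u].length == 0) {        // dangling node: no out-links
        dangling += mass[u];
        continue;
      }
      float share = mass[u] / adj[u].length;
      for (int v : adj[u]) next[v] += share;
    }
    for (int u = 0; u < n; u++) next[u] *= (1 - ALPHA);
    // Random jumps and dangling mass both return to the source node.
    next[source] += ALPHA + (1 - ALPHA) * dangling;
    return next;
  }

  public static void main(String[] args) {
    int[][] adj = {{1, 2}, {2}, {}};   // node 2 is dangling
    float[] mass = {1f, 0f, 0f};       // all mass starts at the source
    float[] next = iterate(adj, mass, 0);
    for (int u = 0; u < next.length; u++) {
      System.out.println(u + " " + next[u]);
    }
  }
}
```

A useful invariant to check against your MapReduce implementation: after each iteration, the masses for a given source should still sum to one.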

Finally, run a program to extract the top ten personalized PageRank values, with respect to each source.

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
   -input cs451-bigdatateach-a4-Gnutella-PageRank/iter0020 -output cs451-bigdatateach-a4-Gnutella-PageRank-top10 \
   -top 10 -sources 367,249,145

The above program should print the following answer to stdout:

Source: 367
0.35257 367
0.04181 264
0.03889 266
0.03888 559
0.03883 5
0.03860 1317
0.03824 666
0.03817 7
0.03799 4
0.00850 249

Source: 249
0.34089 249
0.04034 251
0.03721 762
0.03688 123
0.03686 250
0.03668 753
0.03627 755
0.03623 351
0.03622 983
0.00949 367

Source: 145
0.36937 145
0.04195 1317
0.04120 265
0.03847 390
0.03606 367
0.03566 246
0.03525 667
0.03519 717
0.03513 149
0.03496 2098

Additional Specifications

To make the final output easier to read, in the class ExtractTopPersonalizedPageRankNodes, use the following format to print each (personalized PageRank value, node id) pair:

String.format("%.5f %d", pagerank, nodeid)

This will generate the final results in the same format as above. Also note: print actual probabilities, not log probabilities; during the PageRank computation itself, keep values as log probabilities.
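Adding two probabilities that are stored as logs requires a numerically stable log-sum-exp; the Bespin reference has a similar utility. A standalone sketch of such a helper:

```java
// Sketch: numerically stable addition of two log-probabilities,
// computing log(exp(a) + exp(b)) without underflow.
public class LogProbs {
  public static float sum(float a, float b) {
    if (a == Float.NEGATIVE_INFINITY) return b;
    if (b == Float.NEGATIVE_INFINITY) return a;
    // Factor out the larger exponent so Math.exp never overflows.
    if (a > b) return (float) (a + Math.log1p(Math.exp(b - a)));
    return (float) (b + Math.log1p(Math.exp(a - b)));
  }

  public static void main(String[] args) {
    float logQuarter = (float) Math.log(0.25);
    // 0.25 + 0.25 = 0.5, computed entirely in log space.
    System.out.println(Math.exp(sum(logQuarter, logQuarter)));
  }
}
```

Converting back for the final output is then just Math.exp on each stored value.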

The final class ExtractTopPersonalizedPageRankNodes does not need to be a MapReduce job (but it does need to read from HDFS). Obviously, the other classes need to run MapReduce jobs.
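One plain-Java way to select the top entries is a bounded min-heap; in the real class you would populate it while scanning the part files of the final iteration off HDFS, once per source index. A sketch (class and method names hypothetical):

```java
import java.util.*;

// Sketch: select the top-k (score, node id) pairs with a bounded min-heap.
// The node id is stored as a float here only for brevity of the sketch.
public class TopK {
  public static List<float[]> select(Map<Integer, Float> scores, int k) {
    PriorityQueue<float[]> heap =
        new PriorityQueue<>(k, (x, y) -> Float.compare(x[0], y[0]));
    for (Map.Entry<Integer, Float> e : scores.entrySet()) {
      heap.offer(new float[]{e.getValue(), e.getKey()});
      if (heap.size() > k) heap.poll();  // evict the current minimum
    }
    List<float[]> out = new ArrayList<>(heap);
    out.sort((x, y) -> Float.compare(y[0], x[0]));  // descending by score
    return out;
  }

  public static void main(String[] args) {
    Map<Integer, Float> scores = new HashMap<>();
    scores.put(367, 0.35257f);
    scores.put(264, 0.04181f);
    scores.put(249, 0.00850f);
    for (float[] pair : select(scores, 2)) {
      System.out.println(String.format("%.5f %d", pair[0], (int) pair[1]));
    }
  }
}
```

The heap keeps memory bounded at k entries no matter how many nodes you scan, which matters on the Wikipedia graph.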

The reference implementation of PageRank in Bespin has many options: you can either use in-mapper combining or ordinary combiners. In your implementation, use ordinary combiners. Also, the reference implementation has an option to either use range partitioning or hash partitioning: you only need to implement hash partitioning. You can start with the reference implementation and remove code that you don't need (see #2 below).

Hints and Suggestions

To help you out, there's a small helper program in Bespin that computes personalized PageRank using a sequential algorithm. Use it to check your answers:

$ hadoop jar target/bespin-1.0.5-SNAPSHOT-fatjar.jar io.bespin.java.mapreduce.pagerank.SequentialPersonalizedPageRank \
   -input data/p2p-Gnutella08-adj.txt -source 367

Note that this isn't actually a MapReduce job; we're simply using Hadoop to run the main for convenience. The values from your implementation should be pretty close to the output of the above program, but might differ a bit due to convergence issues. After 20 iterations, the output of the MapReduce implementation should match to at least the fourth decimal place.

This is a complex assignment. We would suggest breaking the implementation into the following steps:

  1. First, copy the reference PageRank implementation into your own assignments repo (renaming the classes appropriately). Make sure you can get it to run and output the correct results with ordinary PageRank.
  2. Simplify the code; e.g., remove the code that uses in-mapper combiners.
  3. Implement personalized PageRank from a single source; that is, if the user sets option -sources w,x,y,z, simply ignore x,y,z and run personalized PageRank with respect to w. This can be accomplished with the existing PageRankNode, which holds a single floating point value.
  4. Extend the PageRankNode class to store an array of floats (length of array is the number of sources) instead of a single float. Make sure single-source personalized PageRank still runs.
  5. Implement multi-source personalized PageRank.

In particular, #3 is a nice checkpoint. If you're not able to get multi-source personalized PageRank to work, at least completing the single-source implementation will allow us to give you partial credit.

Running on the Datasci cluster

Once you get your implementation working and debugged in the student Linux environment, you're going to run your code on a non-trivial graph: the link structure of (all of) Wikipedia. The adjacency lists are stored in /data/cs451/enwiki-20180901-adj.txt in HDFS on the Datasci cluster. The graph has 14,099,242 vertices and 117,394,777 edges.

First, convert the adjacency list into PageRank node records:

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
   -input /data/cs451/enwiki-20180901-adj.txt -output cs451-bigdatateach-a4-wiki-PageRankRecords \
   -numNodes 14099242 -sources 73273,73276

Next, partition the graph (hash partitioning) and get ready to iterate:

$ hadoop fs -mkdir cs451-bigdatateach-a4-wiki-PageRank

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.PartitionGraph \
   -input cs451-bigdatateach-a4-wiki-PageRankRecords \
   -output cs451-bigdatateach-a4-wiki-PageRank/iter0000 -numPartitions 10 -numNodes 14099242

After setting everything up, iterate multi-source personalized PageRank:

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
   -base cs451-bigdatateach-a4-wiki-PageRank -numNodes 14099242 -start 0 -end 20 -sources 73273,73276

On the Datasci cluster, each iteration shouldn't take more than a couple of minutes to complete. If it's taking more than five minutes per iteration, you're doing something wrong.

Finally, run a program to extract the top ten personalized PageRank values, with respect to each source.

$ hadoop jar target/assignments-1.0.jar \
   ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
   -input cs451-bigdatateach-a4-wiki-PageRank/iter0020 -output cs451-bigdatateach-a4-wiki-PageRank-top10 \
   -top 10 -sources 73273,73276

In the example above, we're running personalized PageRank with respect to two sources: 73273 and 73276. What articles do these ids correspond to? There's a file on HDFS at /data/cs451/enwiki-20180901-titles.txt that provides the answer. How do you know if your implementation is correct? You can sanity check your results by taking the ids and looking up the articles that they correspond to. Do the results make sense?

Turning in the Assignment

Please follow these instructions carefully!

Make sure your repo has the following items:

Make sure your implementation runs in the Linux student CS environment on the Gnutella graph and on the Wikipedia graph on the Datasci cluster.

In bigdata2019w/assignment4.md, tell us if you were able to successfully complete the assignment. If we can't get your code to run, we need to know whether that's because you weren't able to complete the assignment successfully or because of some other issue. If you were not able to implement everything, describe how far you've gotten. Feel free to use this space to tell us anything else we should look for in your implementation.

Also, in this file, copy the output of your implementation on the Datasci cluster, i.e., personalized PageRank with respect to vertices 73273 and 73276. This will give us something to look at and check if we can't get your code to run successfully. Something that looks like this:

Source: 73273
0.XXXXX XXX
...

Source: 73276
0.XXXXX XXX
...

When grading, we will clone your repo and use the below check scripts:

When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we would recommend that you verify everything above works by performing a clean clone of your repo and running the public check scripts.

That's it!

Grading

The entire assignment is worth 55 points:

In our private check scripts, we will specify arbitrary source nodes to verify the correctness of your implementation.

Note that this grading scheme discretizes each component of the assignment. For example, if you implement everything and it works correctly on the Linux Student CS environment, but can't get it to scale on the Datasci cluster to the larger graph, you'll receive 35 out of 55 points. On the other hand, if you implement single-source personalized PageRank correctly and it runs in both the Linux Student CS environment and the Datasci cluster, you will receive 25 out of 55 points. And combinations thereof.

Reference Running Times

To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.

Class name                            Running time (Linux)         Running time (Datasci)
BuildPersonalizedPageRankRecords      15 seconds                   1 minute 50 seconds
PartitionGraph                        15 seconds                   2 minutes 50 seconds
RunPersonalizedPageRankBasic          75 seconds (20 iterations)   35 minutes (20 iterations)
ExtractTopPersonalizedPageRankNodes   15 seconds                   45 seconds
