For this assignment, you will be working in the same repo as before, except that everything should go into the package namespace ca.uwaterloo.cs451.a4.
Begin by taking the time to understand
the PageRank
reference implementation in Bespin
(particularly RunPageRankBasic
). For this assignment,
you are going to implement multiple-source personalized PageRank. As
we discussed in class, personalized PageRank is different from
ordinary PageRank in a few respects:

- There is the notion of a source node, which is what we are computing the personalization with respect to.
- When initializing PageRank, instead of a uniform distribution across all nodes, the source node gets all of the probability mass.
- Whenever the model makes a random jump, it always jumps back to the source node, rather than to any node with equal probability.
- All mass lost at dangling nodes is put back into the source node, rather than being redistributed across all nodes.
Here are some publications about personalized PageRank if you're interested. They're just provided for background; neither is necessary for completing the assignment.
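To make the setup concrete, here is a minimal sequential sketch of single-source personalized PageRank on a toy in-memory graph, assuming the standard formulation in which random-jump mass and dangling-node mass both return to the source. The class name and the jump probability of 0.15 are illustrative choices, not part of the assignment spec; this is roughly what Bespin's sequential helper computes.

```java
import java.util.*;

public class PersonalizedPageRankSketch {
  // Probability of a random jump; in personalized PageRank the jump always
  // goes back to the source node instead of to a random node.
  static final double ALPHA = 0.15;

  // adj maps each node id to its out-links; every node must appear as a key.
  static Map<Integer, Double> run(Map<Integer, int[]> adj, int source, int iterations) {
    Map<Integer, Double> pr = new HashMap<>();
    for (int node : adj.keySet()) pr.put(node, node == source ? 1.0 : 0.0);

    for (int iter = 0; iter < iterations; iter++) {
      Map<Integer, Double> next = new HashMap<>();
      for (int node : adj.keySet()) next.put(node, 0.0);

      double danglingMass = 0.0;
      for (Map.Entry<Integer, int[]> e : adj.entrySet()) {
        int[] out = e.getValue();
        double mass = pr.get(e.getKey());
        if (out.length == 0) { danglingMass += mass; continue; } // dangling node
        for (int v : out) next.put(v, next.get(v) + mass / out.length);
      }

      // Random-jump mass and dangling mass both return to the source only.
      for (int node : adj.keySet()) {
        double link = next.get(node) + (node == source ? danglingMass : 0.0);
        next.put(node, (node == source ? ALPHA : 0.0) + (1.0 - ALPHA) * link);
      }
      pr = next;
    }
    return pr;
  }
}
```

Note that, unlike ordinary PageRank, the mass here does not sum to a uniform prior: a node far from the source can end up with essentially zero mass.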
Your implementation is going to run multiple personalized PageRank
computations in parallel, one with respect to each source. The sources
will be specified on the command line. This means that each
PageRank node object (i.e., Writable
) is going to contain
an array of PageRank values.
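As a concrete sketch of what such a record might look like, here is a hypothetical node class serialized with plain java.io streams so the example is self-contained; in your actual implementation this logic would live in your modified Hadoop Writable's write() and readFields() methods. The class and accessor names are illustrative, not prescribed by the assignment.

```java
import java.io.*;

public class MultiSourceNodeSketch {
  private int nodeId;
  private float[] pageranks; // pageranks[i] = PageRank value w.r.t. the i-th source

  public MultiSourceNodeSketch() {} // needed for deserialization
  public MultiSourceNodeSketch(int nodeId, float[] pageranks) {
    this.nodeId = nodeId;
    this.pageranks = pageranks;
  }

  // Serialize: node id, then the array length (= number of sources), then values.
  public void write(DataOutput out) throws IOException {
    out.writeInt(nodeId);
    out.writeInt(pageranks.length);
    for (float p : pageranks) out.writeFloat(p);
  }

  // Deserialize in the same order.
  public void readFields(DataInput in) throws IOException {
    nodeId = in.readInt();
    pageranks = new float[in.readInt()];
    for (int i = 0; i < pageranks.length; i++) pageranks[i] = in.readFloat();
  }

  public int getNodeId() { return nodeId; }
  public float[] getPageRanks() { return pageranks; }
}
```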
Start by working with the Gnutella graph (a snapshot of the Gnutella peer-to-peer file sharing network from August 2002), the same as the one used in the Bespin demo:
$ mkdir data
$ curl http://lintool.github.io/bespin-data/p2p-Gnutella08-adj.txt > data/p2p-Gnutella08-adj.txt
Here's how the implementation is going to work: it largely follows
the reference implementation of RunPageRankBasic
. You
must make your implementation work with respect to the command-line
invocations specified below.
First, convert the adjacency list into PageRank node records:
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
    -input data/p2p-Gnutella08-adj.txt -output cs451-bigdatateach-a4-Gnutella-PageRankRecords \
    -numNodes 6301 -sources 367,249,145
The -sources
option specifies the source nodes for the
personalized PageRank computations. In this case, we're running three
computations in parallel, with respect to node ids 367, 249, and
145. You can expect the option value to be in the form of a
comma-separated list, and that all node ids actually exist in the
graph. The list of source nodes may be arbitrarily long, but for
practical purposes we won't test your code with more than a few.
Since we're running three personalized PageRank computations in parallel, each PageRank node is going to hold an array of three values, the personalized PageRank values with respect to the first source, second source, and third source. You can expect the array positions to correspond exactly to the position of the node id in the source string.
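For example, with -sources 367,249,145, array position 0 holds the value with respect to 367, position 1 with respect to 249, and position 2 with respect to 145. A sketch of the option parsing (the class and method names here are hypothetical, just to illustrate the id-to-position mapping):

```java
import java.util.*;

public class SourceIndex {
  // Parse the -sources option value: the i-th id in the comma-separated
  // list corresponds to array position i in each node's PageRank array.
  static Map<Integer, Integer> parse(String sources) {
    Map<Integer, Integer> index = new LinkedHashMap<>();
    String[] parts = sources.split(",");
    for (int i = 0; i < parts.length; i++) {
      index.put(Integer.parseInt(parts[i].trim()), i);
    }
    return index;
  }
}
```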
Next, partition the graph (hash partitioning) and get ready to iterate:
$ hadoop fs -mkdir cs451-bigdatateach-a4-Gnutella-PageRank
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.PartitionGraph \
    -input cs451-bigdatateach-a4-Gnutella-PageRankRecords \
    -output cs451-bigdatateach-a4-Gnutella-PageRank/iter0000 -numPartitions 5 -numNodes 6301
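Hash partitioning simply assigns each node to one of the partitions by hashing its id. As a sketch (the class name is hypothetical), this mirrors what Hadoop's default HashPartitioner computes for an int key, whose hash code is the value itself:

```java
public class HashPartitionSketch {
  // Mirrors Hadoop's HashPartitioner:
  // (key.hashCode() & Integer.MAX_VALUE) % numPartitions.
  // The & Integer.MAX_VALUE clears the sign bit so the result is non-negative.
  static int partition(int nodeId, int numPartitions) {
    return (Integer.hashCode(nodeId) & Integer.MAX_VALUE) % numPartitions;
  }
}
```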
After setting everything up, iterate multi-source personalized PageRank:
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
    -base cs451-bigdatateach-a4-Gnutella-PageRank -numNodes 6301 -start 0 -end 20 -sources 367,249,145
Note that the sources are passed in from the command-line again. Here, we're running twenty iterations.
Finally, run a program to extract the top ten personalized PageRank values, with respect to each source.
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
    -input cs451-bigdatateach-a4-Gnutella-PageRank/iter0020 -output cs451-bigdatateach-a4-Gnutella-PageRank-top10 \
    -top 10 -sources 367,249,145
The above program should print the following answer to stdout:
Source: 367
0.35257 367
0.04181 264
0.03889 266
0.03888 559
0.03883 5
0.03860 1317
0.03824 666
0.03817 7
0.03799 4
0.00850 249

Source: 249
0.34089 249
0.04034 251
0.03721 762
0.03688 123
0.03686 250
0.03668 753
0.03627 755
0.03623 351
0.03622 983
0.00949 367

Source: 145
0.36937 145
0.04195 1317
0.04120 265
0.03847 390
0.03606 367
0.03566 246
0.03525 667
0.03519 717
0.03513 149
0.03496 2098
To make the final output easier to read, in the class ExtractTopPersonalizedPageRankNodes, use the following format to print each (personalized PageRank value, node id) pair:
String.format("%.5f %d", pagerank, nodeid)
This will generate the final results in the same format as above. Also note: print actual probabilities, not log probabilities (although during the actual PageRank computation you should keep the values as log probabilities).
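Since mass is kept as log probabilities during the computation, adding two values requires the usual numerically stable log-sum trick; Bespin's reference implementation contains a similar helper. A sketch (the class and method names are illustrative):

```java
public class LogProbs {
  // Computes log(exp(a) + exp(b)) without underflow: adds two probabilities
  // that are stored as log probabilities.
  static double sumLogProbs(double a, double b) {
    if (a == Double.NEGATIVE_INFINITY) return b; // log(0) + anything
    if (b == Double.NEGATIVE_INFINITY) return a;
    double max = Math.max(a, b);
    return max + Math.log1p(Math.exp(Math.min(a, b) - max));
  }
}
```

Only at output time would you convert back with Math.exp() before printing.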
The final class ExtractTopPersonalizedPageRankNodes
does not need to be a MapReduce job (but it does need to read from
HDFS). Obviously, the other classes need to run MapReduce jobs.
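One common way to structure this non-MapReduce extraction step is a single pass over the final iteration's records, keeping the top k entries per source in a bounded min-heap. A sketch of just the top-k selection (the class and method names are hypothetical; in the real program the pairs would be read from the part files on HDFS, one heap per source):

```java
import java.util.*;

public class TopK {
  // Keep the k largest (pagerank, nodeid) pairs using a bounded min-heap;
  // each pair is a double[] of {pagerank, nodeid}.
  static List<double[]> topK(List<double[]> pairs, int k) {
    PriorityQueue<double[]> heap =
        new PriorityQueue<>(Comparator.comparingDouble((double[] p) -> p[0]));
    for (double[] p : pairs) {
      heap.offer(p);
      if (heap.size() > k) heap.poll(); // evict the current smallest
    }
    List<double[]> result = new ArrayList<>(heap);
    result.sort((x, y) -> Double.compare(y[0], x[0])); // descending for printing
    for (double[] p : result) {
      System.out.println(String.format("%.5f %d", p[0], (int) p[1]));
    }
    return result;
  }
}
```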
The reference implementation of PageRank in Bespin has many options: you can either use in-mapper combining or ordinary combiners. In your implementation, use ordinary combiners. Also, the reference implementation has an option to either use range partitioning or hash partitioning: you only need to implement hash partitioning. You can start with the reference implementation and remove code that you don't need (see #2 below).
To help you out, there's a small helper program in Bespin that computes personalized PageRank using a sequential algorithm. Use it to check your answers:
$ hadoop jar target/bespin-1.0.5-SNAPSHOT-fatjar.jar io.bespin.java.mapreduce.pagerank.SequentialPersonalizedPageRank \
    -input data/p2p-Gnutella08-adj.txt -source 367
Note that this isn't actually a MapReduce job; we're simply using
Hadoop to run the main
for convenience. The values from
your implementation should be pretty close to the output of the above
program, but might differ a bit due to convergence issues. After 20
iterations, the output of the MapReduce implementation should match to
at least the fourth decimal place.
This is a complex assignment. We would suggest breaking the implementation into the following steps:

1. Make sure you understand the reference PageRank implementation in Bespin.
2. Implement single-source personalized PageRank: if the user specifies -sources w,x,y,z, simply ignore x,y,z and run personalized PageRank with respect to w. This can be accomplished with the existing PageRankNode, which holds a single floating point value.
3. Modify the PageRankNode class to store an array of floats (length of array is the number of sources) instead of a single float. Make sure single-source personalized PageRank still runs.
4. Implement multiple-source personalized PageRank.

In particular, #3 is a nice checkpoint. If you're not able to get the multiple-source personalized PageRank to work, at least completing the single-source implementation will allow us to give you partial credit.
Once you get your implementation working and debugged in the student Linux
environment, you're going to run your code on a non-trivial graph: the
link structure of (all of) Wikipedia. The adjacency lists are stored
in /data/cs451/enwiki-20180901-adj.txt
in HDFS on the Datasci cluster.
The graph has 14,099,242 vertices and 117,394,777 edges.
First, convert the adjacency list into PageRank node records:
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.BuildPersonalizedPageRankRecords \
    -input /data/cs451/enwiki-20180901-adj.txt -output cs451-bigdatateach-a4-wiki-PageRankRecords \
    -numNodes 14099242 -sources 73273,73276
Next, partition the graph (hash partitioning) and get ready to iterate:
$ hadoop fs -mkdir cs451-bigdatateach-a4-wiki-PageRank
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.PartitionGraph \
    -input cs451-bigdatateach-a4-wiki-PageRankRecords \
    -output cs451-bigdatateach-a4-wiki-PageRank/iter0000 -numPartitions 10 -numNodes 14099242
After setting everything up, iterate multi-source personalized PageRank:
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.RunPersonalizedPageRankBasic \
    -base cs451-bigdatateach-a4-wiki-PageRank -numNodes 14099242 -start 0 -end 20 -sources 73273,73276
On the Datasci cluster, each iteration shouldn't take more than a couple of minutes to complete. If it's taking more than five minutes per iteration, you're doing something wrong.
Finally, run a program to extract the top ten personalized PageRank values, with respect to each source.
$ hadoop jar target/assignments-1.0.jar \
    ca.uwaterloo.cs451.a4.ExtractTopPersonalizedPageRankNodes \
    -input cs451-bigdatateach-a4-wiki-PageRank/iter0020 -output cs451-bigdatateach-a4-wiki-PageRank-top10 \
    -top 10 -sources 73273,73276
In the example above, we're running personalized PageRank with
respect to two sources: 73273 and 73276. What articles do these ids
correspond to? There's a file on HDFS
at /data/cs451/enwiki-20180901-titles.txt
that provides the
answer. How do you know if your implementation is correct? You can
sanity check your results by taking the ids and looking up the
articles that they correspond to. Do the results make sense?
Please follow these instructions carefully!
Make sure your repo has the following items:

- A brief writeup in bigdata2019w/assignment4.md (more below).
- Your implementation, in the package ca.uwaterloo.cs451.a4.

Make sure your implementation runs in the Linux student CS environment on the Gnutella graph, and on the Wikipedia graph on the Datasci cluster.
In bigdata2019w/assignment4.md
, tell us if you were
able to successfully complete the assignment. This is in case we can't
get your code to run, we need to know if it is because you weren't able
to complete the assignment successfully, or if it is due to some other
issue. If you were not able to implement everything, describe how far
you've gotten. Feel free to use this space to tell us additional
things we should look for in your implementation.
Also, in this file, copy the output of your implementation on the Datasci cluster, i.e., personalized PageRank with respect to vertices 73273 and 73276. This will give us something to look at and check if we can't get your code to run successfully. Something that looks like this:
Source: 73273
0.XXXXX XXX
...

Source: 73276
0.XXXXX XXX
...
When grading, we will clone your repo and use the following check scripts:

- check_assignment4_public_linux.py in the Linux Student CS environment.
- check_assignment4_public_datasci.py on the Datasci cluster.

When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we would recommend that you verify everything above works by performing a clean clone of your repo and running the public check scripts.
That's it!
The entire assignment is worth 55 points.
In our private check scripts, we will specify arbitrary source nodes to verify the correctness of your implementation.
Note that this grading scheme discretizes each component of the assignment. For example, if you implement everything and it works correctly on the Linux Student CS environment, but can't get it to scale on the Datasci cluster to the larger graph, you'll receive 35 out of 55 points. On the other hand, if you implement single-source personalized PageRank correctly and it runs in both the Linux Student CS environment and the Datasci cluster, you will receive 25 out of 55 points. And combinations thereof.
To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.
| Class name | Running time (Linux) | Running time (Datasci) |
|---|---|---|
| BuildPersonalizedPageRankRecords | 15 seconds | 1 minute 50 seconds |
| PartitionGraph | 15 seconds | 2 minutes 50 seconds |
| RunPersonalizedPageRankBasic | 75 seconds (20 iterations) | 35 minutes (20 iterations) |
| ExtractTopPersonalizedPageRankNodes | 15 seconds | 45 seconds |