This assignment is to be completed in MapReduce in Java. You will
be working in the same repo as before, except that everything should
go into the package namespace ca.uwaterloo.cs451.a3.
Look at the inverted indexing and boolean retrieval implementation
in Bespin. Make sure you understand the code. Starting from the
baseline inverted indexing implementation BuildInvertedIndex,
modify the indexer code in the following ways:
1. Index Compression. The index should be compressed using
org.apache.hadoop.io.WritableUtils. You should also
use gap-compression (i.e., delta-compression) techniques as appropriate.
2. Buffering postings. The baseline indexer implementation currently buffers and sorts postings in the reducer, which as we discussed in class is not a scalable solution. Address this scalability bottleneck using techniques we discussed in class and in the textbook.
3. Term partitioning. The baseline indexer implementation currently uses only one reducer and therefore all postings lists are shuffled to the same node and written to HDFS in a single partition. Change this so we can specify the number of reducers (hence, partitions) as a command-line argument. This is, of course, easy to do, but we need to make sure that the searcher understands this partitioning also.
Note: The major scalability issue is buffering uncompressed postings in memory. In your solution, you'll still end up buffering each postings list, but in compressed form (raw bytes, no additional object overhead). This is fine because if you use the right compression technique, the postings lists are quite small. As a data point, on a collection of 50 million web pages, 2GB heap is more than enough for a full positional index (and in this assignment you're not asked to store positional information in your postings).
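To make the gap-compression idea concrete, here is a minimal, self-contained sketch in plain Java. Note that the `writeVarInt`/`readVarInt` helpers below are illustrative stand-ins: in your actual indexer you should use org.apache.hadoop.io.WritableUtils.writeVInt/readVInt (Hadoop's on-disk VInt format differs from this one). The point is only to show why storing gaps instead of absolute docids makes variable-length encoding effective:

```java
import java.io.*;
import java.util.*;

public class GapEncodingDemo {
    // Stand-in for WritableUtils.writeVInt: 7 bits per byte,
    // high bit set means "more bytes follow". Small values take 1 byte.
    static void writeVarInt(DataOutput out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    static int readVarInt(DataInput in) throws IOException {
        int v = 0, shift = 0, b;
        do {
            b = in.readUnsignedByte();
            v |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // Encode a sorted postings list as gaps: store each docid's
    // difference from the previous one. Gaps are small, so most fit in one byte.
    public static byte[] encode(int[] docids) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        int prev = 0;
        for (int d : docids) {
            writeVarInt(out, d - prev);
            prev = d;
        }
        return bytes.toByteArray();
    }

    public static int[] decode(byte[] data, int n) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int[] docids = new int[n];
        int prev = 0;
        for (int i = 0; i < n; i++) {
            prev += readVarInt(in);  // undo the gap encoding
            docids[i] = prev;
        }
        return docids;
    }

    public static void main(String[] args) throws IOException {
        int[] postings = {27, 38, 43, 68, 1024};
        byte[] compressed = encode(postings);
        System.out.println("compressed size: " + compressed.length + " bytes");
        System.out.println(Arrays.toString(decode(compressed, postings.length)));
    }
}
```

Five postings compress to six bytes here (the gaps 27, 11, 5, 25 each take one byte; only the gap 956 needs two), versus twenty bytes for fixed-width ints.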
To go into a bit more detail: in the reference implementation, the
final value type is
PairOfWritables<IntWritable, ArrayListWritable<PairOfInts>>. The most obvious idea
is to change that into something like
PairOfWritables<VIntWritable, ArrayListWritable<PairOfVInts>>. This does not work!
The reason is that you will still be materializing each posting, i.e.,
PairOfVInts objects, in memory. This translates into a
Java object for every posting, which is wasteful in terms of memory
usage and will exhaust memory pretty quickly as you scale. In other
words, you're still buffering objects, just inside a different wrapper.
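One way to sketch the raw-bytes alternative: accumulate each posting directly as compressed bytes, so the only per-term memory cost is a growing byte buffer rather than one object per posting. The class below is a hypothetical helper, not part of Bespin; the hand-rolled varint again stands in for WritableUtils.writeVInt, which is what you would actually call:

```java
import java.io.*;

// Sketch: buffer each postings list as compressed raw bytes.
// No PairOfVInts (or any other) object is created per posting.
public class PostingsBuffer {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);
    private int prevDocid = 0;
    private int df = 0;  // document frequency

    // Append one (docid, tf) posting; docids must arrive in sorted order,
    // which the MapReduce framework guarantees if you sort by docid.
    public void add(int docid, int tf) throws IOException {
        writeVarInt(out, docid - prevDocid);  // gap, not absolute docid
        writeVarInt(out, tf);
        prevDocid = docid;
        df++;
    }

    public int getDf() { return df; }

    // The compressed postings list, e.g., ready to wrap in a BytesWritable.
    public byte[] toByteArray() throws IOException {
        out.flush();
        return bytes.toByteArray();
    }

    // Illustrative stand-in for WritableUtils.writeVInt.
    private static void writeVarInt(DataOutput o, int v) throws IOException {
        while ((v & ~0x7F) != 0) { o.writeByte((v & 0x7F) | 0x80); v >>>= 7; }
        o.writeByte(v);
    }

    public static void main(String[] args) throws IOException {
        PostingsBuffer buf = new PostingsBuffer();
        buf.add(3, 1);
        buf.add(7, 2);
        System.out.println("df=" + buf.getDf() + ", bytes=" + buf.toByteArray().length);
    }
}
```

Here two postings occupy four bytes of buffer and zero posting objects, which is exactly the "compressed form, raw bytes, no additional object overhead" the note above describes.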
This new indexer should be named
BuildInvertedIndexCompressed. This new class should
be in the package ca.uwaterloo.cs451.a3. Make
sure it works on the Shakespeare collection.
Similarly, modify BooleanRetrieval so that it works with the new
compressed indexes. Name this new class
BooleanRetrievalCompressed. This new class should
be in the same package as above and give the same
output as the old version.
Use BuildInvertedIndex and BooleanRetrieval from Bespin as your starting
points. That is, copy them over into your repo, rename, and begin your
assignment from there. Don't unnecessarily change code not directly
related to points #1-#3 above. In particular, do not change how
the documents are tokenized, etc.
(otherwise there's no good way to check the correctness of your
algorithm). Also, do not change how results are fetched and printed in
BooleanRetrieval, so that everyone's output
looks the same.
In more detail, make sure that you can build the inverted index with the following command (make sure your implementation runs in the Linux student CS environment, as that is where we will be doing the marking):
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
    -input data/Shakespeare.txt -output cs451-bigdatateach-a3-index-shakespeare -reducers 4
We should be able to control the number of partitions (#3 above)
with the -reducers option. That is, the code should give
the correct results no matter what we set the value to.
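For this to work, the searcher has to know which partition holds a given term. If the indexer keeps Hadoop's default HashPartitioner on the term key, the searcher can reproduce the assignment itself; the sketch below shows the default formula (the class and method names are hypothetical, only the formula is Hadoop's):

```java
public class PartitionDemo {
    // Hadoop's default HashPartitioner assigns a key to partition:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // The searcher can recompute this to pick the right part-r-0000N file.
    public static int partitionFor(int keyHash, int numReducers) {
        return (keyHash & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        // Caveat: if your MapReduce key type is Text, hash the Text object
        // (its hashCode is computed over the UTF-8 bytes), not the Java
        // String -- the two hash functions differ.
        for (String term : new String[]{"outrageous", "fortune"}) {
            System.out.println(term + " -> partition "
                + partitionFor(term.hashCode(), numReducers));
        }
    }
}
```

With one reducer every term maps to partition 0, which is why the baseline's single-partition behavior falls out as a special case.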
Once we build the index, we should then be able to run a boolean
query as follows (in exactly the same manner as
BooleanRetrieval in Bespin):
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
    -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
    -query "outrageous fortune AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
    -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
    -query "white red OR rose AND pluck AND"
Of course, we will try your program with additional queries to verify its correctness.
Answer the following question:
Question 1. What is the size of your compressed
index for the Shakespeare collection? Just so we're using the same units,
report the output of du -h.
Now let's try running your implementation on the Datasci cluster,
on the sample Wikipedia collection:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
    -input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
    -output cs451-bigdatateach-a3-index-wiki -reducers 4
The Wikipedia sample contains a sentence on each line, so each "document" is actually a sentence. Each sentence begins with the article title and the sentence id, e.g., "Anarchism.0004" is sentence 4 from the article "Anarchism".
And let's try running a query:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
    -index cs451-bigdatateach-a3-index-wiki \
    -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
    -query "waterloo stanford OR cheriton AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
    -index cs451-bigdatateach-a3-index-wiki \
    -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
    -query "big data AND hadoop spark OR AND"
Answer the following questions:
Question 2. What is the size of your compressed
index for the sample Wikipedia collection? Just so we're using the
same units, report the output of
hadoop fs -du -h.
Question 3. What are the "documents" (article + sentence)
retrieved in response to the query
"waterloo stanford OR cheriton AND"?
Question 4. What are the "documents" (article + sentence)
retrieved in response to the query
"big data AND hadoop spark OR AND"?
Please follow these instructions carefully!
Make sure your repo has the following items:
Make sure your implementation runs in the Linux student CS
environment on the Shakespeare collection and also on the sample Wikipedia
collection on HDFS in the Datasci cluster, per above.
Specifically, we will clone your repo and use the following check scripts:
- check_assignment3_public_linux.py in the Linux Student CS environment.
- check_assignment3_public_datasci.py on the Datasci cluster.
When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we would recommend that you verify everything above works by performing a clean clone of your repo and running the public check scripts.
This assignment is worth a total of 50 points, broken down as follows:
Your code runs check_assignment3_public_linux.py (on Linux Student CS) and
check_assignment3_public_datasci.py (on the Datasci cluster) successfully without any errors.
To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.
| Class name | Running time Linux | Running time Datasci |
|---|---|---|
| BuildInvertedIndexCompressed | 20 seconds | 5 minutes |
| BooleanRetrievalCompressed | < 100 ms (without hadoop startup time) | < 250 ms (without hadoop startup time) |