Assignment 3: Inverted Indexing due 2:30pm February 7

This assignment is to be completed in MapReduce in Java. You will be working in the same repo as before, except that everything should go into the package namespace ca.uwaterloo.cs451.a3.

Look at the inverted indexing and boolean retrieval implementation in Bespin. Make sure you understand the code. Starting from the inverted indexing baseline BuildInvertedIndex, modify the indexer code in the following ways:

1. Index Compression. The index should be compressed using VInts: see org.apache.hadoop.io.WritableUtils. You should also use gap-compression (i.e., delta-compression) techniques as appropriate.

2. Buffering postings. The baseline indexer implementation currently buffers and sorts postings in the reducer, which as we discussed in class is not a scalable solution. Address this scalability bottleneck using techniques we discussed in class and in the textbook.

3. Term partitioning. The baseline indexer implementation currently uses only one reducer and therefore all postings lists are shuffled to the same node and written to HDFS in a single partition. Change this so we can specify the number of reducers (hence, partitions) as a command-line argument. This is, of course, easy to do, but we need to make sure that the searcher understands this partitioning also.

Note: The major scalability issue is buffering uncompressed postings in memory. In your solution, you'll still end up buffering each postings list, but in compressed form (raw bytes, no additional object overhead). This is fine because if you use the right compression technique, the postings lists are quite small. As a data point, on a collection of 50 million web pages, 2GB heap is more than enough for a full positional index (and in this assignment you're not asked to store positional information in your postings).

To go into a bit more detail: in the reference implementation, the final key type is PairOfWritables<IntWritable, ArrayListWritable<PairOfInts>>. The most obvious idea is to change that into something like PairOfWritables<VIntWritable, ArrayListWritable<PairOfVInts>>. This does not work! The reason is that you will still be materializing each posting, i.e., all PairOfVInts objects in memory. This translates into a Java object for every posting, which is wasteful in terms of memory usage and will exhaust memory pretty quickly as you scale. In other words, you're still buffering objects—just inside the ArrayListWritable.

This new indexer should be named BuildInvertedIndexCompressed. This new class should be in the package ca.uwaterloo.cs451.a3. Make sure it works on the Shakespeare collection.

Modify BooleanRetrieval so that it works with the new compressed indexes. Name this new class BooleanRetrievalCompressed. This new class should be in the same package as above and give the same output as the old version.

Use BuildInvertedIndex and BooleanRetrieval from Bespin as your starting points. That is, copy over into your repo, rename, and begin your assignment from there. Don't unnecessarily change code not directly related to points #1-#3 above. In particular, do not change how the documents are tokenized, etc. in BuildInvertedIndex (otherwise there's no good way to check for the correctness of your algorithm). Also, do not change the fetchLine method in BooleanRetrieval so that everyone's output looks the same.

In more detail, make sure that you can build the inverted index with the following command (make sure your implementation runs in the Linux student CS environment, as that is where we will be doing the marking):

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
   -input data/Shakespeare.txt -output cs451-bigdatateach-a3-index-shakespeare -reducers 4

We should be able to control the number of partitions (#3 above) with the -reducers option. That is, the code should give the correct results no matter what we set the value to.

Once we build the index, we should then be able to run a boolean query as follows (in exactly the same manner as BooleanRetrieval in Bespin):

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "outrageous fortune AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "white red OR rose AND pluck AND"

Of course, we will try your program with additional queries to verify its correctness.

Answer the following question:

Question 1. What is the size of your compressed index for Shakespeare collection? Just so we're using the same units, report the output of du -h.

Running on the Datasci cluster

Now let's try running your implementation on the Datasci cluster, on the sample Wikipedia file /data/cs451/enwiki-20180901-sentences-0.1sample.txt on HDFS:

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
   -input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -output cs451-bigdatateach-a3-index-wiki -reducers 4

The Wikipedia sample contains a sentence on each line, so each "document" is actually a sentence. Each sentence begins with the article title and the sentence id, e.g., "Anarchism.0004" is sentence 4 from the article "Anarchism".

And let's try running a query:

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-wiki \
   -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -query "waterloo stanford OR cheriton AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-wiki \
   -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -query "big data AND hadoop spark OR AND"

Answer the following questions:

Question 2. What is the size of your compressed index for the sample Wikipedia collection? Just so we're using the same units, report the output of hadoop fs -du -h.

Question 3. What are the "documents" (article + sentence) retrieved in response to the query "waterloo stanford OR cheriton AND"?

Question 4. What are the "documents" (article + sentence) retrieved in response to the query "big data AND hadoop spark OR AND"?

Turning in the Assignment

Please follow these instructions carefully!

Make sure your repo has the following items:

Make sure your implementation runs in the Linux student CS environment on the Shakespeare collection and also on sample Wikipedia file /data/cs451/enwiki-20180901-sentences-0.1sample.txt on HDFS in the Datasci cluster, per above.

Specifically, we will clone your repo and use the below check scripts:

When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we would recommend that you verify everything above works by performing a clean clone of your repo and run the public check scripts.

That's it!

Grading

This assignment is worth a total of 50 points, broken down as follows:

Reference Running Times

To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.

Class name Running time Linux Running time Datasci
BuildInvertedIndexCompressed 20 seconds 5 minutes
BooleanRetrievalCompressed < 100 ms (without hadoop startup time) < 250 ms (without hadoop startup time)

Back to top