Data-Intensive Distributed Computing

Assignment 2: Counting in Spark (due 2:30pm January 31)

In this assignment you will do two things:

"Port" the MapReduce implementations of the bigram frequency count program from Bespin over to Spark (in Scala).
"Port" the MapReduce implementations of assignment 1 over to Spark (in Scala).

Bigram Relative Frequency

Your starting points are ComputeBigramRelativeFrequencyPairs and ComputeBigramRelativeFrequencyStripes in package io.bespin.java.mapreduce.bigram (in Java). You are welcome to build on the BigramCount (Scala) implementation here for tokenization and "boilerplate" code like command-line argument parsing. To be consistent in tokenization, you should use (i.e., import) the Tokenizer trait here.

Put your code in the package ca.uwaterloo.cs451.a2. Since you'll be writing Scala code, your source files should go into src/main/scala/ca/uwaterloo/cs451/a2/. The repository is designed so that Scala/Spark code will also compile with the same Maven build command:

$ mvn clean package

Following the Java implementations, you will write both a "pairs" and a "stripes" implementation in Spark. Note that although Spark has a different API than MapReduce, the algorithmic concepts are still very much applicable. Your pairs and stripes implementation should follow the same logic as in the MapReduce implementations. In particular, your program should only take one pass through the input data.
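As a rough illustration of what the "pairs" computation produces, here is a minimal sketch on plain Scala collections rather than Spark RDDs. The object name, the whitespace tokenizer, and the use of "*" as the marginal key are assumptions made for this example; the real code should use the course Tokenizer trait. The sketch looks up the marginal in an in-memory map, so it shows only the expected result, not how to arrange partitioning and key ordering so that a Spark job sees the marginal before the bigrams in a single pass.

```scala
object BigramPairsSketch {
  // Hypothetical stand-in for the course Tokenizer trait
  // (assumption: lowercase, split on whitespace).
  def tokenize(line: String): List[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Map-like step: for each bigram emit ((w1, w2), 1) and a marginal ((w1, "*"), 1).
  // Assumes the literal token "*" never appears in the text.
  def emit(line: String): List[((String, String), Int)] = {
    val tokens = tokenize(line)
    tokens.zip(tokens.drop(1)).flatMap { case (a, b) =>
      List(((a, b), 1), ((a, "*"), 1))
    }
  }

  // Reduce-like step: sum counts per key, then divide each bigram count
  // by the marginal count of its left word. The marginal itself stays a count.
  def relativeFrequencies(lines: Seq[String]): Map[(String, String), Double] = {
    val counts: Map[(String, String), Int] =
      lines.flatMap(emit).groupBy(_._1).map { case (key, recs) =>
        key -> recs.map(_._2).sum
      }
    counts.map {
      case ((w1, "*"), c) => (w1, "*") -> c.toDouble
      case ((w1, w2), c)  => (w1, w2) -> c.toDouble / counts((w1, "*"))
    }
  }
}
```

For the two lines "a b a b" and "a c", the left word "a" has a marginal of 3, so (a, b) comes out as 2/3 and (a, c) as 1/3.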

Make sure your implementation runs in the Linux student CS environment on the Shakespeare collection and also on the sample Wikipedia file /data/cs451/enwiki-20180901-sentences-0.1sample.txt on HDFS in the Datasci cluster. See the software page for more details on accessing the Datasci cluster.

You can verify the correctness of your algorithm by comparing the output of the MapReduce implementation with that of your Spark implementation.

Clarification on terminology: informally, we often refer to "mappers" and "reducers" in the context of Spark. That's shorthand for map-like transformations (map, flatMap, filter, mapPartitions, etc.) and reduce-like transformations (reduceByKey, groupByKey, aggregateByKey, etc.). Hopefully it's clear from lecture that while Spark represents a generalization of MapReduce, the notions of per-record processing (i.e., map-like transformations) and grouping/shuffling (i.e., reduce-like transformations) are shared across both frameworks.
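To make the map-like versus reduce-like distinction concrete, the sketch below runs a "stripes" style bigram count on plain Scala collections. The flatMap over documents plays the role of a map-like transformation, and the groupBy followed by a merge plays the role of a reduce-like one. All names here are hypothetical, the input is assumed to be already tokenized, and nothing in it uses the Spark API.

```scala
object StripesSketch {
  // A stripe maps each right-hand word to its count for one left-hand word.
  type Stripe = Map[String, Int]

  // Element-wise sum of two stripes: the associative merge that a
  // reduce-like step would apply to values sharing a key.
  def merge(a: Stripe, b: Stripe): Stripe =
    b.foldLeft(a) { case (acc, (w, c)) => acc.updated(w, acc.getOrElse(w, 0) + c) }

  // Map-like step: one (left word, single-entry stripe) record per bigram.
  def emit(tokens: List[String]): List[(String, Stripe)] =
    tokens.zip(tokens.drop(1)).map { case (a, b) => (a, Map(b -> 1)) }

  // Reduce-like step: group by left word, merge the stripes,
  // then normalize each stripe by its own total.
  def relativeFrequencies(docs: Seq[List[String]]): Map[String, Map[String, Double]] =
    docs.flatMap(emit)
      .groupBy(_._1)
      .map { case (w1, recs) =>
        val stripe = recs.map(_._2).reduce(merge)
        val total = stripe.values.sum.toDouble
        w1 -> stripe.map { case (w2, c) => w2 -> c / total }
      }
}
```

Because each stripe already holds every count for its left word, the marginal is just the sum of the stripe's values, so no separate marginal record is needed.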

We are going to run your code in the Linux student CS environment as follows (we will make sure the collection is there):

$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyPairs \
   target/assignments-1.0.jar --input data/Shakespeare.txt \
   --output cs451-lintool-a2-shakespeare-bigrams-pairs --reducers 5

$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyStripes \
   target/assignments-1.0.jar --input data/Shakespeare.txt \
   --output cs451-lintool-a2-shakespeare-bigrams-stripes --reducers 5

We are going to run your code on the Datasci cluster as follows (note the addition of the --num-executors, --executor-cores, and --executor-memory options):

$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyPairs \
   --num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
   --input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   --output cs451-lintool-a2-wiki-bigram-pairs --reducers 8

$ spark-submit --class ca.uwaterloo.cs451.a2.ComputeBigramRelativeFrequencyStripes \
   --num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
   --input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   --output cs451-lintool-a2-wiki-bigram-stripes --reducers 8

Important: Make sure that your code accepts the command-line parameters above!
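One dependency-free way to read the three program options is sketched below. The object, the case class, and the default of 1 for --reducers are all assumptions made for illustration. Building on the argument parsing in the BigramCount example is the route the assignment suggests, and this sketch is not a substitute for it.

```scala
object ArgsSketch {
  // Hypothetical container for the flags shown in the commands above.
  final case class Conf(input: String, output: String, reducers: Int)

  // Reads "--flag value" pairs and fails loudly if a required flag is missing.
  def parse(args: Array[String]): Conf = {
    val flags: Map[String, String] = args.grouped(2).collect {
      case Array(flag, value) if flag.startsWith("--") => flag.drop(2) -> value
    }.toMap

    def required(name: String): String =
      flags.getOrElse(name, sys.error(s"missing required option --$name"))

    // Assumption: fall back to a single reducer when --reducers is omitted.
    Conf(required("input"), required("output"),
      flags.get("reducers").map(_.toInt).getOrElse(1))
  }
}
```

Options such as --num-executors are consumed by spark-submit itself because they appear before the jar, so the program only sees what follows target/assignments-1.0.jar.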

When you run a Spark job (in distributed mode), you need to specify how many cluster resources to request. The option --num-executors specifies the number of executors, each with a certain number of cores specified by --executor-cores. So, in the above commands, we request a total of 8 workers (2 executors, 4 cores each).

The --reducers flag sets the amount of parallelism your program uses in the reduce stage. If the total number of workers is larger than --reducers, some of the workers will sit idle, since you've allocated more workers to the job than the parallelism you've specified in your program. If, on the other hand, --reducers is larger than the number of workers, your reduce tasks will queue up at the workers, i.e., a worker will be assigned more than one reduce task. In the Datasci commands above we set the two equal.
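The arithmetic behind the worker and reducer counts can be written out as a small sketch. The function names are hypothetical, and the "waves" model assumes all reduce tasks take about the same time, which real jobs rarely do.

```scala
object ParallelismSketch {
  // Total concurrent task slots requested: executors times cores per executor.
  def slots(numExecutors: Int, executorCores: Int): Int =
    numExecutors * executorCores

  // Number of rounds ("waves") needed to run all reduce tasks on the slots.
  def waves(reducers: Int, slots: Int): Int =
    (reducers + slots - 1) / slots

  // Slots left without a reduce task during the last wave.
  def idleSlots(reducers: Int, slots: Int): Int =
    waves(reducers, slots) * slots - reducers
}
```

With 2 executors of 4 cores each there are 8 slots, so --reducers 8 fills them in one wave with none idle. With --reducers 5 on the same 8 slots, 3 would sit idle, and --reducers 20 would need 3 waves.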

Note that the setting of these two parameters should not affect the correctness of your program. The setting above is a reasonable middle ground between having your jobs finish in a reasonable amount of time and not monopolizing cluster resources.

A related but orthogonal concept is partitions. Partitions describe the physical division of records across workers during execution. When reading from HDFS, the number of HDFS blocks determines the number of partitions in your RDD. When you apply a reduce-like transformation, you can optionally specify the number of partitions (otherwise Spark applies a default); in this case, the number of partitions is equal to the number of reducers.
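To see how records end up in partitions after a reduce-like transformation, the sketch below imitates hash partitioning on plain Scala collections: a key goes to the partition given by its hash code modulo the number of partitions, adjusted to be non-negative. This mirrors the idea behind Spark's default hash partitioner, but the code is a stand-alone illustration with hypothetical names, assumes non-null keys, and does not call Spark.

```scala
object PartitionSketch {
  // Partition index for a key: hashCode modulo numPartitions, kept non-negative.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Groups records by the partition their key hashes to.
  def partition[K, V](records: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    records.groupBy { case (k, _) => partitionOf(k, numPartitions) }
}
```

All records with the same key land in the same partition, which is what lets one reduce task see every value for that key. Passing the --reducers value as the partition count of the reduce-like transformation is what ties the flag to the number of reduce tasks.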

PMI

Your starting points for PMI computations in Spark should be your solutions to assignment 1. Write two programs, PairsPMI and StripesPMI, that go in package ca.uwaterloo.cs451.a2, in src/main/scala/ca/uwaterloo/cs451/a2/.

There are obviously going to be differences between the MapReduce and Spark implementations, but we want you to preserve the "spirit" of the "pairs" vs. "stripes" approach in your respective implementations. That is, the pairs implementation keeps track of each co-occurrence count independently, while the stripes implementation groups all co-occurring terms with respect to a term. If you have questions, please ask.
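The difference in output shape between the two approaches can be sketched on plain Scala collections, starting from counts that are assumed to be already computed. The base-10 PMI formula, the line-based probability estimates, and the reading of --threshold as a minimum co-occurrence count are assumptions here; defer to the definitions in assignment 1 wherever they differ. All names are hypothetical.

```scala
object PmiSketch {
  // Assumed definition: PMI(x, y) = log10( P(x, y) / (P(x) * P(y)) ),
  // with each probability estimated as a count divided by the number of lines.
  def pmi(pairCount: Long, xCount: Long, yCount: Long, totalLines: Long): Double = {
    val n = totalLines.toDouble
    math.log10((pairCount / n) / ((xCount / n) * (yCount / n)))
  }

  // Pairs style: every co-occurring pair is scored as its own record.
  def pairs(pairCounts: Map[(String, String), Long],
            wordCounts: Map[String, Long],
            totalLines: Long,
            threshold: Long): Map[(String, String), Double] =
    pairCounts.collect {
      case ((x, y), c) if c >= threshold =>
        (x, y) -> pmi(c, wordCounts(x), wordCounts(y), totalLines)
    }

  // Stripes style: the same scores, grouped into one map per left-hand term.
  def stripes(pairCounts: Map[(String, String), Long],
              wordCounts: Map[String, Long],
              totalLines: Long,
              threshold: Long): Map[String, Map[String, Double]] =
    pairs(pairCounts, wordCounts, totalLines, threshold)
      .groupBy { case ((x, _), _) => x }
      .map { case (x, scored) => x -> scored.map { case ((_, y), v) => y -> v } }
}
```

In a Spark job, a small read-only lookup table like wordCounts would be a natural candidate for a broadcast variable, since every task that scores a pair needs it.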

We are going to run your code in the Linux student CS environment as follows (we will make sure the collection is there):

$ spark-submit --class ca.uwaterloo.cs451.a2.PairsPMI \
   target/assignments-1.0.jar --input data/Shakespeare.txt \
   --output cs451-lintool-a2-shakespeare-pmi-pairs --reducers 5 --threshold 10

$ spark-submit --class ca.uwaterloo.cs451.a2.StripesPMI \
   target/assignments-1.0.jar --input data/Shakespeare.txt \
   --output cs451-lintool-a2-shakespeare-pmi-stripes --reducers 5 --threshold 10

We are going to run your code on the Datasci cluster as follows (we'll use the same simple Wikipedia collection at /data/cs451/simplewiki-20180901-sentences.txt from assignment 1):

$ spark-submit --class ca.uwaterloo.cs451.a2.PairsPMI \
   --num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
   --input /data/cs451/simplewiki-20180901-sentences.txt \
   --output cs451-lintool-a2-wiki-pmi-pairs --reducers 8 --threshold 10

$ spark-submit --class ca.uwaterloo.cs451.a2.StripesPMI \
   --num-executors 2 --executor-cores 4 --executor-memory 24G target/assignments-1.0.jar \
   --input /data/cs451/simplewiki-20180901-sentences.txt \
   --output cs451-lintool-a2-wiki-pmi-stripes --reducers 8 --threshold 10

Hints:

Use broadcast variables.
Spark accumulators may be helpful.

Turning in the Assignment

Please follow these instructions carefully!

All implementations should be in package ca.uwaterloo.cs451.a2; your Scala code should be in src/main/scala/ca/uwaterloo/cs451/a2/. There are no questions to answer in this assignment. If there is something you would like to communicate to us, put it in assignment2.md.

When grading, we will pull your repo and build your code:

$ mvn clean package

And run using exactly the commands above. Make sure that your code runs in the Linux Student CS environment (even if you do development on your own machine), which is where we will be doing the grading. "But it runs on my laptop!" will not be accepted as an excuse if we can't get your code to run.

Specifically, we will clone your repo and use the check scripts below:

check_assignment2_public_linux.py in the Linux Student CS environment.
check_assignment2_public_datasci.py on the Datasci cluster.

When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we would recommend that you verify everything above works by performing a clean clone of your repo and going through the steps above.

That's it!

Grading

This assignment is worth a total of 40 points, broken down as follows:

The bigram relative frequency implementations are worth 20 points total, 5 points for each of the following conditions: pairs on Linux Student CS, pairs on the Datasci cluster, stripes on Linux Student CS, stripes on the Datasci cluster.
The PMI implementations are worth 20 points total, 5 points for each of the following conditions: pairs on Linux Student CS, pairs on the Datasci cluster, stripes on Linux Student CS, stripes on the Datasci cluster.

There are no points explicitly for hidden test cases: the points are folded into the distribution above.

Reference Running Times

To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.

Class name	Running time Linux	Running time Datasci
ComputeBigramRelativeFrequencyPairs	30 seconds	6 minutes 30 seconds
ComputeBigramRelativeFrequencyStripes	25 seconds	2 minutes 30 seconds
PairsPMI	45 seconds	10 minutes
StripesPMI	25 seconds	3 minutes

Assignments Data-Intensive Distributed Computing (Winter 2019)
