Data-Intensive Distributed Computing

Schedule

Part	Description	Dates	CS 451/651 Assignments	CS 431/631 Assignments
1	MapReduce Algorithm Design	Jan 8, 10, 15, 17	A0: Jan 17
2	From MapReduce to Spark	Jan 22, 24	A1: Jan 24	A0: Jan 22
3	Analyzing Text	Jan 29, 31	A2: Jan 31	A1: Jan 29
4	Analyzing Graphs	Feb 5, 7	A3: Feb 7	A2: Feb 12
5	Analyzing Relational Data	Feb 12, 14, 26		A3: Feb 23
6	Data Mining and Machine Learning	Feb 28, Mar 5, 7, 12	A4: Feb 28
7	Mutable State	Mar 14, 19	A5: Mar 14	A4: Mar 14
8	Analyzing Graphs, Redux	Mar 21, 26
9	Real-Time Analytics	Mar 28, Apr 2	A6: Mar 28	A5: Apr 2
10	Looking Ahead	Apr 4	A7: Apr 4

Part 1: MapReduce Algorithm Design Jan 8, 10, 15, 17

Topics

What's this course about?
Why big data?
The datacenter is the computer and other "big ideas"
MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, sorting, and monoids

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF Part 1a: January 8

PPTX (Mac) PDF Part 1b: January 10

PPTX (Mac) PDF Part 1c: January 15

PPTX (Mac) PDF Part 1d: January 17

Part 2: From MapReduce to Spark Jan 22, 24

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is a bit outdated since it covers Spark 1.3; we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for resources; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF Part 2a: January 22

PPTX (Mac) PDF Part 2b: January 24

Part 3: Analyzing Text Jan 29, 31

Topics

Language models and machine translation
Inverted indexing and search

Readings

Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PPTX (Mac) PDF Part 3a: January 29

PPTX (Mac) PDF Part 3b: January 31

Part 4: Analyzing Graphs Feb 5, 7

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PPTX (Mac) PDF Part 4a: February 5

PPTX (Mac) PDF Part 4b: February 7

Part 5: Analyzing Relational Data Feb 12, 14, 26

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards
Chaudhuri et al. (2011) An overview of business intelligence technology, CACM, 54(8):88-98.

Slides

PPTX (Mac) PDF Part 5a: February 12

PPTX (Mac) PDF Part 5b: February 14

PPTX (Mac) PDF Part 5c: February 26

Part 6: Data Mining and Machine Learning Feb 28, Mar 5, 7, 12

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash, random projections, etc.
Clustering: k-means, Gaussian mixture models

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)
Jimmy Lin and Dmitriy Ryaboy. Scaling Big Data Mining Infrastructure: The Twitter Experience, SIGKDD Explorations, 14(2):6-19, 2012.

Slides

PPTX (Mac) PDF Part 6a: February 28

PPTX (Mac) PDF Part 6b: March 5

PPTX (Mac) PDF Part 6c: March 7

PPTX (Mac) PDF Part 6d: March 12

Part 7: Mutable State Mar 14, 19

Topics

Bigtable/HBase: Log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs

Readings

The original Bigtable paper.
The original DHT paper.
Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design, Computer, 45(2):37-42, 2012.

Slides

PPTX (Mac) PDF Part 7a: March 14

PPTX (Mac) PDF Part 7b: March 19

Part 8: Analyzing Graphs, Redux Mar 21, 26

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)
Alternative approaches: GraphX

Readings

Sherif Sakr. Large-Scale Graph Processing Systems, 2016.

Slides

PPTX (Mac) PDF Part 8a: March 21

PPTX (Mac) PDF Part 8b: March 26

Part 9: Real-Time Analytics Mar 28, Apr 2

Topics

Stream processing semantics, issues, and frameworks
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
Integrating batch and stream processing

Readings

Zaharia et al. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.
Kulkarni et al. Twitter Heron: Stream Processing at Scale, SIGMOD 2015.
Apache Beam: The world beyond batch: Streaming 101, Streaming 102.
If you're interested, here's my rant about the Lambda and Kappa architectures.

Slides

PPTX (Mac) PDF Part 9a: March 28

PPTX (Mac) PDF Part 9b: April 2

Part 10: Looking Ahead Apr 4

Slides

PPTX (Mac) PDF Bonus: April 4

Syllabus Data-Intensive Distributed Computing (Winter 2019)
