Schedule

Part | Description                      | Dates                 | CS 451/651 Assignments | CS 431/631 Assignments
1    | MapReduce Algorithm Design       | Jan 8, 10, 15, 17     | A0: Jan 17             | -
2    | From MapReduce to Spark          | Jan 22, 24            | A1: Jan 24             | A0: Jan 22
3    | Analyzing Text                   | Jan 29, 31            | A2: Jan 31             | A1: Jan 29
4    | Analyzing Graphs                 | Feb 5, 7              | A3: Feb 7              | A2: Feb 12
5    | Analyzing Relational Data        | Feb 12, 14, 26        | -                      | A3: Feb 23
6    | Data Mining and Machine Learning | Feb 28, Mar 5, 7, 12  | A4: Feb 28             | -
7    | Mutable State                    | Mar 14, 19            | A5: Mar 14             | A4: Mar 14
8    | Analyzing Graphs, Redux          | Mar 21, 26            | -                      | -
9    | Real-Time Analytics              | Mar 28, Apr 2         | A6: Mar 28             | A5: Apr 2
10   | Looking Ahead                    | Apr 4                 | A7: Apr 4              | -

Part 1: MapReduce Algorithm Design Jan 8, 10, 15, 17

Topics

  • What's this course about?
  • Why big data?
  • The datacenter is the computer and other "big ideas"
  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, sorting, and monoids
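
As a rough illustration of intermediate aggregation, the sketch below simulates word count in plain Python rather than with the Hadoop API. The function names (`map_phase`, `combine`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each input string stands in for one mapper's input split. Local combining is safe here because integer addition is associative with identity 0, i.e. the counts form a monoid.

```python
from collections import defaultdict


def map_phase(doc):
    """Mapper: emit a (word, 1) pair for every token in the document."""
    return [(word, 1) for word in doc.split()]


def combine(pairs):
    """Combiner: sum the counts locally, on one mapper's output only."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def shuffle(pairs):
    """Group all intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the (possibly pre-aggregated) counts for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(docs, use_combiner=True):
    """Return (final counts, number of intermediate pairs shuffled)."""
    intermediate = []
    for doc in docs:  # each doc plays the role of one mapper's input
        pairs = map_phase(doc)
        if use_combiner:
            pairs = combine(pairs)
        intermediate.extend(pairs)
    return reduce_phase(shuffle(intermediate)), len(intermediate)
```

Calling `word_count(["a b a", "b b c"])` with and without the combiner gives the same counts, but the combiner version shuffles fewer intermediate pairs.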

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF   Part 1a: January 8

PPTX (Mac) PDF   Part 1b: January 10

PPTX (Mac) PDF   Part 1c: January 15

PPTX (Mac) PDF   Part 1d: January 17

Part 2: From MapReduce to Spark Jan 22, 24

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

  • Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is somewhat outdated: it covers Spark 1.3, whereas we're using Spark 2.1. All the material in the book can be found online, but it is scattered across many sources and you'll have to hunt for it; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF   Part 2a: January 22

PPTX (Mac) PDF   Part 2b: January 24

Part 3: Analyzing Text Jan 29, 31

Topics

  • Language models and machine translation
  • Inverted indexing and search
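
To make the inverted indexing topic concrete, here is a minimal single-machine sketch in Python. It assumes whitespace tokenization, lowercasing, and Boolean AND retrieval; the function names `build_inverted_index` and `search` are hypothetical, and a distributed version would build the postings lists in a reducer instead.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of (doc_id, term_frequency).

    docs: dict mapping doc_id -> text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}


def search(index, query):
    """Boolean AND retrieval: ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = None
    for term in terms:
        doc_ids = {doc_id for doc_id, _ in index.get(term, [])}
        result = doc_ids if result is None else result & doc_ids
    return sorted(result)
```

Sorting the postings by document id mirrors the layout that makes merging postings lists efficient, although this sketch simply intersects sets.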

Readings

Slides

PPTX (Mac) PDF   Part 3a: January 29

PPTX (Mac) PDF   Part 3b: January 31

Part 4: Analyzing Graphs Feb 5, 7

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions
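
The PageRank topic can be sketched as a simple iterative computation. The version below is a single-machine Python sketch under stated assumptions: the graph is an adjacency-list dict in which every neighbor also appears as a key, the damping factor defaults to 0.85, and the mass of dangling nodes is spread uniformly over all nodes. The name `pagerank` and its parameters are illustrative only.

```python
def pagerank(graph, damping=0.85, iterations=20):
    """Iterative PageRank over graph: {node: [outgoing neighbors]}.

    Assumes every neighbor is itself a key of graph.
    """
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        contribs = {node: 0.0 for node in graph}
        dangling = 0.0
        for node, neighbors in graph.items():
            if neighbors:
                # Divide this node's rank evenly among its out-links.
                share = ranks[node] / len(neighbors)
                for dst in neighbors:
                    contribs[dst] += share
            else:
                # No out-links: collect the mass and redistribute it below.
                dangling += ranks[node]
        ranks = {
            node: (1.0 - damping) / n
            + damping * (contribs[node] + dangling / n)
            for node in graph
        }
    return ranks
```

Each pass over the adjacency lists corresponds to one iteration of a dataflow job: the inner loop plays the role of the map side (emitting contributions) and the final dict comprehension plays the role of the reduce side (summing them per node).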

Readings

Slides

PPTX (Mac) PDF   Part 4a: February 5

PPTX (Mac) PDF   Part 4b: February 7

Part 5: Analyzing Relational Data Feb 12, 14, 26

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)
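
One way relational processing maps onto MapReduce is the reduce-side join, sketched below in plain Python. This is an illustration under assumptions, not a Hadoop or Spark implementation: rows are dicts, the grouping dict stands in for the shuffle, and the name `reduce_side_join` and its parameters are hypothetical.

```python
from collections import defaultdict


def reduce_side_join(left, right, left_key, right_key):
    """Inner equi-join of two lists of dict rows, grouped by join key."""
    # "Map + shuffle": tag each row by its source and group by join key.
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)

    # "Reduce": for each key, emit the cross product of the two sides.
    joined = []
    for key in sorted(groups):
        left_rows, right_rows = groups[key]
        for left_row in left_rows:
            for right_row in right_rows:
                joined.append({**left_row, **right_row})
    return joined
```

Keys that appear on only one side produce no output, which gives inner-join semantics.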

Readings

Slides

PPTX (Mac) PDF   Part 5a: February 12

PPTX (Mac) PDF   Part 5b: February 14

PPTX (Mac) PDF   Part 5c: February 26

Part 6: Data Mining and Machine Learning Feb 28, Mar 5, 7, 12

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash, random projections, etc.
  • Clustering: k-means, Gaussian mixture models
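
As a small illustration of the logistic regression and gradient descent topics, the sketch below trains a binary classifier with per-example gradient updates in plain Python. It makes simplifying assumptions: examples are visited in a fixed order rather than sampled at random (to keep the result deterministic), there is no regularization, and the names `train_sgd` and `predict` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, learning_rate=0.1, epochs=100):
    """Fit weights w and bias b on examples: list of (features, label).

    Labels are 0 or 1; features are equal-length lists of floats.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = p - y  # gradient of the log loss w.r.t. the score
            w = [wi - learning_rate * error * xi for wi, xi in zip(w, x)]
            b -= learning_rate * error
    return w, b


def predict(w, b, x):
    """Return the predicted label (0 or 1) for feature vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0
```

Because each update touches only one example, the same loop structure extends naturally to streaming over data that does not fit in memory.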

Readings

Slides

PPTX (Mac) PDF   Part 6a: February 28

PPTX (Mac) PDF   Part 6b: March 5

PPTX (Mac) PDF   Part 6c: March 7

PPTX (Mac) PDF   Part 6d: March 12

Part 7: Mutable State Mar 14, 19

Topics

  • Bigtable/HBase: log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs
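
The idea behind log-structured merge trees can be sketched with a toy key-value store: writes go to an in-memory table, full tables are flushed as immutable sorted runs, and reads consult the newest data first. The class `TinyLSM` below is a hypothetical teaching sketch in Python, not how Bigtable or HBase is implemented; it omits the write-ahead log, deletes, and on-disk storage.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []  # sorted lists of (key, value); newest run is last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        """Write to the memtable, flushing it once it reaches the limit."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            # (key,) sorts immediately before any (key, value) entry.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping the newest value per key."""
        merged = {}
        for run in self.runs:  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

Compaction trades background work for faster reads, since a lookup then has fewer runs to search.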

Readings

Slides

PPTX (Mac) PDF   Part 7a: March 14

PPTX (Mac) PDF   Part 7b: March 19

Part 8: Analyzing Graphs, Redux Mar 21, 26

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
  • Alternative approaches: GraphX
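
The "think like a vertex" model can be illustrated with single-source shortest paths computed in supersteps. The Python sketch below is a single-process simulation under assumptions (non-negative edge weights, every neighbor present as a key in the graph); it does not use the Giraph or GraphX APIs, and the name `bsp_shortest_paths` is hypothetical.

```python
def bsp_shortest_paths(graph, source):
    """Vertex-centric shortest paths in bulk synchronous supersteps.

    graph: {vertex: [(neighbor, weight), ...]} with non-negative weights.
    Returns (distances, number_of_supersteps).
    """
    dist = {vertex: float("inf") for vertex in graph}
    inbox = {source: [0]}  # messages to be delivered in the next superstep
    supersteps = 0
    while inbox:
        outbox = {}
        # Each vertex with messages runs its compute step independently.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < dist[vertex]:
                dist[vertex] = best
                for neighbor, weight in graph[vertex]:
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Barrier: messages sent now become visible in the next superstep.
        inbox = outbox
        supersteps += 1
    return dist, supersteps
```

The computation halts when no vertex sends a message, which corresponds to every vertex voting to halt.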

Readings

Slides

PPTX (Mac) PDF   Part 8a: March 21

PPTX (Mac) PDF   Part 8b: March 26

Part 9: Real-Time Analytics Mar 28, Apr 2

Topics

  • Stream processing semantics, issues, and frameworks
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
  • Integrating batch and stream processing
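
Of the probabilistic data structures listed above, the Bloom filter is the simplest to sketch. The Python class below is a minimal illustration under assumptions: it derives its hash functions by salting SHA-256 with a seed, stores one byte per bit for readability, and uses arbitrary default sizes. The class name and parameters are hypothetical.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        """Yield one bit position per hash function for the given item."""
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position associated with the item."""
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        """True if the item may be present; False if definitely absent."""
        return all(self.bits[position] for position in self._positions(item))
```

After `bf.add("spark")`, the test `"spark" in bf` always succeeds, whereas an item that was never added is usually, but not always, reported as absent.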

Readings

Slides

PPTX (Mac) PDF   Part 9a: March 28

PPTX (Mac) PDF   Part 9b: April 2

Part 10: Looking Ahead Apr 4

Slides

PPTX (Mac) PDF   Bonus: April 4
