CS 222 -- Algorithms at the End of the Wire

Handouts and Class Materials

Announcements

[10/24] Please remember to sign into the mock trial system and put in paper preferences. See timeline in Assignments below.
[9/8] Remember Monday's class 9/12 will be on Zoom. The link is available in Canvas. Class will be in person 9/14. 9/19 is not clear at this point; I may switch things around and have a class reading + video instead if I can't make it back.
[9/8] People have asked about projects. I'm purposely not getting into details yet -- I'd like people to read papers, think a bit about possible topics, etc. -- but we will start discussing at the end of the month. In broad strokes you will first provide a project proposal (providing your team, a description of what you want to do for your project, a preliminary look at recent work), and the end result will be a paper, roughly 15-20 pages in length. I'm putting up the project description from last year so that you can have a look if you like -- it will be essentially the same (dates will change). See assignments below.
[8/25] Read the Bloom filter paper for sure before the first class.
[7/30] We will plan to use Gradescope+Canvas to turn in assignments. Please make sure you are signed up to Gradescope+Canvas. (This probably won't be set up for a couple weeks into the class.)

Basic Class Information

Assignments

[10/26] Here is Assignment 2. Due December 1.
[10/24] I will be making assignments for the mock PC tomorrow evening. You are expected to have signed into hotcrp (please put your name in your account settings if it's not already set up) and given paper preferences before then; you will have a reduced grade if you have not done so.
Timeline for the mock PC: reviews due November 16. November 16-23 will be online discussion period. Mock PC happens in class 11/28 and 11/30.
[9/26] Here are some (recent) old projects that became papers. Do keep in mind that these are after a bunch of additional work -- this is a lot more polished and a lot more work went in after the project phase!
[9/26] Here is Assignment 1. Due October 21.
[9/10] I've updated this to the 2022 project description document (a draft, subject to change, but probably final). I think the dates are right.
[7/30] No assignments currently, but you might check out the class pre-readings below.

Readings

All reading dates are tentative and will be confirmed in class. We may go faster, we may go slower. We may have to move things for other reasons. You should consider the assigned papers a minimum of what you should be reading for this class. Feel free to explore on the Web or otherwise (and additional suggested readings for each topic will be listed as well). We are just touching the surface of these topics; there's much more out there.

Unit 0: Fun Stuff to Start us off

Please read all these, preferably before class begins, to see if you're interested.

Class 0: Network Applications of Bloom Filters: A Survey. by Broder and Mitzenmacher. This will be our first paper, covering background material.
Class 0: How to read a research paper notes.
Class 0: On your own background, make sure you know Markov chains.
Start with Wikipedia. The fourth external link is to a useful book chapter on Markov chains.

Unit 1: Data Sketches (and Using Predictions)

Class 1 [8/31]: Class Introduction (syllabus, expectations, etc.)
Class 1 [8/31]: Network Applications of Bloom Filters: A Survey. by Broder and Mitzenmacher.
Class Null [9/5]: Labor Day Holiday.
Class 2 [9/7]: Approximate counting sketches
Class 2 [9/7]: New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice by Estan and Varghese.
Class 2 [9/7]: The count-min sketch and its applications by Cormode and Muthukrishnan.
- Discussion Questions: Compare and contrast the mice/elephants paper and the count-min sketch paper. How do they describe and define the underlying problem(s) they are considering? How do they formalize their solution(s)? How do they compare?
Class 3 [9/12]: Algorithms with Predictions Introduction
Class 3 [9/12]: The Case for Learned Index Structures by Kraska, Beutel, Chi, Dean, and Polyzotis.
Class 3 [9/12]: Algorithms with Predictions by Mitzenmacher and Vassilvitskii.
- Discussion Questions: Explain, in your words, how learning can be used to improve algorithms and data structures. What are the goals of this approach? What are the possible benefits, and possible pitfalls? What problems seem amenable to this type of attack?
Class 4 [9/14]: Range Filters (Eric Knorr lecture)
Class 4 [9/14]: Approximate Range Emptiness in Constant Time and Optimal Space by Goswami, Grønlund, Larsen, and Pagh.
Class 4 [9/14]: Proteus: A Self-Designing Range Filter by Knorr, Lemaire, Lim, Luo, Zhang, Idreos, and Mitzenmacher.
- Discussion Questions: Explain how to use an encoding argument to achieve data structure lower bounds. Explain how a worst-case lower bound can be avoided by a system such as Proteus.
Class 5 [9/19]: Still Working On It [To Be Filled in Shortly]. Will confirm the below.
Class 5 [9/19]: A Brief History of Generative Models for Power Law and Lognormal Distributions. by Mitzenmacher.
Class 5 [9/19]: Power-Law Distributions in Empirical Data. by Clauset, Shalizi, and Newman.
Class 5 [9/19]: NOT REQUIRED, ADDITIONAL READING IF INTERESTED. Scale-free networks are rare by Broido and Clauset.
Class 5 [9/19]: NOT REQUIRED, ADDITIONAL READING IF INTERESTED. Editorial: The Future of Power Law Research. by Mitzenmacher.
- Discussion Questions: Explain some of the controversy behind power-law/scale-free networks in research work. Where can you go wrong when you say, "Here is a power law" in a paper? How instead can you go right? Why is finding a power law interesting (or not)?

Additional useful papers/places for Unit 1

Algorithms with Predictions Metabibliography
Streaming, Sketching and Sufficient Statistics I [*** first 40 minutes, through Count-Min sketch] by Graham Cormode.
Cuckoo hashing for undergraduates by Pagh.
The Bloomier Filter by Chazelle, Kilian, Rubinfeld, and Tal.
Compressed Bloom Filters by Mitzenmacher.
Xor Filters: Faster and Smaller Than Bloom and Cuckoo Filters by Graf and Lemire.
Encoding Arguments by Morin, Mulzer, and Reddad.
SuRF: Practical Range Query Filtering with Fast Succinct Tries by Zhang, Lim, Leis, Andersen, Kaminsky, Keeton, Pavlo.
Rosetta: A Robust Space-Time Optimized Range Filter for Key-Value Stores by Luo, Chatterjee, Ketsetsidis, Dayan, Qin, and Idreos.
Graham Cormode's page -- lots of papers on streaming.
My past student Justin Thaler's page -- lots of papers on streaming.
Andrew McGregor's page -- lots of papers on streaming.

We're not really "done" with unit one yet, but I'd like to introduce some additional topics early. We'll get back to data sketches again as the class continues.

Unit 2: Link Information and Web History

Class 6 [9/21]: Links as information
Class 6 [9/21]: Authoritative Sources in a Hyperlinked Environment. by Jon Kleinberg.
Class 6 [9/21]: The PageRank Citation Ranking: Bringing Order to the Web. by Page, Brin, Motwani, and Winograd.
- Discussion Questions: How does PageRank differ from Kleinberg's algorithm? How is it the same? Can you think of ways to improve Kleinberg's algorithm, or PageRank?
Class 7 [9/26]: Studying the Web
Class 7 [9/26]: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. by Bharat and Henzinger.
Class 7 [9/26]: Analysis of a Very Large AltaVista Query Log. by Henzinger, Marais, Moricz, and Silverstein.
- Discussion Questions: What techniques are suggested for improving on Kleinberg's algorithm? Do they appear worthwhile given the costs? What sort of data do the authors try to mine from the query log? Which seems the most useful? Can you think of anything they should be looking for but did not?

Class 8 [9/28]: On the Importance of Links
Class 8 [9/28]: Rank Aggregation Methods for the Web. by Dwork, Kumar, Naor, and Sivakumar. An alternative version.
Class 8 [9/28]: The Link Prediction Problem for Social Networks by Liben-Nowell and Kleinberg.
- Discussion Questions: Define Spearman and Kendall distance. Why and how are Markov chains useful for combining rankings? What, from the paper, are the best methods for link prediction? Can you think of additional methods they might not have tried? How might you improve on their methodology?

Additional useful papers/places for Unit 2

Graph Structure in the Web by Broder et al.
The Link Database: Fast Access to Graphs of the Web by Randall et al.
The Eigentrust Algorithm for Reputation Management in P2P Networks by Kamvar, Schlosser, and Garcia-Molina.
Trust-Based Recommendation Systems by Andersen et al.
PicASHOW: Pictorial Authority Search by Hyperlinks on the Web. by Lempel and Soffer.
The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect by Lempel and Moran.
Power Laws, Pareto Distributions, and Zipf's Law. by Mark Newman.
On Power-Law Relationships of the Internet Topology. by Faloutsos, Faloutsos, and Faloutsos.
The Anatomy of the Long Tail. by Goel, Broder, Gabrilovich, and Pang.

Unit 3: Compression and Basic Information Theory

Class 9 [10/3]: We will be covering the basics of compression and information theory, including Huffman coding, arithmetic coding, LZ-style coding, etc. Some good online introductions to the material include:

Information Theory, Inference, and Learning Algorithms, specifically part 1 (Data Compression) of MacKay's book. (Though the whole book is good.)
Introduction to Data Compression, notes by Guy Blelloch.

Class 9 [10/3]: No discussion questions today! For class today the plan is that you review the Blelloch notes on compression, and I will lecture/we'll do problems in class. We will be focusing on Sections 1-3 and 5 of the Blelloch notes for this class. We will also be covering other things from the notes (Burrows-Wheeler, JPEG/MPEG, etc.) with papers coming up, so the notes make a good reference; feel free to read the rest if you like.

Class 10 [10/5]: On Compressing Social Networks, by F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan.
Class 10 [10/5]: Permuting Web and Social Graphs, by P. Boldi, M. Santini, and S. Vigna.
- Discussion questions: How are compressing social networks and the web alike, and different? What role does having an underlying model play in determining how to compress these types of structures? What properties of the model(s) appear important?

Class 11 [10/12]: Compression from Transformation
Class 11 [10/12]: A Block-Sorting Lossless Data Compression Algorithm. by Burrows and Wheeler.
Class 11 [10/12]: A perhaps easier-to-read description can be found at Data Compression with the Burrows-Wheeler Transform. by Mark Nelson.
Class 11 [10/12]: An old but useful paper on JPEG.
Class 11 [10/12]: You may also want to look at Wikipedia's take on JPEG.
- Discussion questions: These papers both follow a theme of transforming the data before actually compressing it. Discuss the different types of transformations used; what properties do they have, and how are they important for both compression and efficiency of compression?

Class 12 [10/17]: MPEG, and DASH
Class 12 [10/17]: MPEG: A video compression standard for multimedia applications
Class 12 [10/17]: The MPEG-DASH standard for multimedia streaming over the internet.
Class 12 [10/17]: Plenty of stuff online to look at if you want to add to this, such as descriptions of MPEG-DASH and HLS here and here.
- Discussion questions: No discussion for today; read the papers, but focus on project proposals!

Class 13 [10/19]: Compression and ML
Class 13 [10/19]: Weightless: Lossy weight encoding for deep neural network compression
Class 13 [10/19]: DRIVE: One-Bit Distributed Mean Estimation
Class 13 [10/19]: You do not have to read them, but if you find yourselves interested, there are follow-up papers EDEN: Communication-Efficient and Robust Distributed Mean Estimation for Federated Learning and QUIC-FL: Quick Unbiased Compression for Federated Learning
- Discussion questions: Explain why lossy compression is particularly useful in the setting of machine learning systems.

Class 14 [10/24]: Program Committee Preparation
Class 14 [10/24]: How Not to Review a Paper. by Cormode. This is just for general background, you will not need to write up anything for discussion, but look over before today's class.
Class 14 [10/24]: Thoughts on Reviewing. by Allman. This is just for general background, you will not need to write up anything for discussion, but look over before today's class.

Class 15 [10/26]: Distributed Computing / MapReduce
Class 15 [10/26]: MapReduce: Simplified Data Processing on Large Clusters by Dean and Ghemawat.
Class 15 [10/26]: A Model of Computation for MapReduce by Karloff, Suri and Vassilvitskii.
Class 15 [10/26]: DeWitt and Stonebraker's take on MapReduce.
- Discussion questions: Compare and contrast the theory and practice of the MapReduce paradigm. What kind of tasks might MapReduce not be good for? Are there any eventual scaling problems for this paradigm? Suppose you had access to a large-scale MapReduce system -- what would you most want to use it for? What do you think of the criticism of MapReduce?

Class 16 [10/31]: Hashing 1 (Hashing Properties)
Class 16 [10/31]: Min-wise independent permutations by Broder, Charikar, Frieze, and Mitzenmacher. Note: just read the first few sections if you want, it's hard going.
Class 16 [10/31]: Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream by Vadhan and Mitzenmacher. (Full version is this paper.)
- Discussion questions: What are we looking for in a good hash function? What are the sort of properties we might want? How might it depend on the context?

Class 17 [11/02]: Hashing 2: Effective Hashing
Class 17 [11/02]: High Speed Hashing for Integers and Strings by Thorup.
Class 17 [11/02]: Entropy-Learned Hashing: Constant Time Hashing with Controllable Uniformity by Hentschel, Sirin, and Idreos.
- Discussion questions: What are the issues in designing high-speed practical hash functions? What theoretical issues arise?

Class 18 [11/07]: Dasher
Class 18 [11/07]: Please see the talk on Dasher here. (I'd recommend seeing the talk first and then reading the paper below, but do whatever order you like.)
Class 18 [11/07]: Dasher: a Data Entry Interface Using Continuous Gestures and Language Models by Ward, Blackwell, and MacKay.
Class 18 [11/07]: Please also read How to Give a Bad Talk. and How to Give a Good Talk.
- Discussion questions for 11/7: Briefly explain the connection between Dasher and compression, particularly arithmetic coding. Did you think the presentation was a good talk or a bad talk? Give reasons.

Class 19 [11/09]: Guest Speaker Andrei Broder, 8pm, on office hour Zoom link (see class announcements for link).
Class 19 [11/09]: Graph Structure in the Web by Andrei Broder and many others.
Class 19 [11/09]: On the resemblance and containment of documents
Class 19 [11/09]: A Note on Double Pooling Tests.
- Discussion questions for 11/9: How does the shape of the Web graph match or not match your intuitions? Do you think it has changed since this study, and how? Do you think the resemblance mechanism described in the resemblance and containment paper would work for plagiarism detection? How might someone try to avoid being detected as plagiarizing, and do you think the algorithm could be extended to handle your approach for avoiding detection? What do you think about the concept of double pooling?

Class 20 [11/14]: No class; you are to be working to complete your reviews for the Mock Program Committee; all reviews should be turned in by 11/16.

Class 21 [11/16]: Guest speaker Kapil Vaidya. (Class will be in-person live as always -- come to class!)
Class 21 [11/16]: Partitioned Learned Bloom Filter.
Class 21 [11/16]: SNARF: A Learning-Enhanced Range Filter.
- Discussion questions for 11/16: How do these works improve on or differ from previous data structures you have seen for these filtering problems? Also: Kapil is a very recent PhD graduate, now working at Amazon. Please think of a question you'd like to ask him, either about the process of getting a PhD, or what it's like moving into post-graduate work.

Class 22 [11/21]: Class cancelled for Thanksgiving week. (No class 11/23 either.) You are expected to use this time to participate in online discussions for the Mock Program Committee.

Class 23 [11/28]: Mock Program Committee, Day 1; Come prepared to talk about your papers

Class 24 [11/30]: Mock Program Committee, Day 2; Come prepared to talk about your papers

Additional useful papers/places for Unit 3

Description of Brotli
Description of VP8
Description of WebP
Towards Compressing Web Graphs by Adler and Mitzenmacher.
The Webgraph Framework 1: Compression Techniques by Boldi and Vigna.
On the Implementation of Minimum Redundancy Prefix Codes by Moffat and Turpin.
Check out the WebGraph home page.