- [8/1] I'm told the course videos are at https://matterhorn.dce.harvard.edu/engage/ui/index.html#/2017/01/15042 . For right now they should be public.
- [7/30] We'll have a Piazza page up. Go to piazza.com/harvard/fall2016/cs222 to sign up, and piazza.com/harvard/fall2016/cs222/home for the Piazza page. If you cannot sign up, please send me an e-mail with the e-mail you would like to use for Piazza and I will add you manually. Both the Harvard and extension class will use the same Piazza page.
- [7/30] We will plan to use Canvas to turn in assignments. Please make sure you are signed up to Canvas. There should be (separate) Canvas systems set up for both CS222 and the Extension Class E-210.
- [7/30] We'll have a link to online lectures here as soon as possible.

- Assignment 1. Due Sept 29.
- Assignment 2. Due Oct 20.
- Project Description. Preliminary project writeup also due Oct 20.

- 9/1 Background: Review Markov chains.
- Start with wikipedia. The first external link is to a useful book chapter on Markov chains.

- Note: current plans are that I will be out, lecture given by Tom Morgan.
- 9/6: Authoritative Sources in a Hyperlinked Environment. by Jon Kleinberg.
- 9/6: The PageRank Citation Algorithm. by Brin, Page, Motwani, Winograd.
- Questions for 9/6: How does PageRank differ from Kleinberg's algorithm? How is it the same? Can you think of ways to improve Kleinberg's algorithm, or PageRank?

- Note: current plans are that I will be out. Check here for updates. If no class, please watch the following videos.
- Dasher: information-efficient text entry , by David MacKay.
- Streaming, Sketching and Sufficient Statistics I , [*** first 40 minutes, through Count-Min sketch] by Graham Cormode.

- 9/13: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. by Bharat and Henzinger.
- 9/13: Analysis of a Very Large Altavista Query Log. Henzinger, Marais, Moricz, and Silverstein.
- Questions for 9/13: What techniques are suggested for improving on Kleinberg's algorithm? Do they appear worthwhile given the costs?
- Questions for 9/13: What sort of data do they try to mine from the query log? Which seems the most useful? Can you think of anything they should be looking for but did not?

- 9/15: Rank Aggregation Methods for the Web. by Dwork, Kumar, Naor, Sivakumar. An alternative version .
- 9/15: The Link Prediction Probelm for Social Networks by Liben-Nowell and Kleinberg.
- Questions for 9/15: Define Spearman and Kendall distance. Why and how are Markov chains useful for combining rankings?
- Questions for 9/15: What, from the paper, are the best methods for link prediction? Can you think of additional methods they might not have tried? How might you improve on their methodology?

Additional useful papers for Unit 1:

- Graph Structure in the Web by Broder et al.
- The Link Database: Fast Access to Graphs of the Web by Randall et al.
- The Eigentrust Algorithm for Reputation Management in P2P Networks by Kamvar, Schlosser, and Garcia-Molina.
- Trust-Based Recommendation Systems by Andersen et al.
- PicASHOW: Pictorial Authority Search by Hyperlinks on the Web. by Lempel and Soffer.
- The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect by Lempel and Moran.
- Power Laws, Pareto Distributions, and Zipf's Law. by Mark Newman.
- On Power-Law Relationships of the Internet Topology. by Faloutsos, Faloutsos, and Faloutsos.
- Power-Law Distributions in Empirical Data. by Clauset, Shalizi, and Newman.
- The Anatomy of the Long Tail. by Goel, Broder, Gabrilovich, and Pang.
- Aggregating Inconsistent Information. by Ailon, Charikar, Newman.
- Editorial: The Future of Power Law Research. by Mitzenmacher.

- 9/20, and ongoing: We will be covering the basics of compression and
information theory, including Huffman coding, arithmetic coding, LZ-style coding, etc.
Some good online introductions to the material include:

Information Theory, Inference, and Learning Algorithms, specifically part 1 (Data Compression) of Mackay's Book. (Though the whole book is good.)

Introduction to Data Compression, notes by Guy Blelloch.

- 9/20: Note: for today, there's a talk and a paper. Hopefully you've seen the talk, if not now is the time.
- 9/20: Dasher : a Data Entry Interface Using Continuous Gestures and Language Models by Ward, Blackwell, and MacKay.
- 9/20: Please see the talk on Dasher here .
- Questions for 9/28: Explain the connection between Dasher and compression, particularly arithmetic coding.

- 9/22: On Compressing Social Networks, by F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan.
- 9/22: Permuting Web and Social Graphs, by P. Boldi, M. Santini, and S. Vigna.
- 9/22: You may at some point also went to check this paper Degree Sequence of a Scale-Free Random Graph Process, to get some of the models suggested in the previous paper. We'll look at other papers on this point later.
- Questions for 9/22: How are compressing social networks and the web alike, and different? What role does having an underlying model play in determining how to compress these types of structures? What properties of the model(s) appear important?

- 9/27: A Block-Sorting Lossless Data Compression Aglorithm. by Burrows and Wheeler.
- A perhaps easier-to-read description can be found at Data Compression with the Burrows-Wheeler Transform. by Mark Nelson.
- 9/27: Identifying Hierarchical Structure in Sequences: A Linear-Time Algorithm by Nevill-Manning and Witten.
- Questions for 9/27: These papers both follow a theme of transforming the data before actually compressing it. Discuss the different types of transformations used; what properties do they have, and how are they important for both compression and efficiency of compression?

- 9/29 : With the assignment due, no reading for today; however, I expect we'll still be going over Burrows-Wheeler and Sequitur for this class, so please be prepared.

- 10/4: An old but useful paper on JPEG.
- 10/4: Wikipedia's take on JPEG ; Wikipedia's take on JPEG 2000; Wikipedia's take on MPEG, especially MPEG-2 (but also 1 and 4).
- Questions for 10/4: Explain the DCT. How does JPEG 2000 differ from the original JPEG. Explain the similarities and differences between MPEG and JPEG. Discuss the reasons for different types of frames.

Additional useful papers for Unit 2:

- Towards Compressing Web Graphs by Adler and Mitzenmacher.
- The Webgraph Framework 1: Compression Techniques by Boldi and Vigna.
- Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection by Kumar, Dharmapurikar, Yu, Crowley, Turner.
- On the Hardness of Finding Multiple Preset Dictionaries by Mitzenmacher
- On the Implementation of Minimum Redundancy Prefix Codes by Moffat and Turpin.
- Check out the WebGraph home page.
- Approximating Algorithmss for Grammar Based Compression by Lehman and Shelat.
- Approximating the Smallest Grammar by many
- Succincter by Patrascu.
- Changing Base Without Losing Space by Dodis, Patrascu, Thorup.

- As a general resource, my colleague Justin Thaler is teaching a course on streaming algorithms this semester; this class webpage is a fantastic resource for information on this big topic. Start here!

- 10/6 Data Streams: Algorithms and Applications. by Muthukrishnan.
- 10/6 Computing on Data Streams by Henzinger, Raghavan, and Rajagopalan.
- Questions for 10/2: How is computing on a data stream different from a "standard" input-output algorithm? What are the goals, how do you measure resource usage, what sorts of problems are considered? Where has computing on data streams had successes? What lower bound techniques are there showing that computing on data streams can be hard?

- 10/11: New directions in traffic measurement and accounting: focusing on the elephants, ignoring the mice by Estan and Varghese.
- 10/11: The count-min sketch and its applications by Cormode and Muthukrishnan.
- Questions for 10/11: Compare contrast the mice/elephants paper and the count-min sketch paper? How do they describe and define the underlying problem(s) they are considering? How do they formalize their solution(s)? How do they compare?

- 10/13: Network Applications of Bloom Filters: A Survey. by Broder and Mitzenmacher.
- 10/13: Invertible Bloom Lookup Tables by Goodrich and Mitzenmacher.
- Questions for 10/13: I've been told that all a Bloom filter does is save a small constant factor in space -- do you agree with this assessment, and whether you do or not, do you think this makes a Bloom filter uninteresting? How could you imagine extending the Bloom filter? What would you wish a Bloom filter could do that it doesn't? The IBLT paper focuses on questions relating to the synchronization of sets between a pair of agents. Can you think of good uses for this solution? Or generalizations to this question?

- 10/18 MapReduce: Simplified Data Processing on Large Clusters by Dean and Ghemawat.
- 10/18 A Model of Computation for MapReduce by Karloff, Suri and Vassilvitskii.
- Questions for 10/18: Compare and contrast the theory and practice of the MapReduce paradigm. What kind of tasks might MapReduce not be good for? Are there any eventual scaling problems for this paradigm? Suppose you had access to a large-scale MapReduce system -- what would you most want to use it for?

- 10/20 Mumps Paper
- 10/20 Google Flu Paper

- 10/25 Cuckoo Hashing. by Pagh and Rodler
- 10/25 Cuckoo Filter.
- Questions for 10/25: There are several recent systems papers that suggest uses for cuckoo hashing (or variants) in real systems. Go find one or more -- give the reference, and explain why cuckoo hashing is (or is not) a seemingly good solution for the problem the paper is trying to solve.

- 10/27 How Not to Review a Paper. by Cormode
- 10/27 Thoughts on Reviewing. by Allman
- 10/27 How to Give a Bad Talk.
- 10/27 How to Give a Good Talk.
- No writeup is necessary, but read and be ready to discuss reviews/talks.

Additional useful papers for Unit 3:

- Graham Cormode's page -- lots of papers on streaming..
- My student Justin Thaler's page -- lots of papers on streaming..
- Andrew McGregor's page -- lots of papers on streaming..
- Compressed Bloom Filters by Mitzenmacher.
- Beyond Bloom Filters: From Approximate Membership Checks to Approximate State Machines, by Bonomi, Mitzenmacher, Panigrahy, Singh, and Varghese.
- Survey of hash-type algorithms for networks stuff.
- Why Simple Hash Functions Work: Using the Entropy in a Data Stream. Mitzenmacher and Vadhan.
- A bibliographic list of old streaming papers.
- For fun, read Dewitt and Stonebreaker's take on MapReduce.
- Distance-Senstive Bloom Filters by Kirsch and Mitzenmacher.
- The Bloomier Filter by Chazelle, Kilian, Rubinfeld, and Tal.
- Survey of hash-type algorithms for networks stuff.
- Why Simple Hash Functions Work: Using the Entropy in a Data Stream. Mitzenmacher and Vadhan.