CS 222 -- Algorithms at the End of the Wire
Handouts and Class Materials
- [8/1] I'm told the course videos are at
https://matterhorn.dce.harvard.edu/engage/ui/index.html#/2017/01/15042 . For right now they should be public.
- [7/30] We'll have a Piazza page up. Go to
to sign up, and
for the Piazza page. If you cannot sign up, please send me an e-mail with the e-mail you
would like to use for Piazza and I will add you manually. Both the Harvard and extension class
will use the same Piazza page.
- [7/30] We will plan to use Canvas to turn in assignments. Please make sure you are
signed up to Canvas. There should be (separate) Canvas systems set up for both CS222 and the Extension
- [7/30] We'll have a link to online lectures here as soon as possible.
Basic Class Information
All reading dates are tentative and will be confirmed in class.
We may go faster, we may go slower. We may have to move things
for other reasons.
You should consider the assigned papers a minimum of what you should be
reading for this class. Feel free to explore on the Web or otherwise
(and additional suggested readings for each topic will be listed as well). We are just
touching the surface of these topics; there's much more out there.
Unit 1: Search Engines and the Importance of Links
- 9/1 Background: Review Markov chains.
- Start with
wikipedia. The first external link is to a useful book chapter
on Markov chains.
Additional useful papers for Unit 1:
- Graph Structure in the Web
by Broder et al.
- The Link Database: Fast Access to Graphs of the Web
by Randall et al.
The Eigentrust Algorithm for Reputation Management in P2P Networks
by Kamvar, Schlosser, and Garcia-Molina.
Trust-Based Recommendation Systems
by Andersen et al.
PicASHOW: Pictorial Authority Search by Hyperlinks on the Web. by Lempel
The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect
by Lempel and Moran.
Power Laws, Pareto Distributions, and Zipf's Law.
by Mark Newman.
On Power-Law Relationships of the Internet Topology.
by Faloutsos, Faloutsos, and Faloutsos.
Power-Law Distributions in Empirical Data.
by Clauset, Shalizi, and Newman.
The Anatomy of the Long Tail.
by Goel, Broder, Gabrilovich, and Pang.
- Aggregating Inconsistent Information.
by Ailon, Charikar, Newman.
- Editorial: The Future of Power Law Research.
Unit 2: Compression and Basic Information Theory
- 9/20, and ongoing: We will be covering the basics of compression and
information theory, including Huffman coding, arithmetic coding, LZ-style coding, etc.
Some good online introductions to the material include:
Information Theory, Inference, and Learning Algorithms, specifically part 1 (Data Compression) of Mackay's Book. (Though the whole book is good.)
Introduction to Data Compression,
notes by Guy Blelloch.
- 9/20: Note: for today, there's a talk and a paper. Hopefully you've seen the talk, if not now is the time.
- 9/20: Dasher : a Data Entry Interface Using Continuous Gestures and Language Models
by Ward, Blackwell, and MacKay.
- 9/20: Please see the talk on Dasher here .
- Questions for 9/28: Explain the connection between Dasher and compression, particularly arithmetic coding.
- 9/22: On Compressing Social Networks, by F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan.
- 9/22: Permuting Web and Social Graphs, by P. Boldi, M. Santini, and S. Vigna.
- 9/22: You may at some point also went to check this paper Degree Sequence of a Scale-Free Random Graph Process, to get some of the models suggested in the previous paper. We'll look at other papers on this point later.
- Questions for 9/22: How are compressing social networks and the web alike, and different? What role does having an underlying model
play in determining how to compress these types of structures? What properties of the model(s) appear important?
- 9/29 : With the assignment due, no reading for today; however, I expect we'll still be going over Burrows-Wheeler
and Sequitur for this class, so please be prepared.
Additional useful papers for Unit 2:
Unit 3: Data Streams and Data Summaries
- As a general resource, my colleague Justin Thaler is teaching
a course on streaming algorithms this semester; this class webpage is a fantastic resource for information on this big topic. Start here!
Data Streams: Algorithms and Applications. by Muthukrishnan.
Computing on Data Streams by Henzinger, Raghavan, and Rajagopalan.
- Questions for 10/2: How is computing on a data stream different from a "standard"
input-output algorithm? What are the goals, how do you measure
resource usage, what sorts of problems are considered? Where has
computing on data streams had successes? What lower bound techniques
are there showing that computing on data streams can be hard?
Network Applications of Bloom Filters: A Survey. by Broder and Mitzenmacher.
Invertible Bloom Lookup Tables by Goodrich and Mitzenmacher.
- Questions for 10/13:
I've been told that all a Bloom filter does is save a small constant factor in space -- do you agree with this assessment, and whether you do or not, do you think this makes a Bloom filter uninteresting? How could you imagine extending the
Bloom filter? What would you wish a Bloom filter could do that it doesn't?
The IBLT paper focuses on questions relating to the synchronization of sets between
a pair of agents. Can you think of good uses for this solution? Or generalizations to this question?
- 10/18 MapReduce: Simplified Data Processing on Large Clusters
by Dean and Ghemawat.
- 10/18 A Model of Computation for MapReduce by Karloff, Suri and Vassilvitskii.
- Questions for 10/18: Compare and contrast the theory and practice of the MapReduce paradigm. What kind of tasks might MapReduce not be good for?
Are there any eventual scaling problems for this paradigm? Suppose you had access to a large-scale MapReduce system -- what would you
most want to use it for?
Cuckoo Hashing. by Pagh and Rodler
- Questions for 10/25: There are several recent systems papers that suggest uses for cuckoo hashing (or variants)
in real systems. Go find one or more -- give the reference, and explain why cuckoo hashing is (or is not) a seemingly good solution
for the problem the paper is trying to solve.
Additional useful papers for Unit 3: