Instructor: Michael Mitzenmacher

E-mail: michaelm AT eecs.harvard.edu

Office: Maxwell Dworkin 331

Phone: 496-7172

Office Hours: TBD (subject to change, depending on conflicts), or by appointment.

Teaching Assistant: TBD

E-mail: TBD

Office Hours: TBD

Syllabus: www.eecs.harvard.edu/~michaelm/CS222/syllabus.html

Handouts: www.eecs.harvard.edu/~michaelm/CS222/class.html

This course is loosely based on the theme of how to deal with really big data, especially over networks. The topics change from year to year, and the list below is subject to change. The course will consist of multiple independent units, covering the major themes of information retrieval (search engines), compression, data summarization algorithms, and coding theory. Although the course will emphasize theoretical foundations, it will definitely be a mix of both theory and practice, and current issues will also be emphasized. The course is meant to show the synthesis of theory and practice; we will often read pairs of papers, one from the "theory community" and one from the "systems community", on the same theme. The course is also meant to promote skills required of graduate students, such as critical and creative reading and analysis of papers, and research.

The main work of the course will consist of the following: reading and analyzing a number of current and classic research papers; homework assignments based on the material; participating in a mock program committee; and undertaking a final research project. (I have been told the workload of the class is reasonable but non-trivial. If you were looking for a graduate class with no work requirements, please look elsewhere.)

During the semester, you will frequently be reading essentially 2 research papers to prepare for each class. This is more work than it sounds like! You must come to class prepared consistently; if your schedule will not permit that, you should not take the class.

There will be 3 or 4 short homework assignments; I expect to have one for each unit. These are described more below.

The other major work this year is that we will run a "mock program committee". I will choose papers from recent conferences (so ostensibly they are already good) that are thematically related to the course, and the class will act as a program committee to choose the best ones. This will give you an idea of how program committees work (or don't), and let you see some more up-to-date research, as class reading will be more focused on "classics". This is a fairly experimental approach; I have only done it once. However, it seemed useful for gaining an understanding of how program committees function, which seems helpful to graduate students.

Finally, a major component of the class will be a final project, which you will work on for approximately the last 2 months of the course. The hope is that this project may form the foundation of either a research paper or, for undergraduates, a senior thesis. (Last time I taught the course, three projects, with substantially more work, became research papers. Expectations, however, are realistic; research is exploration, and this project is understood to be the beginning, not necessarily the end, of such an exploration.) Although you will need to obtain approval for your project choice, the topic of the final project will primarily be up to you. This project can be either theoretical or implementation-based in nature. Generally people work in pairs for the final project, but this is not required. For graduate students, your project can be related to your main line of research; it should not, however, be something you were already working on.

Students should have taken at least CS 124 or its equivalent. Students should be able to program in a standard programming language; C or C++ is preferred. Knowledge of probability will be extremely helpful; if your probability background is weak you should expect to refresh your probability skills on your own. Generally, mathematics will be fundamental to the course, so you should expect to spend time learning some additional mathematics on your own if necessary. Similarly, some prior knowledge of networks and network issues will be very helpful. For students wishing to review important aspects of probability, there are many books available. Sheldon Ross has written several excellent introductory books which should be available in the library. My personal favorite is "Introduction to Probability Models." A more advanced book for those with more background is "Elements of Information Theory" by Cover and Thomas. Another good book is "Information Theory, Inference, and Learning Algorithms" by David MacKay, which has the benefit of being available online. Of course, my completely biased opinion is that the best book for a computer scientist to buy is by Mitzenmacher and Upfal, "Probability and Computing: Randomized Algorithms and Probabilistic Analysis." I'd recommend students with less background in probability get one (or more) of these books as a reference.

Your performance will be measured in four ways. (The percentage contributions to your grade given below are approximate and subject to change.)

- Problem sets (20%): There will be 3-4 short problem sets.
Generally they are meant to ensure that introductory material and the
major ideas are being absorbed. They will generally be due one to two
weeks after they are given out. These sets will primarily be
mathematical and/or theoretical in nature, although some
implementation may be required. These assignments are governed by the
collaboration policy, given below.
- Paper summaries (12.5%):
You will also have to regularly turn in paper summaries in a form to
be discussed for papers that we read during the semester. The point
of the summaries is really to ensure that you come to class prepared.
You will be allowed to skip two summaries of your choice during the
course of the semester. Summaries will be due before the class in
which the paper is discussed.
Summaries are to be approximately 250 words (about 1 page). Longer is not better. Summaries should be typed or handwritten extremely neatly. Summaries will be sent in by e-mail before the corresponding class begins.

- Class participation (12.5%): You will be expected to come to class
prepared to discuss the readings, and solve problems based on the
reading. I will be calling on people randomly throughout the
semester. At times you will have to work together in groups in class
to answer questions posed.
- PC participation (15%): You will be expected to write well-thought-out
reviews for papers you are assigned as part of the mock PC, and engage
in discussions to choose the final selection of papers from the PC.
- Final Project (40%): The final project will be your major output
in the course. The goal of the final project is to develop a full
understanding of an important open research area, and, to the extent
possible, to work on an open research problem. The final project
will include a major final paper, and (depending on the class size)
may also include an oral presentation.

All assignments will be due at the beginning of class on the appropriate day. Late assignments are not acceptable without the prior consent of the instructor. Consent will generally only be given for significant events or emergencies. Being busy in other classes is generally not a significant event.

If you collaborate with other students in the course in the planning and design of solutions to homework problems, then you should give their names on your homework papers.

Under no circumstances may you use solution sets to problems that may have been distributed by the course in past years, or the homework papers of students who have taken the course in past years. Nor should you look up solution sets from other similar courses.

Violation of these rules may be grounds for giving no credit for a homework paper and also for serious disciplinary action.

- Information retrieval and the Web
  - Ranking documents: PageRank, Kleinberg's algorithm
  - Other uses of link information
  - Link prediction

- Compression
  - Basics of information theory
  - Huffman compression (non-adaptive, adaptive)
  - Arithmetic coding
  - Lempel-Ziv compression and its variants
  - Burrows-Wheeler compression
  - JPEG/MPEG/Audio

- Summarization algorithms
  - Bloom filters and variants
  - Data streams and streaming algorithms
  - Similarity metrics and toolkits

- Coding
  - Basics of Shannon
  - Reed-Solomon codes
  - Network coding and gossiping