My office phone number is 873-4562 (8am-5pm), and my home phone number is 643-9644 (6pm-10pm). Please do not call me after 10pm, or before 10am on weekends.
By far the best way to reach me (or anyone on the course staff) is via email. My email address is ellard@das. The email address of the course account is lib50@fas.
email is also the easiest way for me to get information to you. Please check your mail at once a day or so.
You should have memorized all this stuff by now...
The first thing that we must come to grips with is that there are other approaches to this problem. To some people, huff and puff seemed so hard that finding one solution was hard enough, so it might be annoying to hear that there are many other possible approaches. This is the case, however, and exploring some of these will, I hope, stir up some interesting thoughts.
One flaw that many people made in their design was to not provide a high enough level of abstraction (or functional decomposition) for their forest data structure. (In fact, some people even merged their forest and tree data structures together, which defeats this almost entirely.) If you design your program around a forest that is implemented by a linked list, it's very easy for the assumption that your forest is a linked list to start creeping into the rest of your code. For the sake of this assignment, this is not a big problem; it's OK to make assumptions based on a data structure if you know that the data structure is both sufficient to the purpose and will not change over the life of your program- which was certainly the case for most of you.
Our first step, therefore, will be to banish these assumptions. We'll define a functional interface to for the forest routines, and replace all references to the forest with these functions.
Here is a brief description of the functions. (Note that the type specifications are deliberately vague.)
Note that forest_init does not fill in the initial values of the forest (i.e. the NUM_CHARS initial nodes).
Can we use arrays to represent the forest? The answer is yes, as long as we can implement the interface functions:
Luckily, we can specify the size ahead of time, since we know that for huff and puff the forest will contain at most NUM_CHARS elements. If we create a forest large enough to hold that many elements, we'll always have enough (and actually always have more than enough, except for the brief instant when we use them all).
Here's a way to fix this problem- to add a new node:
Unfortunately, the array is not sorted, and so the position of the appropriate node is unknown- it might even be the last one (the most recently added node). Therefore, the appropriate algorithm is to scan the entire array from left to right, finding the node with the lowest frequency. In the case of a tie, the node to the left always wins (since, because we always add new nodes to the right end of the array, nodes to the left must be "older" than nodes to the right).
So, it would appear that this algorithm is less efficient than the linked list algorithm. Is this really the case? Let's investigate...
First, note that the number of forest operations is entirely predetermined by the Huffman algorithm: there will be NUM_CHARS adds (for the characters), followed by NUM_CHARS-1 iterations of two dels followed by one add. As the last operation, there's one final del. (We could even calculate the expected running time based on the number of elements in the forest during each operation, but that level of detail is not necessary here.) What is key to note is that we do exactly the same number of adds and dels, independent of n. Counting these operations together, for the linked list we have n O(O(n) + O(1)) operations, while for the array representation we have n O(O(n) + O(n)) operations. In the end, the constant out front may be larger for the array operations, but the order is the same- and because array traversals are typically very efficient, and we don't need to do a malloc or free for every operation (malloc and free are not cheap compared to an array traversal), if implemented correctly the array approach will be as fast as the linked list approach. (For systems with older and slower mallocs, it will probably beat the pants off the linked list implementation.)
This refinement affects both add and del. Add is simplified- we just keep track of how many adds have already taken place, and put each added element in the next position. We know that this position will be empty and to the right of all filled positions. Therefore, add becomes an O(1) operation. Del must be modified so that it knows enough to skip over empty spaces (which would be indicated by some sentinel value, such as a NULL pointer). Therefore, del, remains an O(n) operation- but now n is the number of adds made the forest, instead of the number of elements in the forest- in theory, this could get a lot worse. In the Huffman algorithm, it gets somewhat worse (because n keeps increasing out to (2 * NUM_CHARS) - 1, instead of decreasing), but it's still bounded and the total may be less than for the first array approach (and there are ways to speed it up as well- but read the next two sections before thinking about them).
However, relaxing this restriction leads to another problem- without a rigorous and deterministic definition for what node is selected at each step, it becomes impossible to specify in an abstract sense how the Huffman tree is built, since the codes depend on the order of node selection. Instead of distributing an abstract algorithm (as in the assignment) and telling everyone that so long as they use the given algorithm, their huff and puff programs will be compatible with everyone else's, you'd either have to accept the possibility of incompatible huff and puffs, or specify the algorithm for the forest representation in detail.
It's just not worth the trouble. Read the next section for more reasons why it's not worth paying this price for optimality here.
If you analyze where the time is really spent in huff and puff, you'll find that the I/O completely dominates the efficiency, for all but very small files (which a truly clever huff or puff would ignore anyway, since compressing anything less than about 4K in length is pointless on our systems, which apparently has a block size of 4K).
Therefore, the chief concern should generally be how easy and clear a forest representation is to code, debug, and extend in the future, rather than how efficient it is. (Of course, if circumstances change, and you need to squeeze every drop of performance out of the system, then your priorities may be different.) Since we already have on hand a bunch of debugged linked list routines that we're familiar with (thanks to assignment 6), it is easy and natural to use them. However, if you were ship-wrecked on an desert island without them, you might find it easier to toss together an array implementation.
As you were writing your code, you probably made liberal use of dynamically allocated memory. In most cases, however, it was probably true that you didn't need to use dynamic allocation at all, or could have gotten away with a great deal less of it.
Consider the allocation of the list nodes used to build the forest- you know exactly how many of these that you'll need (at most NUM_CHARS at any given instant) and so a fixed number of these could be preallocated all at once (i.e. in an array), and then you can carve off each as you need another).
The number of tree cells that you need to build the Huffman tree is also bounded- it is exactly (2 * NUM_CHARS) - 1 (because a complete tree with NUM_CHARS leaves has exactly NUM_CHARS - 1 internal nodes).
Similarly, the amount of space needed to store the character codes is bounded- without thinking about it too much, in the worst case the longest code will be NUM_CHARS-1 bits in length, requiring NUM_CHARS chars to store as a NUL-terminated character string. Therefore, if we allocate NUM_CHARS*NUM_CHARS chars to store the codes, we know we'll have enough. With some more thought, however, we can find smaller bounds. (Unfortunately, I am incapable of such deep thought at the moment.) What this boils down to is that if we don't want to use dynamic allocation, we don't have to. Since use and misuse of dynamic allocation is error-prone, whether or not a particular algorithm requires dynamic allocation can be interesting. In this case, it's easier to use it, but there might be cases where it's not.
One solution is to read and write larger objects (for example, longs). This reduces the actual number of I/O operations, but doesn't reduce the overhead of deciding whether or not it's time to write. How can we accomplish this?
If we really want things to scream, we can reduce all the reads and writes due to bit operations to a single read or write.
For reading, we figure out how long the file is (via the stat system call, or a combination of fseek and ftell, from the stdio library), malloc a character array large enough to hold the entire file, and then slurp in the whole file with a single fread. Now instead of indexing within a byte, we also need to index within chars, but this is easy enough: to get at the ith bit in array, use something like:
array [i / BITS_PER_BYTE] & (1 << (i % BITS_PER_BYTE));
(Note that we can't use stat to tell us the length of a pipe or network connection, so if we're not sure that we'll be reading from an actual disk file, this approach will not work.)
Writing can be done in a similar manner. We know how many bits we need to write, thanks to properties of the Huffman code. The number of bits is the sum of the products of the frequency of each character and the code length for that character. If we divide by the number of bits in a byte (rounding up so that any stragglers won't get lost), we'll have the total number of bytes that we'll write. Note that this is a property of the Huffman coding algorithm, and is not generally true for arbitrary algorithms.
We can remove the bit_file_t structure and replace it with two new structures- the bit_file_r_t structure and the bit_file_w_t structure. Similarly, we'll split bit_open and bit_close into two functions each- one version of bit_open will be called if you want to read, and it will return a bit_file_r_t, etc. Finally, bit_flush and bit_write will be modified to take as their argument a bit_file_w_t, and bit_read will take a bit_file_r_t. With this done, the compiler will, as it checks for the proper types being sent to each function, check that we're not passing in the wrong sort of bit file. This can be defeated by an improper cast- but then again, so can just about anything.
I don't generally advocate this sort of technique. It's nice to make the bit I/O operations look somewhat like the stdio I/O operations, and the stdio I/O operations look fine already. Proliferation of new types can get to be a nuisance. Also, this approach makes it quite hairy to add the capability to support bit streams that are available for both reading and writing- the stdio model is better for this. However, it's worth thinking about things like this; maybe this sort of technique will prove useful at some point.
Dan Ellard - (ellard@das.harvard.edu)