CS50 - Section 13

Section Information


Huff and Puff

I'm going to assume that you all know all about Huff and Puff by now- we've certainly talked about it at great length, and there's a solution set online that you can look at to resolve any lingering questions. Instead of talking about how you did this assignment, I'm going to focus on how you might have done this assignment, if instead of the wealth of suggestions and broad hints we gave (in lecture, the assignment, section, reviews, etc), we'd given you a different (but equally valid) set of broad hints.

The first thing that we must come to grips with is that there are other approaches to this problem. To some people, huff and puff seemed so hard that finding one solution was hard enough, so it might be annoying to hear that there are many other possible approaches. This is the case, however, and exploring some of these will, I hope, stir up some interesting thoughts.


The Forest

Instead of tearing the whole program apart and starting over, let's start by considering how one of the modules could be rewritten. The module that we'll consider first is the body of code that deals with the forest.

One flaw that many people made in their design was to not provide a high enough level of abstraction (or functional decomposition) for their forest data structure. (In fact, some people even merged their forest and tree data structures together, which defeats this almost entirely.) If you design your program around a forest that is implemented by a linked list, it's very easy for the assumption that your forest is a linked list to start creeping into the rest of your code. For the sake of this assignment, this is not a big problem; it's OK to make assumptions based on a data structure if you know that the data structure is both sufficient to the purpose and will not change over the life of your program- which was certainly the case for most of you.

Our first step, therefore, will be to banish these assumptions. We'll define a functional interface to for the forest routines, and replace all references to the forest with these functions.

Here is a brief description of the functions. (Note that the type specifications are deliberately vague.)

forest_t forest_init (parameters);

Do whatever work is necessary to initialize the forest. This function's parameters are not fully specified here; the parameters that are necessary may vary for each particular forest representation.

Note that forest_init does not fill in the initial values of the forest (i.e. the NUM_CHARS initial nodes).

bool forest_huff_add (forest, node);

Add the given node to the forest. Obeys the rules given by the Huffman algorithm (which is why "huff" appears in the name).

node forest_huff_del (forest);

Returns the next node that should be removed from the forest, according to the Huffman algorithm for selecting the next node to remove to combine with another node.

void forest_delete (forest)

Destroy the given forest, free'ing all the dynamically allocated memory (if any). This function isn't actually used by huff and puff, but it might be a good idea...

Assuming that this interface is sufficient, we can now code the rest of huff and puff without giving another thought to how these functions are actually implemented (although eventually we will care, because we'll need to implement them...).

A Linked-List Forest

This was the representation everyone used (thanks to a number of hints). There was a mix of singly and doubly linked lists; either is OK for the sort of thing that we need to do.

An Array Forest

Linked lists can be painful to debug- pointers hither and thither, dynamically allocated memory, and so forth. Arrays have a certain appeal.

Can we use arrays to represent the forest? The answer is yes, as long as we can implement the interface functions:

forest_t forest_init (int size);

If our forest is an array, then we need to specify how large it is (so that this function can go off and allocate an array of that length).

Luckily, we can specify the size ahead of time, since we know that for huff and puff the forest will contain at most NUM_CHARS elements. If we create a forest large enough to hold that many elements, we'll always have enough (and actually always have more than enough, except for the brief instant when we use them all).

bool forest_huff_add (forest, node);

Since we know that the array will always have a free spot whenever we want to add anything, this would (at first glance) appear to simply be a matter of finding the first empty spot and depositing the new node at that position. Fortunately, life is more interesting- this simple approach will make it very hard to implement forest_huff_del in such a way that it obeys the rules of selecting the next element to pull out of the forest- we won't have information about the order that the nodes were stored in.

Here's a way to fix this problem- to add a new node:

  1. Slide each element in the array as far "down" (to the "left") as it will go, thereby filling in any holes in the array and creating a new hole at the end.

  2. Place the new element after the last occupied position in the array. (Note- there always will be such an position, if we've sized the array properly.)

node forest_huff_del (forest);

Restating the rules for selecting the node to remove: the next node that should be removed from the forest is the node with the lowest frequency. If there are two nodes with the same frequency, the one that has been in the forest the longest is removed.

Unfortunately, the array is not sorted, and so the position of the appropriate node is unknown- it might even be the last one (the most recently added node). Therefore, the appropriate algorithm is to scan the entire array from left to right, finding the node with the lowest frequency. In the case of a tie, the node to the left always wins (since, because we always add new nodes to the right end of the array, nodes to the left must be "older" than nodes to the right).

void forest_delete (forest)

All we need to do to destroy the forest is free the array.

This seems painful- to pull an element out of the forest requires O(n) steps (scanning the entire array), as does adding an element. How does this compare with the linked list approach? In the linked list approach, pulling an element out is O(1)- we know it's always the first one, so we just grab it. Adding an element, however, is O(n), since we may need to scan the entire list in order to find the appropriate position- and this is the worst case; we usually expect to only need to scan part of the list.

So, it would appear that this algorithm is less efficient than the linked list algorithm. Is this really the case? Let's investigate...

First, note that the number of forest operations is entirely predetermined by the Huffman algorithm: there will be NUM_CHARS adds (for the characters), followed by NUM_CHARS-1 iterations of two dels followed by one add. As the last operation, there's one final del. (We could even calculate the expected running time based on the number of elements in the forest during each operation, but that level of detail is not necessary here.) What is key to note is that we do exactly the same number of adds and dels, independent of n. Counting these operations together, for the linked list we have n O(O(n) + O(1)) operations, while for the array representation we have n O(O(n) + O(n)) operations. In the end, the constant out front may be larger for the array operations, but the order is the same- and because array traversals are typically very efficient, and we don't need to do a malloc or free for every operation (malloc and free are not cheap compared to an array traversal), if implemented correctly the array approach will be as fast as the linked list approach. (For systems with older and slower mallocs, it will probably beat the pants off the linked list implementation.)

Another Array Forest

We could use a different array approach to speed things up even further, at the cost of additional memory. In order to implement this approach, we need to know ahead of time how many adds we'll do total (not the number of elements that might be in the forest simultaneously, but the total number of adds that will be done during the life of the forest). Once again, we happen to know this information- there will be exactly (2 * NUM_CHARS) - 1 adds, so we can use this approach.

This refinement affects both add and del. Add is simplified- we just keep track of how many adds have already taken place, and put each added element in the next position. We know that this position will be empty and to the right of all filled positions. Therefore, add becomes an O(1) operation. Del must be modified so that it knows enough to skip over empty spaces (which would be indicated by some sentinel value, such as a NULL pointer). Therefore, del, remains an O(n) operation- but now n is the number of adds made the forest, instead of the number of elements in the forest- in theory, this could get a lot worse. In the Huffman algorithm, it gets somewhat worse (because n keeps increasing out to (2 * NUM_CHARS) - 1, instead of decreasing), but it's still bounded and the total may be less than for the first array approach (and there are ways to speed it up as well- but read the next two sections before thinking about them).

Other Ideas

Some of the possible approaches are a little clumsy because of the rules that determines what element do delete at each step. If we relax the restriction that the oldest element that has the smallest frequency is selected (and instead just say that any element with the smallest frequency will do), then many more optimizations are possible- for example, we could use a heap to implement the forest, in which case both add and del become O(log2 n) operations.

However, relaxing this restriction leads to another problem- without a rigorous and deterministic definition for what node is selected at each step, it becomes impossible to specify in an abstract sense how the Huffman tree is built, since the codes depend on the order of node selection. Instead of distributing an abstract algorithm (as in the assignment) and telling everyone that so long as they use the given algorithm, their huff and puff programs will be compatible with everyone else's, you'd either have to accept the possibility of incompatible huff and puffs, or specify the algorithm for the forest representation in detail.

It's just not worth the trouble. Read the next section for more reasons why it's not worth paying this price for optimality here.

However...

We're worrying too much about optimizing the forest structure!

If you analyze where the time is really spent in huff and puff, you'll find that the I/O completely dominates the efficiency, for all but very small files (which a truly clever huff or puff would ignore anyway, since compressing anything less than about 4K in length is pointless on our systems, which apparently has a block size of 4K).

Therefore, the chief concern should generally be how easy and clear a forest representation is to code, debug, and extend in the future, rather than how efficient it is. (Of course, if circumstances change, and you need to squeeze every drop of performance out of the system, then your priorities may be different.) Since we already have on hand a bunch of debugged linked list routines that we're familiar with (thanks to assignment 6), it is easy and natural to use them. However, if you were ship-wrecked on an desert island without them, you might find it easier to toss together an array implementation.


Malloc and Free

A small digression...

As you were writing your code, you probably made liberal use of dynamically allocated memory. In most cases, however, it was probably true that you didn't need to use dynamic allocation at all, or could have gotten away with a great deal less of it.

Consider the allocation of the list nodes used to build the forest- you know exactly how many of these that you'll need (at most NUM_CHARS at any given instant) and so a fixed number of these could be preallocated all at once (i.e. in an array), and then you can carve off each as you need another).

The number of tree cells that you need to build the Huffman tree is also bounded- it is exactly (2 * NUM_CHARS) - 1 (because a complete tree with NUM_CHARS leaves has exactly NUM_CHARS - 1 internal nodes).

Similarly, the amount of space needed to store the character codes is bounded- without thinking about it too much, in the worst case the longest code will be NUM_CHARS-1 bits in length, requiring NUM_CHARS chars to store as a NUL-terminated character string. Therefore, if we allocate NUM_CHARS*NUM_CHARS chars to store the codes, we know we'll have enough. With some more thought, however, we can find smaller bounds. (Unfortunately, I am incapable of such deep thought at the moment.) What this boils down to is that if we don't want to use dynamic allocation, we don't have to. Since use and misuse of dynamic allocation is error-prone, whether or not a particular algorithm requires dynamic allocation can be interesting. In this case, it's easier to use it, but there might be cases where it's not.


A Bit About bit.c

Reading and Writing

The big time-sink in the bit operations is figuring out whether it's time to read or write a byte, and then doing the actual read and write. How can we improve upon this?

One solution is to read and write larger objects (for example, longs). This reduces the actual number of I/O operations, but doesn't reduce the overhead of deciding whether or not it's time to write. How can we accomplish this?

If we really want things to scream, we can reduce all the reads and writes due to bit operations to a single read or write.

For reading, we figure out how long the file is (via the stat system call, or a combination of fseek and ftell, from the stdio library), malloc a character array large enough to hold the entire file, and then slurp in the whole file with a single fread. Now instead of indexing within a byte, we also need to index within chars, but this is easy enough: to get at the ith bit in array, use something like:

	array [i / BITS_PER_BYTE] & (1 << (i % BITS_PER_BYTE));

(Note that we can't use stat to tell us the length of a pipe or network connection, so if we're not sure that we'll be reading from an actual disk file, this approach will not work.)

Writing can be done in a similar manner. We know how many bits we need to write, thanks to properties of the Huffman code. The number of bits is the sum of the products of the frequency of each character and the code length for that character. If we divide by the number of bits in a byte (rounding up so that any stragglers won't get lost), we'll have the total number of bytes that we'll write. Note that this is a property of the Huffman coding algorithm, and is not generally true for arbitrary algorithms.

Error Checking

Another place where time is spent on each call to bit_read and bit_write is in error checking- making sure that bit_read isn't called with a bit stream that was opening for writing, or vice-versa. There's no perfect way to get around this, but there is a way to do most of the error checking up front, during compilation, instead of at run time.

We can remove the bit_file_t structure and replace it with two new structures- the bit_file_r_t structure and the bit_file_w_t structure. Similarly, we'll split bit_open and bit_close into two functions each- one version of bit_open will be called if you want to read, and it will return a bit_file_r_t, etc. Finally, bit_flush and bit_write will be modified to take as their argument a bit_file_w_t, and bit_read will take a bit_file_r_t. With this done, the compiler will, as it checks for the proper types being sent to each function, check that we're not passing in the wrong sort of bit file. This can be defeated by an improper cast- but then again, so can just about anything.

I don't generally advocate this sort of technique. It's nice to make the bit I/O operations look somewhat like the stdio I/O operations, and the stdio I/O operations look fine already. Proliferation of new types can get to be a nuisance. Also, this approach makes it quite hairy to add the capability to support bit streams that are available for both reading and writing- the stdio model is better for this. However, it's worth thinking about things like this; maybe this sort of technique will prove useful at some point.


Please bring any errors or inconsistencies in this document to the attention of the author.

Dan Ellard - (ellard@das.harvard.edu)