CS 161: Operating Systems

Tuesday/Thursday 1:00-2:30

Pierce 301

What GDB is and why you need to use it

GDB is a debugger. The fundamental point of a debugger is to stop and inspect the state of a running program. This helps you analyze the behavior of your program, and thus helps diagnose bugs or mistakes. With GDB you can (in general) do the following things:

Control aspects of the environment that your program will run in.
Start your program or connect to an already-running program.
Make your program stop for inspection.
Step through your program one line at a time or one machine instruction at a time.
Inspect the state of your program once it has stopped.
Change the state of your program and then allow it to resume execution.

In your previous programming experience, you may have managed without using a debugger. You might have been able to find the mistakes in your programs by printing things on the screen or simply reading through your code. Beware, however, that OS/161 is a large and complex body of code, much more so than you may have worked on in the past. To make matters worse, you didn't write most of it. A debugger is an essential tool in this environment. It is no exaggeration to say that no student has ever survived CS161 without using GDB. You should, therefore, take the time to learn GDB and make it your best friend. (Or rather, your second best friend; your best friend should be your partner.) This guide will explain to you how to get started debugging OS/161, describe the most common GDB commands, and suggest some helpful debugging techniques.

How to start debugging

Because you're trying to debug the OS/161 kernel, which runs inside System/161, and not System/161 itself, you can't just do

   gdb sys161

as that runs the debugger on System/161. Nor can you do

   gdb kernel
   (gdb) run

as that tries to run the kernel outside of System/161, which isn't going to work.

Instead, you need to connect the debugger to the kernel via System/161's debugger interface. Also, you can't use the regular system copy of GDB, as the OS/161 kernel is MIPS and you need a version of GDB configured to debug MIPS. Among the tools for working with OS/161 is a copy of GDB so configured; you can run it as os161-gdb.

The complete procedure is as follows.

Open two shell windows. These must be logged in on the same machine -- in our case both in the CS50 Appliance, but if you're working in some other environment you might need to take special steps to arrange this.
In both windows, change to the OS/161 root directory.
```
   % cd ~/cs161/root
```
In one window, boot OS/161 in System/161, and use the -w option to tell System/161 to wait for a debugger connection:
```
   % sys161 -w kernel
```
In the other window, start os161-gdb on the same kernel image, and tell gdb to connect to System/161:
```
   % os161-gdb kernel
   (gdb) target remote unix:.sockets/gdb
```

At this point, if everything worked properly, GDB will connect to your kernel and tell you that the target program is stopped in start.S. It is, in fact, waiting at the very first instruction of the kernel, as if you'd just started the kernel in GDB. It also normally can't find the source listing for start.S; there's a way to fix that (see below) but it doesn't matter just yet.

Now you can use GDB to debug almost exactly as you would debug an ordinary program. Generally you aren't interested in debugging start.S, so the first thing you typically want to do is set one or more breakpoints and then continue the kernel until it reaches one.

A first GDB run

This first time, you aren't hunting a specific bug, you're just trying things out; so a good place to put a breakpoint is kmain. This is the kernel's equivalent of an ordinary C program's main: the beginning of the program, where execution moves out of the preliminary assembly-language startup code into C code. Start up the kernel in GDB as above, and then put a breakpoint on kmain as follows:

   (gdb) break kmain

Now continue:

   (gdb) c

and GDB will stop again in main.c. You'll be stopped on a line that calls boot(), the function that sequences the kernel's initial bootup actions. Single-step into this:

   (gdb) s

This will take you to the first line of boot, which is a call to kprintf, the OS/161 function for printing miscellaneous messages to the system console. Single-step over this call:

   (gdb) n

to execute the kprintf call and stop on the next line. You can get the listing of the area around where you are by typing

   (gdb) list

You'll notice that even though there are several calls to kprintf here, nothing actually comes out on the console. Single-step with n over the next few lines (first some kprintf calls, then calls for bootstrapping various kernel subsystems) until you get to mainbus_bootstrap.

When you step over mainbus_bootstrap, suddenly all the console messages print out. This is because nothing can actually come out of the console until the kernel searches for and finds the console hardware; this happens inside mainbus_bootstrap. This is important to remember: if the kernel hangs or dies without printing anything, it does not mean that something inexplicably horrible happened before that first kprintf, it just means that something happened before the call to mainbus_bootstrap. You will be writing code that runs before that point, so chances are this will happen to you at least once this semester. This is one of the reasons GDB is important: tracking such problems down with GDB is pretty easy. Tracking them down without GDB, when you can't print anything out, is ... interesting.

Now tell GDB to execute through to the end of boot:

   (gdb) finish

It will come to a stop again on the second line of kmain, which calls the kernel menu code. You can step into this to see how the menu works, but it's actually not all that interesting. So instead, put a breakpoint on the reboot system call handler and continue:

   (gdb) break sys_reboot
   (gdb) c

Now the kernel will print the menu prompt waiting for input. Run the poweroff utility, by typing (in the other window):

   OS/161 kernel [? for menu]: p /sbin/poweroff

GDB will stop at the breakpoint on sys_reboot. If you step through from there you'll see that, after attending to various shutdown tasks, the kernel calls mainbus_poweroff to shut off the electricity on the system board. At that point System/161 will exit, GDB will print Remote connection closed, and you're done.

Connecting and disconnecting GDB

While you often know that you want to run in the debugger when starting the kernel, in which case you use the method above, this isn't always the case. You can attach GDB to System/161 at any time by simply running it and entering the target remote line. This connects to System/161's debugger port, and that causes System/161 to stop.

System/161 will itself also sometimes stop and wait for a debugger connection. This will happen on various hardware-level conditions that are not normally supposed to happen; it will also happen if the kernel executes a breakpoint instruction or requests the debugger explicitly via the ltrace device. And it will happen if you turn System/161's progress monitoring on (see the System/161 manual for an explanation) and the kernel fails to make progress in the specified amount of time. In these cases you attach the debugger in the same way: by running it and entering the target remote line.

Importantly, you can also make System/161 stop and wait for the debugger by pressing ^G. This can be useful if your kernel is looping or deadlocked.

Because typing that target remote command over and over is a major nuisance, we suggest setting up a macro for it. Edit the file ~/cs161/root/.gdbinit and put the following in it:

   define db
   target remote unix:.sockets/gdb
   end

On startup, GDB reads the file .gdbinit in the current directory and executes the GDB commands it finds in it; so now you can connect to System/161 by just typing

   (gdb) db

It is also often useful to put common breakpoint settings in .gdbinit, such as on panic or badassert. (Note that as of OS/161 2.0, panic will itself automatically request a debugger connection. But you may find that putting an explicit breakpoint on it is still helpful.)
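
As a rough sketch, a .gdbinit that combines the db macro above with those breakpoints might look like the following. It assumes the symbols panic and badassert exist in the kernel image you load, and that you start GDB as os161-gdb kernel from this directory so the symbols are known when the file is read; adjust it to your own setup.

```
   # ~/cs161/root/.gdbinit (sketch)
   # Short command for attaching to System/161.
   define db
   target remote unix:.sockets/gdb
   end

   # Stop whenever the kernel panics or an assertion fails.
   break panic
   break badassert
```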

Note that for reasons we don't yet understand (if we understood them, we would have fixed this), sometimes connecting the debugger while System/161 is running (rather than waiting for a debugger connection) doesn't work. Also sometimes, but not usually the same times, pressing ^G doesn't seem to work. (If you manage to get a repeatable test case for either of these conditions, contact dholland@eecs.harvard.edu.)

Also note that if GDB is already connected, and you tell it to continue and the kernel doesn't stop again when you expected, getting it to stop so you can do more with GDB can be annoyingly difficult. Theoretically at this point pressing ^G in the System/161 window will cause it to stop executing and return control to GDB, but sometimes this doesn't work either. Hitting ^C in GDB will get you a GDB prompt again, but one where you can't do much of anything because the kernel is still running; GDB suggests you type interrupt, but this doesn't work. If all else fails, you can kill GDB, start another copy, and reconnect. Don't hit ^C in System/161 as that will kill it and lose your state.

When you are done debugging, you can disconnect the debugger from System/161 (and thus the running kernel) in two ways:

   (gdb) detach

will unhook the debugger and leave the kernel still running, whereas

   (gdb) kill

will unceremoniously nuke System/161, much as if you'd gone to its window and typed ^C. Quitting out of gdb while connected is the same as using kill.

Caution: Be sure that the kernel image that you run in System/161 and the one you run os161-gdb on are the same. If they aren't, it's like using a different edition of a textbook with different page numbers and not realizing it: things can get very bizarre and very confusing.

Executing under GDB control

Here are the most common and often-used commands for controlling the execution of the kernel in GDB:

s, step - Step through the program. If you want to go through your program step by step after it has hit a breakpoint, use the "step" command. Typing

   (gdb) s

will execute the next line of code. If the next line is a function call, the debugger will step into this function.

n, next - Execute the next line. This command is similar to the "step" command, except that it does not step into a function, but executes the function, as if it were a simple statement. This is extraordinarily useful for stepping over functions such as kprintf or other functions that you know/believe to be correct.

adv, advance - Advance to a specified line. This is like "next" but continues to a specified line number. It is like setting a breakpoint on that line and using "continue". If you accidentally pick a line that you won't get to, it won't stop until the next breakpoint.

finish - Continue until function return. This command advances until the end of the function.

c, continue - continue execution. When the kernel is stopped (at a breakpoint, or after connecting, or after executing manually for a while), type

   (gdb) c

to just continue executing. Execution continues until something happens that stops it, such as hitting a breakpoint, or if you type ^G into System/161. (See above for some notes on ^G.)

b, break - set a breakpoint. Use this command to specify that your program should stop execution at a certain line or function. Typing

   (gdb) b 18

means that your program will stop every time it executes a statement on line 18 of the current source file. As with the "list" command (described below), you can also give a function name, e.g.:

   (gdb) b main

will stop when execution encounters the main function.

info breakpoints - List all breakpoints.

d, delete - Delete a breakpoint by number. Use this command to delete a breakpoint. By typing

   (gdb) d 1

you will delete the breakpoint numbered 1. GDB displays the number of a breakpoint when you set that breakpoint. Typing "d" without arguments deletes all breakpoints.

clear - Clear a breakpoint by name/address. If you don't remember the number of the breakpoint you want to delete, use the "clear" command. Just like the "break" command, it takes a line number or a function name as an argument.

disable, enable - Disable/enable a breakpoint. Use these commands with a breakpoint number as an argument to disable or enable a breakpoint. When you disable a breakpoint, the breakpoint is still there, but execution will not stop at it until you enable the breakpoint.

cond - Make a breakpoint conditional. This can be helpful if you need to set a breakpoint on a very common code path. Usage is like this:

   (gdb) cond 3 count > 1000

which makes breakpoint 3 only happen when the expression count > 1000 is true. CAUTION: in past years, we have found that use of conditional breakpoints seems to corrupt GDB's internal state such that it starts lying to you. This year (2016) we have a substantially newer GDB version that we hope no longer has this bug. Until we have further data, use conditional breakpoints cautiously, and let the course staff know what your experiences are.
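
A condition can also be attached at the moment the breakpoint is created, using the "if" form of the break command. In this sketch the file name, the line number, and the variable count are placeholders rather than names from OS/161, and 3 stands for whatever number GDB assigned to the breakpoint:

```
   (gdb) break foo.c:120 if count > 1000
   (gdb) info breakpoints
   (gdb) cond 3
```

Giving cond a breakpoint number with no expression removes the condition again, which turns the breakpoint back into an ordinary one. The caution above applies equally to conditions set this way.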

commands - Execute commands at a breakpoint. You can specify that a certain command or a set of commands be executed whenever a breakpoint is hit. For example, to specify that a certain string and a certain value are printed every time you stop at breakpoint 2, you could type:

   (gdb) commands 2
    > printf "theString = %s\n", theString
    > print /x x
    > end

Inspecting stuff with GDB

The chief purpose of a debugger is to inspect the target program. Here are the most common/useful commands for doing that.

l, list - List lines from source files. Use this command to display parts of your source file. For example, typing

   (gdb) l file.c:101

will display the lines around line 101 in file.c. If you leave off the file name, GDB will use what it thinks is the current source file; this is usually the most recently referenced source file but occasionally isn't what one expects. You can also give a function name to list starting at the top of that function. If you give no arguments, GDB will list the next chunk of the file you last listed or, if you just stopped execution, the area around the point at which you stopped.

Note: If you need to debug in start.S (or any other assembler file) GDB can't find the sources by default. This is because of a glitch in the toolchain: the paths to the assembler source files are relative, not absolute, so they only work by default in the kernel build directory. To work around it, you can tell GDB to start looking for the sources in the kernel build directory:

   (gdb) directory /home/jharvard/cs161/os161/kern/compile/ASSTN

for whatever assignment N you're currently working on. Tip: you can put this in .gdbinit.

disassemble - Disassemble some code. Use this command to print disassembly of the machine code for a function. Add /m to mix in the source listing as well. Hopefully you won't need to use this; but it can occasionally be useful.

x - Examine memory. This is a general-purpose function for dumping memory; it has a variety of options and forms, all of which take a memory address (which can be the name of a variable or function) as an argument. The most useful are:

x/x - dump memory as hex
x/4x - dump four words of memory as hex
x/64x - dump 64 words of memory as hex, which fits 256 bytes nicely on a normal-size screen
x/s - dump memory as null-terminated strings
x/c - dump memory as single characters
x/i - dump memory as disassembly

but there are many more variations; see the GDB help on the command for the full details.
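
A few typical invocations, as a sketch. Here name is a hypothetical string variable in the current scope, while $sp and $pc are GDB's names for the stack pointer and program counter registers:

```
   (gdb) x/4x $sp
   (gdb) x/s name
   (gdb) x/8i $pc
```

The first dumps four words from the top of the stack, the second prints the string that name points to, and the third disassembles eight instructions starting at the current point of execution.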

bt, where - Display backtrace. To find out where you are in the execution of your program and how you got there, use one of these commands. This will show the backtrace of the execution, including the function names and arguments.

up, down - Navigate the backtrace. You can move up the stack to parent stack frames to inspect things there. With both the "up" and "down" commands you can give an optional numeric argument to move up or down that many frames instead of just one.

frame - Select a stack frame in the backtrace. With a numeric argument, this moves up (or down, as needed) the stack trace to the frame with the specified number. (Frame 0 is the innermost, where execution is actually occurring.) Without an argument it reselects the last frame you navigated to, which both prints it out and has the often-useful side effect of resetting the current listing position to there.
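
A short session using these commands together might go roughly as follows; the frame counts are arbitrary and depend on where you happen to be stopped:

```
   (gdb) bt
   (gdb) up 2
   (gdb) info locals
   (gdb) frame 0
```

This prints the backtrace, moves two frames toward the caller, shows the local variables of that frame (info locals is a standard GDB command not otherwise covered in this guide), and then returns to the innermost frame.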

p, print - Print an expression. Use for example

   (gdb) p name

to print the value of name. You can use arbitrarily complex C expressions; it is often useful for example to include typecasts. Normally the value will be printed according to its type; you can insert /x, /s, /d, or /u to force printing as hex, as a string, or as a signed or unsigned integer respectively.

When you print something, you'll get a result line like this:

   $1 = 0x80063d80 ""

The $1 refers to an entry in GDB's value history: GDB numbers each value it prints and remembers it for your subsequent use. You can use $1 as a value in subsequent expressions and it will retrieve the value that was printed; this saves having to retype it or cut and paste it. (Note though that it saves the value, not the expression; if you execute for a while and then use $1 again it will have the same value regardless of what's happened in the meantime to the values in the expression that generated it.)
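
As a small illustration of reusing printed values, assume a hypothetical pointer variable ptr is in scope. A bare $ refers to the most recently printed value, so a pointer can be printed and then followed without retyping it:

```
   (gdb) p ptr
   (gdb) p *$
   (gdb) p/x $1
```

The second command dereferences the pointer that was just printed; the third reprints the saved value numbered 1 in hex.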

printf - Formatted print. The "printf" command allows you to specify the formatting of the output, just like you do with a C library printf() function. For example, you can type:

   (gdb) printf "X = %d, Y = %d\n",X,Y

info registers - Show the processor registers. This prints all the processor registers. You can also use values from registers in "print" expressions: a $ followed by the symbolic name of the register fetches that register's value; e.g. the return value of a function is normally left in $v0.
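
For instance, the following sketch prints a few individual registers in hex using the usual MIPS register names ($v0 for the return value, $a0 for the first argument, $ra for the return address):

```
   (gdb) p/x $v0
   (gdb) p/x $a0
   (gdb) p/x $ra
```

Whether these hold anything meaningful depends on where you are stopped; $v0, for example, is only interesting immediately after a function has returned.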

display - Display an expression. Compute and display the value of an expression every time the program stops. For example:

   (gdb) display abc

If there is no abc in scope when the program stops, the display won't appear. This command is otherwise the same as "print", including the support for type modifiers.

undisp - Remove a "display". The argument is the number of the "display" expression to remove. With no argument, all "display" expressions are removed. Also, delete display N is the same as undisp N.
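
Putting these together, with abc standing in for any expression of interest and 1 for the number GDB assigned to the display:

```
   (gdb) display/x abc
   (gdb) info display
   (gdb) undisplay 1
```

Here info display lists the active display expressions along with their numbers, and undisplay is the full name of the command abbreviated above as undisp.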

set variable - Assign a value to a variable. Sometimes it is useful to change the value of a variable while the program is running. For example if you have a variable "x", typing

   (gdb) set variable x = 15

will change the value of "x" to 15.

Other GDB commands to note

define - Define a new command. Use this if you want to define a new command. This is similar to assigning a macro and can save you typing. Just as with a macro, you can put together several commands. An example of defining a short form of the target remote command for attaching to System/161 was provided above. Some other more elaborate examples can be found in the gdbscripts directory in the OS/161 source tree.

help - Get help (on gdb, not your CS161 assignments!). To find out more about a particular command just type:

   (gdb) help <command name>

Note that several of the commands listed above (e.g. "info breakpoints") are specific instances of more general GDB commands, like "info" and "set".

GDB threads support

Nowadays GDB understands the concept of multiple threads; you can switch from one thread to another while debugging in order to inspect all the available execution contexts. The good news is: this now works with System/161. The not so good news is: GDB threads map to CPUs, not to OS/161 kernel threads. This means that while you can inspect all the CPUs that are executing, you still can't easily inspect sleeping threads. Unfortunately, GDB's thread model combined with the fact that the debugger talks transparently to System/161 makes this mapping more or less the only choice.

When stopped in GDB, list the threads like this:

   (gdb) info threads

This will give a display something like this:

     Id   Target Id         Frame 
     4    Thread 13 (CPU 3) wait () at ../../arch/mips/thread/cpu.c:279
     3    Thread 12 (CPU 2) wait () at ../../arch/mips/thread/cpu.c:279
     2    Thread 11 (CPU 1) wait () at ../../arch/mips/thread/cpu.c:279
   * 1    Thread 10 (CPU 0) runprogram (progname=0x800f3f40 "/bin/sh")
       at ../../syscall/runprogram.c:492

The leftmost column is the thread number that you need to use while talking to GDB, and the CPU number listed is the System/161 CPU number. Unfortunately the GDB thread numbers are offset by 1 at the start and may diverge further over time at GDB's whim. (The "Thread 13" number is the number used in the communications channel between GDB and System/161; you don't need to care about it. It's offset from the CPU number by 10 because if you use 0 GDB dumps core.)

To switch to another CPU, use the "thread" command with the number from the leftmost "Id" column:

   (gdb) thread 1
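
GDB can also run a single command on every thread (that is, here, on every CPU) in one go, which saves switching by hand. A minimal sketch:

```
   (gdb) thread apply all bt
```

This prints a backtrace for each CPU in turn, which is often the quickest way to see what the whole machine is doing when the kernel appears stuck.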

The GDB "text user interface"

GDB has a curses-based "text user interface" (TUI) that gives you a slightly less 1980 experience when debugging. This gives you a split-screen window with the current source file in the top part and the GDB prompt in the bottom part. Type ^X a (that's control-X followed by the letter a) after starting GDB to enable it, and type that again to make it go away.

In TUI mode most GDB commands work as before; there are just a couple of keystrokes you need to know about:

^L - redraw screen (like in many programs)
^X o - switch to other panel in the GDB window (like in Emacs)

By default the current panel is the source listing, but typing goes into the GDB prompt panel anyway; however, to use the arrow keys for input editing you need to switch to the GDB prompt panel.

There is also a graphical front end to GDB called DDD; it should be possible to use this with the OS/161 version of GDB. Unfortunately, no up-to-date information on this is currently available.

Debugging tips

Tip #1: Check your beliefs about the program

So how do you actually approach debugging? When you have a bug in a program, it means that you have a particular belief about how your program should behave, and somewhere in the program this belief is violated. For example, you may believe that a certain variable should always be 0 when you start a "for" loop, or that a particular pointer can never be NULL in a certain "if" statement. To check such beliefs, set a breakpoint in the debugger at a line where you can check the validity of your belief. And when your program hits the breakpoint, ask the debugger to display the value of the variable in question.
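
As a sketch of checking one such belief, suppose you believe a pointer is never NULL at a particular line. The file name foo.c, the line number, and the variable ptr are all placeholders:

```
   (gdb) break foo.c:42
   (gdb) c
   (gdb) p ptr
   (gdb) p ptr != 0
```

The last expression prints 1 if the belief holds at this stop and 0 if it does not.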

Tip #2: Narrow down your search

If you have a situation where a variable does not have the value you expect, and you want to find a place where it is modified, instead of walking through the entire program line by line, you can check the value of the variable at several points in the program and narrow down the location of the misbehaving code.
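
One way to sample a variable at several points without stopping each time is to attach commands to breakpoints, as in this sketch. The file name, line number, and variable x are placeholders; with no argument, the commands apply to the most recently set breakpoint:

```
   (gdb) break foo.c:100
   (gdb) commands
    > silent
    > printf "line 100: x = %d\n", x
    > continue
    > end
```

Repeat this for each point of interest and then continue. The first printed value that looks wrong tells you which two points the misbehaving code lies between.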

Tip #3: Walk through your code

Steve Maguire (the author of Writing Solid Code) recommends using the debugger to step through every new line of code you write, at least once, in order to understand exactly what your code is doing. It helps you visually verify that your program is behaving more or less as intended. With judicious use, the step, next and finish commands can help you trace through complex code quickly and make it possible to examine key data structures as they are built.

Tip #4: Use good tools

Using GDB with a visual front-end can be very helpful. For example, using GDB inside the Emacs editor puts you in a split-window mode, where in one of the windows you run your GDB session, and in the other window GDB moves an arrow through the lines of your source file as they are executed. To use GDB through Emacs do the following:

Start emacs.
Type the "meta" key followed by an "x".
At the prompt type "gdb". Emacs will display the message:
```
	Run gdb (like this): gdb
```
Delete the word "gdb", and type:
```
	os161-gdb kernel
```
So in the end you should have:
```
	Run gdb (like this): os161-gdb kernel
```
displayed in the control window.

At this point you can continue using GDB as explained above under "How to start debugging".

Tip #5: Beware of printfs!

A lot of programmers like to find mistakes in their programs by inserting "printf" statements that display the values of the variables. If you decide to resort to this technique, you have to keep in mind two things: First, because adding printfs requires a recompile, printf debugging may take longer overall than using a debugger.

More subtly, if you are debugging a multi-threaded program, such as a kernel, the order in which the instructions are executed depends on how your threads are scheduled, and some bugs may or may not manifest themselves under a particular execution scenario. Because printf outputs to the console, and the console in System/161 is a serial device that isn't extraordinarily fast, an extra call to printf may alter the timing and scheduling considerably. This can make bugs hide or appear to come and go, which makes your debugging job much more difficult.

To help address this problem, System/161 provides a simple debug output facility as part of its trace control device. One of the trace control device's registers, when written to, prints a notice in the System/161 output including the value that was written. In OS/161, provided your System/161 has been configured to include the trace control device, you can access this feature by calling trace_debug(), which is defined in dev/lamebus/ltrace.h. While this is less disruptive than calling printf, it is still not instant and can still alter the timing of execution. By contrast, the System/161 debugger interface is completely invisible; as far as your kernel is concerned, time is stopped while you are working in the debugger.

Tip #6: Debugging deadlocks

A deadlock occurs when two or more threads are all waiting for the others to proceed. (We'll talk more about this in lecture.) In a simple deadlock there'll be one thread T1 that already holds a lock L1, and another thread T2 that already holds a lock L2. T1 is waiting to get L2, so it can't proceed until T2 runs, and T2 is waiting to get L1, so it can't proceed until T1 runs. The goal of debugging a deadlock is to find out the identities of T1 and T2, and of L1 and L2, and then figure out what combination of circumstances and code paths allowed them to get into this state.

To do this, you generally want to begin by finding one of the threads, since there are usually a lot more locks in the system than threads. Each thread that's blocked reports what it's blocked on in the field t_wchan_name. The basic procedure is to locate a thread (the first ones to start with are the ones currently on the CPUs; after that look in your process table that you wrote for assignment 2), check what it's waiting for, then find who's holding that. If it's waiting for a lock, hopefully the locks you wrote record who's holding them. That gives you another thread to examine; repeat until you find a loop. If it's waiting on a CV, it gets a bit more interesting because you need to figure out who was supposed to signal that CV and didn't... or if they did, why the waiting thread didn't receive the signal. For deadlocks involving spinlocks, remember that in OS/161 spinlocks are held by CPUs, not threads.
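
In GDB terms, the start of that procedure might look roughly like the sketch below. Here t stands for whatever struct thread pointer you have located (for instance a local variable in one of the frames, or an entry in your process table) and lk for a pointer to the lock it turns out to be waiting on. Both names are placeholders, and what the lock structure records about its holder depends on your own implementation:

```
   (gdb) info threads
   (gdb) thread 2
   (gdb) bt
   (gdb) p t->t_wchan_name
   (gdb) p *lk
```

The backtrace shows what the CPU's current thread is doing, t_wchan_name says what a blocked thread is waiting on, and printing the lock structure should reveal the holder, provided your locks record it.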

Once you have the participants identified, the stack traces from those threads will usually (but, alas, not always) give you enough information to work out what happened.

It is often helpful to set up a global table of all locks and/or CVs (OS/161 already comes with a global table of all wchans) to make this process easier. It is also possible, and potentially worthwhile, to hack more state information into the thread structure. One can even write an automatic deadlock detector, although this probably has a bad cost/benefit ratio and won't help you much when CVs are involved.

Tip #7: Debugging assembly

When GDB stops in an assembly source file (a .S file) various special considerations apply. GDB is meant to be a source-level debugger for high level languages and isn't very good at debugging assembly. So various tricks are required to get adequate results.

The OS/161 toolchain tells the assembler to emit line number information for assembly files, but as noted above (under "list") you need to tell GDB how to find them. Do that. Then you can at least see the file you're working on.

It is also sometimes helpful to disassemble the kernel; type

   % os161-objdump -d kernel | less

in another window and page or search through it as needed.

To single step through assembler, use the nexti and stepi commands, which are like next and step but move by one instruction at a time.

The command x /i (examine as instructions) is useful for disassembling regions from inside GDB.

Print register values with $ and the symbolic register names as described above ($v0, $a0, etc.) to see the values that are being handled.
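
Taken together, a session stepping through assembly might look like this sketch:

```
   (gdb) display/i $pc
   (gdb) stepi
   (gdb) nexti
   (gdb) p/x $a0
   (gdb) x/8i $pc
```

The display/i $pc line makes GDB print the instruction about to be executed every time it stops, which makes up for much of what the source listing would otherwise tell you.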

Tip #8: trace161

The trace161 program is the same as sys161 but includes additional support for various kinds of tracing and debugging operations. You can have System/161 report disk I/O, interrupts, exceptions, or whatever. See the System/161 documentation for more information.

One of the perhaps more interesting trace options is to have System/161 report every machine instruction that is executed, either at user level, at kernel level, or both. Because this setting generates vast volumes of output, it's generally not a good idea to turn it on from the command line. (It is sometimes useful, however, in the early stages of debugging assignment 2 or 3, to log all user-mode instructions.) However, the trace options can be turned on and off under software control using the System/161 trace control device. It can be extremely useful to turn instruction logging on for short intervals in places you suspect something strange is happening. See dev/lamebus/ltrace.h for further information.

Tip #9: Casting void pointers

If you have a void * in GDB and you know what type it actually is, you can cast it when printing, using the usual C expression syntax. This is very helpful when dealing with arrays, for example.
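
For example, assuming ptr is a void * that you know really points to a struct thread, and arr is a hypothetical array of void * whose elements are thread pointers:

```
   (gdb) p *(struct thread *)ptr
   (gdb) p ((struct thread *)ptr)->t_wchan_name
   (gdb) p *(struct thread *)arr[0]
```

As in C, the subscript binds more tightly than the cast, so the last line casts element 0 of the array and then dereferences it.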

Tip #10: Tracing through exceptions

When you get a stack backtrace and it reaches an exception frame, GDB can sometimes now trace through the exception frame, but it doesn't always work very well. Sometimes it only gets one function past the exception, and sometimes it skips one function. (This is a result of properties of the MIPS architecture and the way GDB is implemented and doesn't appear readily fixable.) Always check the tf_epc field of the trap frame to see exactly where the exception happened, and if in doubt, cross-check it against a disassembly or have GDB disassemble the address.
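
A sketch of that cross-check, assuming you have selected a frame in which the trap frame pointer is visible under the name tf (the variable name is an assumption; use whatever your frame actually shows):

```
   (gdb) p/x tf->tf_epc
   (gdb) info line *tf->tf_epc
   (gdb) x/4i tf->tf_epc
```

The info line command maps the address back to a source file and line, and x/4i disassembles the instructions starting at the one that took the exception.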

Where to go for help

For help on GDB commands, type "help" from inside GDB. The full GDB manual is available online, and of course your friendly TFs are always there to help!