cyoung@research.bell-labs.com
Bell Laboratories
Researchers need tools to observe program behavior. Many hot research topics, including just-in-time compilation, run-time code generation, instruction set emulation and translation, and walk-time optimization require or benefit from run-time information about programs.
Halt, the Harvard Atom-Like Tool, allows SUIF users to instrument many aspects of program behavior. Users indicate interesting parts of the program by labeling them with SUIF annotations. Then Halt looks for these annotations, and inserts function calls to analysis routines that match the type of the annotation. Halt ships with a number of useful analysis routines; users may modify these or supply their own. Using different analysis routines, Halt provides a number of hardware simulators, performs branch stream analysis, and records statistics for profile-driven optimizations. Halt and its associated libraries have been used in projects on branch prediction, code layout, instruction scheduling, and register allocation.
Researchers have used a variety of techniques to observe and analyze the behavior of program runs. Historically, each technique required its own special-purpose analysis tool, but with the release of Atom by Digital Equipment Corporation [ See A. Srivastava and A. Eustace. "ATOM: A System for Building Customized Program Analysis Tools," Proc. ACM SIGPLAN `94 Conf. on Prog. Lang. Design and Impl. New York: ACM, June 1994. ], researchers gained access to an extremely flexible but still reasonably low-overhead software instrumentation tool. Unlike prior instrumentation systems, Atom allowed users to define their own analysis routines, linking them with the program to be analyzed. Atom's flexibility and low overhead also encouraged researchers to perform on-line analysis, where all processing was done during program run time, rather than recording and storing a trace of the program, then post-processing the trace off-line. Atom thus allowed its users to gather virtually any information they wanted, and left the choice of on-line or off-line analysis to them.
Many recent approaches to improving microprocessor performance have attacked or blurred the traditional lines between run-time (hardware) and compile-time (software) techniques. Digital's CPI [ See J.M. Anderson, L. Berc, J. Dean, S. Ghemawat, M. Henzinger, S.-T. A. Leung, D. Sites, M. Vandevoorde, C. Waldspurger, and W. E. Weihl. "Continuous profiling: Where have all the cycles gone?" Technical Note 1997-016. Digital Equipment Corporation Systems Research Center, Palo Alto, Calif., July 1997. ] and Harvard's Morph project [ See X. Zhang, Z. Wang, N. Gloy, J. B. Chen, and M. D. Smith. "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symposium on Operating Systems Principles. New York: ACM, Oct. 1997. ] collect profile information using statistical sampling techniques at very low overhead; these profiles can then be used later to optimize the program. Machine emulators such as Digital's FX!32 [REF] initially interpret all of a program's code, but then translate the most frequenlty executed parts of a program into native, optimized machine code. The general philosophy of "walk-time optimization" [ See J.A. Fisher. "Walk-Time Techniques: Catalyst for Architectural Change." IEEE Computer 30.9 (September 1997): 40-42. ] seeks to support much more variety in hardware implementation than was previously thought possible, while still providing high performance for a single distributed software image.
Admittedly, Halt is much more of a traditional profiler or simulation support package than a system that blurs the boundary between hardware and software. However, using the dynamic information collected by Halt, researchers can explore these kinds of techniques in the SUIF compiler system.
The SUIF compiler [ See R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S. Liao, C. Tseng, M. Hall, M. Lam, and J. Hennessy. "SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers," ACM SIGPLAN Notices, 29.12 (Dec. 1994): 31-37. ] is a research compiler that has been used to investigate automatic parallelization, interprocedural analysis, and memory disambiguation. It is also one of the major components of DARPA and NSF's National Compiler Infrastructure Project. The Machine SUIF extensions (henceforth machsuif) add support for machine-dependent optimizations to SUIF; machsuif has been used for research in static branch prediction, code layout, instruction scheduling, and register allocation.
Halt and its utilities currently work with SUIF 1.1.2 and the machsuif 1.1.2 extensions. We support files in low SUIF and machsuif formats; we do not currently support high SUIF and we have not tested Halt with any of the beta releases of SUIF 2.0. We have used Halt almost exclusively on the Digital Alpha platform in machsuif; the MIPS platform was built for historical reasons and no longer supports active research at Harvard (although others are using it).
See How Halt works describes how Halt works. See Supported instrumentation describes the information that Halt can collect. See Applications lists applications where Halt has been used. See Porting Halt describes how to port Halt to new platforms and how to add new instrumentation types to Halt. See The haltsuif distribution describes the Halt distribution, which includes support for many of the more mature applications 1 .
If you are reading this document, then we assume that you are familiar with the SUIF compiler [ See R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S. Liao, C. Tseng, M. Hall, M. Lam, and J. Hennessy. "SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers," ACM SIGPLAN Notices, 29.12 (Dec. 1994): 31-37. ]. Since Halt and most of its applications were originally designed to work with machsuif [ See M. Smith. "Extending SUIF for Machine-dependent Optimizations," Proc. First SUIF Compiler Workshop. Stanford University, Jan. 1996. ], it is helpful to know machsuif. However, Halt also works with low SUIF, so one can browse the machsuif examples for ideas for the low SUIF domain.
All Halt users should read all of section 2. Sections See Supported instrumentation , See Applications , and See The haltsuif distribution can be read selectively based on the platform and application you have in mind. See Porting Halt need only be read if you want to use Halt with a new platform or if you want to add a new instrumentation type to Halt.
This section describes how Halt transforms a SUIF program into an instrumented SUIF program. There are three major steps: labeling the program, instrumenting the labeled points, and linking the program with the analysis routines.
Label annotations tell Halt where to place calls to analysis routines. Most label annotations are attached to SUIF instructions (so their analysis routine gets called whenever the instruction executes), but annotated procedure symbols (SUIF proc_sym 's) tell Halt to place an analysis call at all of a procedure's entry points and exits. The name of the label annotation (an ASCII string) determines the kind of instrumentation statistics that are collected at run-time and also determines the name of the analysis routine that will be called 2 .
In addition to specifying the kind of instrumentation to perform, the label annotation can also hold numeric values in the annotation's list of immediate values (SUIF immed_list 's). These additional values are compiled into this particular instrumentation point as parameters to be passed to the analysis routine. The most obvious use of these static values is to uniquely identify the current instrumentation point to the analysis routine. The static values can also be used to pass simulation information (e.g., cache line numbers, cycle counts, and instruction counts) to the analysis routines.
Labels can be applied in one of two ways: use the label pass that ships in the Halt distribution, or write your own SUIF pass that attaches labels to the program points in which you are interested. The label pass performs wholesale labeling and numbering of all instrumentation points of a specified kind in a program or procedure. If your research interests focus on particular parts of a program, then you most likely already have a SUIF pass that knows how to find these parts, so modifying your SUIF pass to add label annotations is simple.
Command line arguments to the label pass are described in See Supported instrumentation .
Once a program has been labeled, the halt pass inserts calls to analysis routines wherever it finds label annotations in the program. Calls to analysis routines need to be safe and they need to collect correct information. By safety, we mean that the inserted code that calls analysis routines must not change the semantics of the program by overwriting program state or throwing exceptions. Both safety and data collection must be handled in a machine-specific manner.
Maintaining safety requires a few implementation assumptions. In machsuif programs, halt currently assumes that all microprocessor state is described by the machsuif register implementation file and that it is safe to overwrite memory past the current top of the machine stack. Halt needs the first assumption to ensure that it preserves all values that might be overwritten by the analysis routine; the second assumption gives a convenient place to spill registers and to build a call frame for the analysis routine 3 . Safety in low SUIF is simpler: halt reads low SUIF programs in as flat lists, then promotes to variables any temporaries whose live ranges span the instrumentation point.
To obtain reasonable performance, halt performs live variable analysis on all registers at each instrumentation point. It spills all live registers on the stack before calling the analysis routine then restores all registers to their original values after the analysis routine returns. Note that even live temporary registers must be saved, because the analysis routine may destroy them when it executes, but the instrumented program does not know that a call has been inserted. Saving only live registers reduces the memory traffic around each instrumentation point.
After ensuring safety, halt then builds a call frame for the analysis routine. Depending on the kind of analysis routine, different run-time (dynamic) information will be passed as the first set of arguments to the routine. For example, conditional branch polarity, multiway branch target number, and load/store address and data might be gathered and passed as arguments. Each piece of dynamic information is collected in a machine-dependent manner. After the dynamic information, halt passes any static parameters that were part of the label annotation. Placing the dynamic information first gives some flexibility in analysis routines: all analysis routines for a particular kind of label annotation will always begin with the same type of arguments, yet the meaning of the static information can vary depending on the contract between the labeling pass and the analysis routines.
To ensure desired operation on some machines, halt will not always instrument the exact instruction that was labeled, and it might choose to instrument before or after the labeled instruction. Load instructions must be instrumented after they execute so that the load data can be seen; stores do not have this constraint. Conditional branch instructions must be instrumented before they execute to ensure that the analysis routine is called 4 . And multiway branches are currently instrumented at the point where the index into the multiway jump table is computed, rather than where the branch is actually executed; this requires support from the machsuif code generator to annotate the register that holds the index value. In all cases, we have tried to make halt pick the most intuitive or useful instrumentation points, but this may become an issue if you try to use halt in a new way.
After running halt on the labeled program, the result is a valid SUIF program that can be linked with analysis routines. As described in See Applications , halt comes with a number of useful analysis libraries. You are free to use these, to modify them to your purpose, or to build your own as you see fit. Analysis routines can be linked statically or dynamically; the standard linker and loader on your system are all that is required. Prototypes for the analysis routines are described in See Supported instrumentation .
This section describes the different program points that can be instrumented using Halt, the label annotations that make halt insert each of them, the dynamic information (and often the typical static information) that each supplies, and any special options to the label or halt passes associated with these annotations.
label [-p <proc>] [-unique <start>] [<labelopts>] [<in1> <out1>] ... [<inN> <outN>]
The label pass provides wholesale labeling and unique numbering of common instrumentation points in a SUIF file. The <labelopts> tell label where to place annotations; these options are described in the following subsections. The label pass operates on a list of filename pairs, reading each <in> file and writing each of the <out> files specified.
The label pass takes two global options. The " -p " option tells label the name of a single procedure which should be labeled during this run. Only instructions inside procedures that exactly match the name <proc> will be labeled. Note that some multiple-file programs have functions in different files with the same name; users of label must handle such complexities themselves.
The " -unique " option tells label the value at which to start assigning unique numbers; the default is zero. The label pass always assigns numbers sequentially within a single run, and it prints the lowest unused unique number to standard output at the end of its run. This feature allows you to run label on different files of a source program but still ensure globally unique numbers.
Each run of the label pass generates one sequence of contiguous unique numbers. So by specifying multiple <labelopts> for a single run, you can force different kinds of instrumentation points to share numbering spaces. For example, running
label -cbr -mbr hello.ag hello.al
would provide one numbering for all conditional and multway branches in a program. No conditional branch would share a number with a multiway branch. To have completely separate spaces, use separate runs of the label pass:
so that the first conditional branch and the first multiway branch will both be assigned the same unique number, zero.
halt [-main <main_func_name>] ...
No annotation is associated with this analysis call; it gets attached to any function called main() in any program instrumented using halt . Typically, _bt_startup() routines allocate space, initialize variables, and install signal handlers. All analysis libraries must define this function, even if it does no work.
Some programs, notably those translated using the SUIF f2c Fortran front-end, have a different name for their main function. Halt allows users to specify the name of this function using the "- main <main_func_name> " command-line option. So for such a Fortran program, users might specify, " -main MAIN__ ".
There is no _bt_cleanup() call. To ensure that a cleanup routine runs when the instrumented program exits, use an operating system function such as UNIX's atexit(2) .
ANNOTE(k_branch_unique_num, "branch_unique_num", TRUE);
ANNOTE(k_mbr_unique_num, "mbr_unique_num", TRUE);
extern void _bt_record(int cond,...);
extern void _mbr_record(int target,...);
These two annotations request instrumentation that records conditional branch directions and multiway branch index numbers, respectively. The dynamic information is just the conditional branch polarity ( cond ) or the multiway branch index ( target ), respectively. Polarity and index numbers are decided by the machsuif cfg library [ See G. Holloway and C. Young. "The Machine SUIF control flow graph and data flow analysis libraries," Proc. Second SUIF Compiler Workshop. Stanford University, Aug. 1997. ]; typically this means 0 for conditional branches that fall through, 1 for conditional branches that jump, and the index into the multiway branch jump table for multiway branches.
By convention (but not requirement), a unique branch number is the first static argument on the branch_unique_num and mbr_unique_num annotations; these unique numbers distinguish among different static branches or multiway branches in the program. The label pass provides exactly this unique branch number information. Users are free to add additional information after the unique branch numbers or to replace them with other information more relevant to their application.
ANNOTE(k_proc_unique_num, "proc_unique_num", TRUE);
ANNOTE(k_ret_unique_num, "ret_unique_num", TRUE);
extern void _bt_call(unsigned proc_num,...);
extern void _bt_return(unsigned proc_or_ret_num,...);
The behavior of label and halt on procedure calls and returns is not quite intuitive. In the label pass, the -proc option specifies that all proc_sym 's should receive proc_unique_num annotations, each with a unique number. If the additional -ret option is specified, then all return instructions also get a ret_unique_num annotation with their own unique number; without the -ret annotation, no label annotations are put on return instructions.
When halt runs, it checks each procedure for a proc_unique_num annotation on its proc_sym . Any such procedure gets call and return instrumentation on all of its entry and exit points, respectively. The procedure's entry point will always receive the procedure's unique number passed as the first static argument. But return instructions get different first static argument depending on whether a ret_unique_num annotation is present. If one is present, then the return gets the specified unique number as its first static argument. Otherwise, the return gets the procedure's unique number as its first static argument.
This unintuitive behavior was designed to allow Halt users to distinguish among different return points in a procedure. However, it does not satisfactorily address procedures with multiple entry points. We have not yet needed to handle such cases, but a more general strategy will be needed when we encounter them.
ANNOTE(k_setjmp_call, "setjmp_call", TRUE);
ANNOTE(k_longjmp_call, "longjmp_call", TRUE);
extern void _bt_setjmp(jmp_buf);
extern void _bt_longjmp(jmp_buf, int);
Some analysis packages need to track the currently-executing procedure. For example, the Halt cache simulator relies on the current procedure and an offset into the procedure to generate effective addresses. But the setjmp() and longjmp() routines change the control location from one program counter and stack pointer to another. To correctly keep track of the current procedure, these calls need to be visible to the analysis routines.
The _bt_setjmp() and _bt_longjmp() analysis routines take no user-specified static arguments. Instead, they get the same arguments that are about to be passed to the actual setjmp() and longjmp() routines. By remembering and mimicking the effects of setjmp() and longjmp() on execution, the analysis routines can tell where the program is executing.
The label pass will annotate any calls to routines named setjmp() or longjmp() that it sees if the -setlongjmp option is specified. The halt pass adds matching calls to the _bt_setjmp() or _bt_longjmp() routines wherever it finds setjmp_call or longjmp_call annotations.
ANNOTE(k_load_unique_num, "load_unique_num", TRUE);
ANNOTE(k_store_unique_num, "store_unique_num", TRUE);
extern void _ld_record(unsigned num, long int ea,...);
extern void _st_record(unsigned num, long int ea,...);
extern void _ld_record(unsigned long ea, unsigned long data);
extern void _st_record(unsigned long ea, unsigned long data);
The load and store annotations have two dynamic components: the effective address ( ea ) and the value loaded or stored ( data ). As of this writing, this interface is old and hasn't been used in a while; it probably doesn't work right now.
ANNOTE(k_bb_unique_num, "bb_unique_num", TRUE);
ANNOTE(k_cycle_count, "cycle_count", TRUE);
These two analysis calls have no dynamic information associated with them; users must fill in the static information depending on their application. The label application will uniquely number the first instruction of each basic block in each procedure when given the -bb option. Basic blocks are determined by using the machsuif CFG library [ See G. Holloway and C. Young. "The Machine SUIF control flow graph and data flow analysis libraries," Proc. Second SUIF Compiler Workshop. Stanford University, Aug. 1997. ]. The label pass never generates cycle_count annotations; this annotation and the corresponding call were added to support cycle count, instruction count, and cache simulation for instruction scheduling research.
This section describes a number of ways that Halt has been used.
The eventinfo pass in the Halt distribution is not actually an application; instead, it provides a convenient text file format and data structure that allow SUIF passes to verify the integrity of profiles and to connect the unique numbers generated by the label pass back to instructions in SUIF files.
The eventinfo pass takes only a list of SUIF input files as arguments. It scans these input files and prints one line for each control-related label annotation that it finds. Lines contain a single starting character, at least two integers, optional additional integers, and an optional final string. The first character is one of the letters p , r , c , or m , depending on whether the label annotation was a procedure, return, conditional branch, or multiway branch, respectively. The first integer is the unique number assigned to the annotation. The second integer states how many CFG successors the instruction has (usually 1 for procedures, 0 for returns, and 2 for conditional branches). Additional integers list the unique numbers of the CFG successors of the label instruction. The terminal string appears only if the event was a procedure, and it takes the form, <file_name>:<procedure_name> .
Here is an example of the output of eventinfo for a simple program with an if/then/else inside of a do/while loop:
The output of the eventinfo pass is a textual representation of the CFG of each procedure. The first line indicates that procedure main() in the source file simple.c got unique number 0 , that there was exactly one entry point to the procedure (the first 1 on the line) and that the first control instruction reached after entering main() had unique number 1 (the second 1 on the line). The second line is the conditional branch inside the loop: its unique number is 1 and it has two successors, both of which are the loop conditional branch with unique number 2 . The third line is the loop conditional branch with unique number 2 ; its two successors are the return instruction and the conditional in the body of the loop. The last line indicates the return instruction has unique number 3 and has no CFG successors.
Some of the Halt profilers and simulators and all three optimizations use the output of eventinfo .
Halt was initially designed as a profiling tool, so unsurprisingly there are a number of pieces of code that support tracing and profiling that are part of the distribution.
Perhaps the simplest kind of instrumentation is tracing all events of a certain kind in a program. And debugging using print statements is a time-honored practice. Using the inst/inst-print. c instrumentation file and exhaustive conditional branch labeling, Halt will modify your program so that it prints the unique number and direction of all conditional branches to stderr .
As a baseline for my work comparing point profiling with path profiling techniques, I built a couple of simple node and edge profilers. These require that multiway and conditional branches be labeled and that you link with libpath/libpath.{a,so} and with inst/inst-node.c or inst/inst-edge.c . Both of these analysis packages expect to find a file called branch.info in the current directory that holds the output of the eventinfo pass. When the program completes, frequency statistics identified by event numbers are written to node.info or edge.info , respectively.
The Halt distribution ships with a copy of the path profiling library that I built for my thesis. Unlike the other, single-file profilers, the path profiler has multiple source files, takes up its own directory ( libpath ), and links as a library without another object file (all three use library routines for handling the output of eventinfo; libpath contains the path profiling analysis routines, which are not used when you link with inst-node.c or inst-edge.c ). Like the node and edge profilers, the path profiler also expects to find a file called branch.info in the current directory that holds the output of the eventinfo pass. And similarly to them, it produces a text file with path frequency information called path.info .
The path profiler uses an environment variable, K_VALUE , to determine the kind of paths to collect. If K_VALUE is set to a non-negative number, then the path profiler collects paths of that history depth. 5 When K_VALUE is set to -1, the profiler collects frequencies for simple paths, which are paths that have no duplicated nodes. When K_VALUE is set to -2, the profiler collects frequencies for trails, which are paths that have no duplicated edges. The latter two options are experimental, and turned out not to be used in my research.
The path profiler has some implementation flaws that could be fixed or made more general. In particular, it uses many inheritance features of C++; this leads to a large amount of data structure bloat. It also includes a number of debugging assertions that are expensive to check and could be turned off to improve performance. One other important implementation aspect is that paths are stored in a compact form where the sequence of branch targets is represented using a packed bit representation that must fit into a machine register. This places an upper bound on history depth of 32 or 64 on many machines; actual representable paths may be shorter because multiway branches consume more bits in the packed representation.
Because of its Atom-like approach, Halt also turns out to be useful for building simulators. This subsection discusses the branch prediction scheme simulator, instruction cache simulator, support for processor cycle simulation, and support for emulating non-excepting instructions.
Doing branch prediction hardware research is becoming easier and easier. Tools like Atom and Halt provide trivial access to the stream of conditional branches and directions. One need only model the branch prediction hardware, then compare the predictions with the outcomes of branches. The hsim directory includes a number of classes and definitions that parameterize branch prediction schemes. Hsim is designed to run either online through pipes from a branch traced program or offline using a trace that has been stored to disk. Online operation saves disk space, while offline operation ensures repeatability of experiments when different branch prediction schemes are tried.
[CLIFF: clean up this directory, document how to set the pipes up or store things]
Like branch prediction simulation, instruction cache simulation is also very simple. Halt provides indirect access to instruction counter values when branches or basic blocks are instrumented. For the cache simulations in the distribution, we piggyback the cache simulation on the existing instrumentation points for conditional and multiway branches. To do this, another pass ( analyze_cache for branch prediction and super for instruction scheduling research) adds extra static information to each conditional branch or multiway branch label annotation. To capture cache operations from blocks that do not end with a branch, the additional pass also adds new cycle_count annotations to such blocks. The static information indicates offset from the start of the procedure in instructions and the number of sequential instruction reads that were performed for the block.
The instruction cache simulator is a set of functions that are bundled into both scbp3/inst-predict.c and super/inst-cycle.c , the analysis routines for branch prediction research and for scheduling research. We have not separated them out, but it would be simple to cut-and-paste to do so.
The cache simulator expects to find a file called "src /proc.addr " that lists all of the procedures in the program and their starting addresses. The environment variables HALT_CACHE_BYTES and HALT_CACHE_LINESIZE describe the size of the cache in bytes (instructions are assumed to be 4 bytes long) and the size of a cache line in bytes, respectively. Only direct-mapped caches are modeled, but associativity could be modeled by augmenting the current code. When the program runs, the analysis routines add the procedure start addresses to the offsets of each block to generate cache addresses. These cache addresses drive the cache simulation. When the program exits, a summary of cache performance (raw hits, raw misses, and miss rate) is written to standard error.
For instruction scheduling research, it is useful to count the number of cycles and instructions that it takes to complete a program. Simulating this is typically done simplistically, by recording these counts per block in the program, then running an instrumented program to multiply each of these counts by the execution frequency of its corresponding block. In addition to the cache information supplied above, the super pass also attaches static information about cycle and instruction counts to the same set of instrumentation points. Then inst-cycle.c , the analysis package for processor cycle simulation, totals these counts at runtime.
At the end of the program run, inst-cycle.c prints the cycle and instruction totals to standard error.
Many instruction scheduling research projects assume an architecture that supports non-excepting variants of instructions [ See W. Hwu, et al. "The Superblock: An Effective Technique for VLIW and Superscalar Compilation." The Journal of Supercomputing 7(1/2). Boston: Kluwer Academic Publishers, May 1993. ]. But no processor currently supports a full set of such instructions. This seems a problem: running on real hardware is desirable because hardware is fast, but we cannot change the hardware to support new instructions.
We use compiled simulation to support non-excepting instructions. Compiled simulation uses hardware wherever possible to execute instructions, then emulates unimplemented features through software. To emulate non-excepting instructions, the analysis routine _bt_startup() installs a handler for the UNIX segmentation violation exception ( SIGSEGV ). 6 The handler suppresses such exceptions, allowing the program to continue execution. This emulation strategy is not complete: it supresses all exceptions, not just the ones caused by speculative code motion. If the original program threw an exception, this emulation strategy will mask that in the emulated program. But for research purposes on stable benchmarks, this is not an issue.
Depending on how many exceptions must be suppressed this way, compiled simulating code runs with varying degrees of slowdown. But it seems likely to do better than interpreting every instruction in a software simulator. For more details on compiled simulation, please see my thesis.
I built a number of optimizations that use the profiles collected by Halt. Each is described below, but they are not currently shipped with the Halt distribution. If you are interested in testing or experimenting with these optimizations, please contact me. The bbp pass was used for research in optimal branch alignment; the scbp and super passes were used in my thesis for path-based branch prediction and path-based superblock scheduling, respectively.
Basic block placement is one form of code placement, an optimization that attempts to reduce either cache misses or pipeline stalls. The latter form of code placement problem is called branch alignment. The bbp pass supports Pettis and Hansen's style of basic block placement [ See K. Pettis and R. C. Hansen. "Profile Guided Code Positioning," Proc. ACM SIGPLAN'90 Conf. on Prog. Lang. Design and Impl. New York: ACM, June 1990. ]; it also generates Traveling Salesman Problems (TSPs) by reducing branch alignment to TSP. For our research, we used a separate TSP solver and lower-bound package; we hope to integrate the TSP analysis and lower bounds into the bbp pass eventually.
bbp [-proc <proc_name>] [-train <profile>] [-test <profile>]
[-pre|post edge|dense|sparse|layout|cfg|vcg|freq]
[-layout greedy|try15|iter3|<file>] [-invert_unums]
<in0> <out0> ... <inN> <outN>
There are many options to the bbp pass. The -proc option specifies the name of a single procedure to lay out; all other procedures are ignored. The -train and -test options specify the files that hold training and testing profiles; these profiles have the format produced by the edge profiling instrumentation. The training profile is used by profile-driven layout decisions; the testing profile is used to calculate how well the layout decisions work on a different run of the program. The -pre and -post options specify what information (if any) should be printed out about the program before and after layout. The edge option prints edge frequencies from the training profile. The dense option prints out a TSP matrix for each procedure that describes the branch alignment problem. The layout option prints the CFG block numbers in layout order. The cfg option prints out the control flow graph in text form. The vcg option prints out a file that can be viewed using the VCG (Visualization of Compiler Graphs) tool [ See G. Sander. "Graph Layout through the VCG Tools," Technical report A03/94, Universitat des Saarlandes, FB 14 Informatik, 1994 available via ftp from ftp.cs.uni-sb.de, file /pub/graphics/vcg/doc/tr-A03-94.ps.gz. ]. And the freq option prints the edge frequency matrix; this is an NxN matrix that holds the execution frequency that the jth block executed immediately after the ith block at position (i, j).
The next set of options control how bbp transforms the input program files. The -layout greedy option performs Pettis and Hansen-style basic block placement. The try15 and iter3 options are not currently implemented; they will eventually support Calder and Grunwald's try15 algorithm and an integrated version of the TSP-based code layout solution that I developed with David Johnson, David Karger, and Michael D. Smith [ See C. Young, D.S. Johnson, D.R. Karger, and M.D. Smith. "Near-optimal Intraprocedural Branch Alignment," Proc. ACM SIGPLAN `97 Conf. on Prog. Lang. Design and Impl. New York: ACM, June 1997. ]. The -layout <file> option looks for a matching file name that lists permutations of the blocks of the procedure; this allowed the separate TSP solver to provide layouts in our research. The -invert_unums option specifies that conditional branches whose tests have been inverted by the layout step should have their unique numbers negated; this allows correct branch polarity statistics to be collected by later passes.
Static correlated branch prediction uses path profile information to transform a program so that branches are more strongly biased. It does this by duplicating blocks in the program, then connecting the new copies so that paths with strong biases connect to blocks with matching static predictions. SCBP trades off space for time: the resulting program is larger (and thus harder for the memory hierarchy to handle) but more predictable (and thus easier on the fetch unit of a modern microprocessor).
SCBP is implemented as three passes: the first analyzes the CFG of each procedure and builds a summary file, the second interprocedurally decides where to duplicate blocks in the program to trade off code expansion for predictability, and the last performs the transformations on the SUIF code that were selected by the second pass.
scbp2 [-bias <thresh>] [-over <cumul> <marg>] [-allow_mbrs]
<in0> ... <inN>
scbp3 <scbp2out> <in0> <out0> ... <inN> <outN>
The scbp1 pass expects a single SUIF file as its argument. It produces a summary on standard output. The summaries of all the files in the program should be concatenated together into a file called funnel.info before scbp2 runs. scbp2 needs to read all of the SUIF files that make up the program, but it does not write any of them. The -bias and -over options specify floating point numbers that tune the kind of pruning performed by scbp2 ; the defaults are 0.5 for <thresh> , 0.95 for <cumul> , and 0.01 for <marg> . The -allow_mbrs flag enables correlated multiway branch prediction. scbp3 expects the standard output from scbp2 as its first argument.
Many instruction schedulers use profiles to direct their attention to frequently executed portions of the program, where it is worth paying in terms of program size and optimization time to make the final program run faster. Path profiles give more detailed information about the shape of hot regions in the program, and they naturally match the scheduling regions of trace schedulers [ See J.A. Fisher. "Trace Scheduling: A Technique for Global Microcode Compaction." IEEE TOCS C-30.7 (July 1981): 478-490. , See P. Lowney, S. Freudenberger, T. Karzes, W. Lichtenstein, R. Nix, J. O'Donnell, and J. Ruttenbeg. "The Multiflow Trace Scheduling Compiler," The Journal of Supercomputing 7(1/2). Boston: Kluwer Academic Publishers, May 1993. ] and their relatives, superblock schedulers [ See W. Hwu, et al. "The Superblock: An Effective Technique for VLIW and Superscalar Compilation." The Journal of Supercomputing 7(1/2). Boston: Kluwer Academic Publishers, May 1993. ]. For my thesis I built a superblock scheduler, then I showed how forming superblocks using path profiles improved performance over traditional point profiles.
select -manual|basic|mutual|path [-taildup]
[-enlarge] [-ENLARGE <P> <L> <T>] [-wholesale]
[-procfile <proclist>] [-edge] [-vcg]
<profile> <in0> <out0> ... <inN> <outN>
The select pass forms superblocks. Forming superblocks involves two things: marking the instructions in the superblock with the same sched_region annotation and number, and possibly duplicating code to form larger superblocks.
The first argument to select determines how initial superblocks are formed. The -manual option invokes a simple interactive interface that chooses superblocks. The -basic option makes each maximal basic block (ignoring calls) a superblock. The -mutual option uses the profile-driven mutual-most-likely heuristic to choose superblocks, and the -path option uses path profiles to choose superblocks. The -taildup option tells select to perform tail duplication, which removes side entrances to the initially-selected superblocks by making extra copies of blocks.
The next set of options determine how to enlarge superblocks; superblock enlarging optimizations attempt to make more instructions available to the compaction step by making extra copies of blocks. The -enlarge option performs superblock loop peeling, superblock loop unrolling, and branch target expansion. The -ENLARGE option allows each of these to be switched on and off individually; the <P> argument is the threshold between peeling and unrolling a loop; the <L> argument is the maximum number of times that a loop can be unrolled; and the <T> argument is a boolean flag that enables or disables branch target expansion. The -wholesale option unrolls only frequently-executed innermost loops, but it copies the entire loop body rather than a single superblock in the loop.
The -procfile option specifies a file that lists procedures that should have superblocks formed; if the current procedure's name does not occur in the file then each basic block will be marked as a superblock and no superblock enlarging optimizations will be performed. The -procfile option allows selective formation of superblocks; since superblock formation expands the program, it is best used sparingly. The -edge option tells select to add branch_edge_weight annotations to each conditional branch in the program based on the profile information. And the -vcg option tells select to produce a VCG file for each procedure that it processes.
The select pass expects a profile of the form produced by the path profiler. If your work requires only point profiles, run the path profiler with the environment variable K_VALUE set to 0. This produces an edge profile in the proper format.
super -local -simple -cache -cycle
-proc <proc> -proc_range <min> <max>
-region <num> -reg_range <min> <max>
The super pass compacts the superblocks formed by the select pass, seeking to shorten the cycle count of the program by packing each cycle with useful work. The -local option tells super to ignore any sched_region annotations and schedule each basic block by itself (local scheduling), instead. The -simple option specifies that super should use a simplified resource model rather than the accurate target machine model when scheduling and computing cycle and instruction counts. The -cache and -cycle options indicate whether cache or cycle simulation annotations should be attached to the scheduled program.
The next four lines of options allow the user to selectively enable and disable scheduling in a variety of scopes. These switches turn out to be invaluable when debugging incorrect scheduling motions; binary search for the part of the program that was incorrectly scheduled can very quickly isolate seemingly nasty problems. The -proc option indicates a single procedure to schedule. The -proc_range option enables scheduling on procedures that are lexically greater than or equal to <min> and less than <max> . The -region option selects a single region number (superblock number) to schedule; the -reg_range option selects a range of region numbers. The -reg_list option expects a list of region numbers terminated by the keyword end . The -block_list option expects a list of block indices within the region to schedule; this list must also be terminated by the keyword end .
The last line of options to super enable or disable three types of register renaming. Note that these renaming optimizations work only in the prescheduling pass, before register allocation; super has no support for scavenging unused physical registers after register allocation has been performed. The -move option enables move renaming, where instructions that use the target of a move operation are modified to use the source of the move instead; this allows those dependent instructions to issue earlier. The poorly-named -rename option enables renaming for registers that are live off-trace; this enables some upward code motions that would otherwise destroy important program state. And the -antiout option enables general renaming to remove anti and output data dependencies.
There are two ways to extend Halt's functionality: port it to a new platform, or add a new kind of instrumentation. They require different kinds of work.
Halt depends on other pieces of infrastructure to work on a new platform. In machsuif, a new target requires representations of machine instructions, implementation specifics (data files that describe registers and other machine parameters), and a number of new machine-specific passes (code generation from low SUIF, assembly-machine expansion, and stack frame finishing for example). Once these pieces exist, porting Halt to work with them is not very difficult.
The most complex task in porting is the "sandboxing" necessary to make the call to the analysis routine safe at the instrumentation point. This entails generating instructions that save register state in a safe place, set up the call frame for the analysis routine, then restore the registers after the analysis routine completes. It is not too much more difficult to add live variable analysis to perform the minimal set of saves and restores. The halt pass source files are factored into a machine-independent file ( halt.cc ) and several platform-specific files ( halt-alpha.cc , halt-mips.cc , etc.). The easiest way to perform a port is to copy and modify an existing machine-specific file, then add the matching dispatch hooks in to the machine-independent file to call the modified machine-dependent routines.
Adding new kinds of instrumentation to Halt requires three things: a new annotation name, a function prototype for the analysis routine that describes the dynamic parameters, and installing a routine that inserts instructions that pass the dynamic parameters inside of the sandboxing code. This is also reasonably modularly separated in the halt.cc and halt-*.cc files. However, there are a number of subtleties in generating the code that collects dynamic information, primarily having to do with saving or overwriting registers that hold needed data. For example, you may need scratch registers to synthesize a dynamic value, e.g., a branch polarity of 0 or 1. The Alpha code that captures branch polarities keeps track of the registers that hold the branch conditions and generates different code depending on where values start, so that the correct value is always produced.
The label/, eventinfo/, and halt/ directories hold sources for the SUIF passes of the same names; these are the central programs in the Halt distribution.
The inst/, hsim/, directories hold miscellaneous analysis routines and the branch prediction simulator, respectively. The libpath/ directory builds a library, libpath.{a,so} that provides code for the event data structure and also holds the path profiling analysis routines.
The bbp/, scbp[123]/, select/ and super/ diretories contain the source files for optimization passes. The analyze_cache/ directory holds a stand-alone program that computes instruction offsets from the start of each procedure (assuming the current instruction ordering in the procedure), so that the analysis routines that simulate instruction caches know the offset to add to each procedure's starting address. The file scbp3/inst-predict.c holds the combined branch prediction and cache simulator used for SCBP research. The file super/inst-cycle.c holds the simulator that performs instruction and cycle counting, cache simulation, and emulation of non-excepting instructions.
Halt provides a flexible way to add flexible analysis routines to programs produced using the SUIF compiler.
Tony Dewitt built the first versions of the Alpha port, the live variable analysis, and support for loads and stores. Glenn Holloway has fixed a number of bugs.
1. Please contact the author if you are interested in beta-testing the applications that are not yet included in the distribution.
2. The name of the label annotation and the name of the analysis routine are not currently the same; pairings between annotation and function names are currently wired into the code for the halt pass. This might be fixed later if there is sufficient demand.
3. Some legacy x86 programs violate the second assumption, saving or using values beyond the top of the stack. This issue has not yet been handled in halt `s implementation.
4. The alternative would be to instrument both the fall-through and taken paths of the branch, and even this is not sufficient if two conditional branches jump to the same place.
5. History depth is roughly equivalent to the number of prior branch instructions that are recorded in a path; a history depth of 0 is equivalent to edge profiling. See my thesis [ See C. Young. "Path-based Compilation." Diss. Harvard University, Jan. 1998. ] for more formal definitions of history depth and much more information about path profiling.