39.1 The Debug-Info Structure

The Debug-Info structure directly represents information about the source code, and points to other structures that describe the layout of run-time data structures.

Make some sort of minimal debug-info format that would support at least the common cases of level 1 (since that is what we would release), and perhaps level 0. Actually, it seems it wouldn’t be hard to crunch nearly all of the debug-function structure and debug-info function map into a single byte-vector. We could have an uncrunch function that restored the current format. This would be used by the debugger, and also could be used by purify to delete parts of the debug-info even when the compiler dumps it in crunched form. [Note that this isn’t terribly important if purify is smart about debug-info...]

Compiled source map representation:

[### store in debug-function PC at which env is properly initialized, i.e. args (and return-pc, etc.) in internal locations. This is where a :function-start breakpoint would break.]

[### Note that we can easily cache the form-number => source-path or form-number => form translation using a vector indexed by form numbers that we build during a walk.]

Instead of using source paths in the debug-info, use “form numbers”. The form number of a form is the number of forms that we walk to reach that form when doing a pre-order walk of the source form. [Might want to use a post-order walk, as that would more closely approximate evaluation order.]

We probably want to continue using source-paths in the compiler, since they are quick to compute and to get you to a particular form. [### But actually, I guess we don’t have to precompute the source paths and annotate nodes with them: instead we could annotate the nodes with the actual original source form. Then if we wanted to find the location of that form, we could walk the root source form, looking for that original form. But we might still need to enter all the forms in a hashtable so that we can tell during IR1 conversion that a given form appeared in the original source.]

Note that form numbers have an interesting property: it is quite efficient to determine whether an arbitrary form is a subform of some other form, since B is a subform of A iff the form number of B is greater than A’s number and less than the number of A’s next sibling (or, when A has no next sibling, the number of the first form walked after A’s last subform).

This should be quite useful for doing the source=>pc mapping in the debugger, since that problem reduces to finding the subset of the known locations that are for subforms of the specified form.
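As a rough illustration of the numbering and the subform test, here is a minimal Python sketch. It assumes source forms are modelled as nested Python lists, that atoms are numbered along with lists (a real implementation might number only conses), and that the names `number_forms`, `is_subform` and `locations_within` are hypothetical. The table it builds doubles as the form-number => form vector mentioned above.

```python
def number_forms(root):
    """Pre-order walk of ROOT. Returns a table indexed by form number;
    each entry is [form, end], where END is the first form number that
    lies past the form's own subtree (the next sibling's number when
    there is one)."""
    table = []

    def walk(form):
        n = len(table)            # number of forms walked so far
        table.append([form, None])
        if isinstance(form, list):
            for sub in form:
                walk(sub)
        table[n][1] = len(table)  # first number outside this subtree

    walk(root)
    return table


def is_subform(table, a, b):
    """True iff form number B is a proper subform of form number A."""
    return a < b < table[a][1]


def locations_within(table, a, known_numbers):
    """The subset of recorded form numbers that fall inside form A."""
    return [n for n in known_numbers if is_subform(table, a, n)]


# (let ((x 1)) (foo x)) modelled as nested lists (assumed representation)
example = ["let", [["x", 1]], ["foo", "x"]]
table = number_forms(example)
```

With this example, the binding list `[["x", 1]]` gets number 2 and its next sibling `["foo", "x"]` gets number 6, so exactly the forms numbered 3 through 5 test as subforms of form 2.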

Assume a byte vector with a standard variable-length integer format, something like this:

    0..253 => the integer
    254 => read next two bytes for integer
    255 => read next four bytes for integer
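A minimal Python sketch of such a format follows. The notes do not fix a byte order for the multi-byte cases, so little-endian is an assumption here, as are the function names.

```python
def write_varint(out, n):
    """Append N to the bytearray OUT using the 0..253 / 254 / 255 scheme."""
    if n < 0 or n >= 1 << 32:
        raise ValueError("integer out of range for this format")
    if n <= 253:
        out.append(n)                   # the byte is the integer itself
    elif n < 1 << 16:
        out.append(254)                 # marker: two more bytes follow
        out += n.to_bytes(2, "little")  # byte order is an assumption
    else:
        out.append(255)                 # marker: four more bytes follow
        out += n.to_bytes(4, "little")


def read_varint(buf, pos):
    """Read one integer from BUF at POS. Returns (value, next position)."""
    tag = buf[pos]
    if tag <= 253:
        return tag, pos + 1
    width = 2 if tag == 254 else 4
    start = pos + 1
    return int.from_bytes(buf[start:start + width], "little"), start + width
```

Values up to 253 cost one byte, values below 65536 cost three, and anything larger costs five.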

Then a compiled debug block is just a sequence of variable-length integers in a particular order, something like this:

    number of successors
    ...offsets of each successor in the function's blocks vector...
    first PC
    [offset of first top-level form (in forms) (only if not component default)]
    form number of first source form
    first live mask (length in bytes determined by number of VARIABLES)
    ...more <PC, top-level form offset, form-number, live-set> tuples...

We determine the number of locations recorded in a block by finding the start of the next compiled debug block in the blocks vector.
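To make the layout concrete, here is a hedged Python sketch of a decoder for one block. It makes several simplifying assumptions that are not settled by the notes: the optional top-level form offset is always omitted (every location uses the component default), multi-byte integers are little-endian, and the caller supplies the start of the next block as `end`. The names `read_varint` and `decode_block` are hypothetical.

```python
def read_varint(buf, pos):
    """Variable-length integer reader (little-endian is an assumption)."""
    tag = buf[pos]
    if tag <= 253:
        return tag, pos + 1
    width = 2 if tag == 254 else 4
    start = pos + 1
    return int.from_bytes(buf[start:start + width], "little"), start + width


def decode_block(buf, start, end, n_vars):
    """Decode the compiled debug block occupying BUF[START:END].
    Returns (successor offsets, list of (pc, form_number, live_mask))."""
    pos = start
    n_succ, pos = read_varint(buf, pos)
    successors = []
    for _ in range(n_succ):
        offset, pos = read_varint(buf, pos)
        successors.append(offset)

    mask_len = (n_vars + 7) // 8   # one bit per variable, rounded up to bytes
    locations = []
    while pos < end:               # block ends where the next block begins
        pc, pos = read_varint(buf, pos)
        form_number, pos = read_varint(buf, pos)
        live = bytes(buf[pos:pos + mask_len])
        pos += mask_len
        locations.append((pc, form_number, live))
    return successors, locations


# Two successors, then two locations, for a function with 3 variables.
block = bytes([2, 1, 3,
               10, 4, 0b101,
               254, 0x2C, 0x01, 7, 0b010])
successors, locations = decode_block(block, 0, len(block), n_vars=3)
```

Supporting the optional top-level form offset would need some per-location indication of whether the field is present, which this sketch deliberately leaves out.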

[### Actually, only need 2 bits for number of successors {0,1,2}. We might want to use other bits in the first byte to indicate the kind of location.]

[### We could support local packing by having a general concept of “alternate locations” instead of just regular and save locations. The location would have a bit indicating that there are alternate locations, in which case we read the number of alternate locations and then that many more SC-OFFSETs. In the debug-block, we would have a second bit mask with bits set for TNs that are in an alternate location. We then read a number for each such TN, with the value being interpreted as an index into the Location’s alternate locations.]

It looks like using structures for the compiled-location-info is too bulky. Instead we need some packed binary representation.

First, let’s represent an SC/offset pair with an “SC-Offset”, which is an integer with the SC in the low 5 bits and the offset in the remaining bits:

    ----------------------------------------------------
    | Offset (as many bits as necessary) | SC (5 bits) |
    ----------------------------------------------------

Probably the result should be constrained to fit in a fixnum, since it will be more efficient and gives more than enough possible offsets.
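The packing itself is a shift and a mask. A small Python sketch, with hypothetical function names:

```python
SC_BITS = 5                      # the SC occupies the low 5 bits
SC_MASK = (1 << SC_BITS) - 1


def make_sc_offset(sc, offset):
    """Pack an SC number and an offset into a single integer."""
    if not 0 <= sc <= SC_MASK:
        raise ValueError("SC does not fit in 5 bits")
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return (offset << SC_BITS) | sc


def sc_offset_sc(sc_offset):
    """Extract the SC number from the low 5 bits."""
    return sc_offset & SC_MASK


def sc_offset_offset(sc_offset):
    """Extract the offset from the remaining high bits."""
    return sc_offset >> SC_BITS


packed = make_sc_offset(3, 10)   # (10 << 5) | 3
```

Python integers are unbounded, so the fixnum constraint is not modelled here; an implementation would reject offsets that push the result past the fixnum range.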

We can then represent a compiled location like this:

    single byte of boolean flags:
        uninterned name
        packaged name
        environment-live
        has distinct save location
        has ID (name not unique in this fun)
    name length in bytes (as var-length integer)
    ...name bytes...
    [if packaged, var-length integer that is package name length
     ...package name bytes...]
    [If has ID, ID as var-length integer]
    SC-Offset of primary location (as var-length integer)
    [If has save SC, SC-Offset of save location (as var-length integer)]
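The following Python sketch encodes and decodes one location in this layout. The bit assigned to each flag, the UTF-8 encoding of names, the little-endian byte order, and all function names are assumptions made for illustration; the notes only fix the order of the fields.

```python
# Flag bit assignments are assumptions; the notes only list the flags.
UNINTERNED, PACKAGED, ENV_LIVE, HAS_SAVE, HAS_ID = 1, 2, 4, 8, 16


def put_varint(out, n):
    """Append N in the 0..253 / 254 / 255 variable-length format."""
    if n <= 253:
        out.append(n)
    elif n < 1 << 16:
        out.append(254)
        out += n.to_bytes(2, "little")
    else:
        out.append(255)
        out += n.to_bytes(4, "little")


def get_varint(buf, pos):
    """Read one variable-length integer. Returns (value, next position)."""
    tag = buf[pos]
    if tag <= 253:
        return tag, pos + 1
    width = 2 if tag == 254 else 4
    return int.from_bytes(buf[pos + 1:pos + 1 + width], "little"), pos + 1 + width


def encode_location(name, sc_offset, package=None, uninterned=False,
                    env_live=False, save_sc_offset=None, ident=None):
    flags = ((UNINTERNED if uninterned else 0)
             | (PACKAGED if package is not None else 0)
             | (ENV_LIVE if env_live else 0)
             | (HAS_SAVE if save_sc_offset is not None else 0)
             | (HAS_ID if ident is not None else 0))
    out = bytearray([flags])
    for text in (name, package):
        if text is not None:           # package bytes only if packaged
            raw = text.encode("utf-8")
            put_varint(out, len(raw))
            out += raw
    if ident is not None:
        put_varint(out, ident)
    put_varint(out, sc_offset)
    if save_sc_offset is not None:
        put_varint(out, save_sc_offset)
    return bytes(out)


def decode_location(buf, pos=0):
    """Inverse of encode_location. Returns (dict of fields, next position)."""
    flags = buf[pos]
    pos += 1
    n, pos = get_varint(buf, pos)
    name = bytes(buf[pos:pos + n]).decode("utf-8")
    pos += n
    package = None
    if flags & PACKAGED:
        n, pos = get_varint(buf, pos)
        package = bytes(buf[pos:pos + n]).decode("utf-8")
        pos += n
    ident = save = None
    if flags & HAS_ID:
        ident, pos = get_varint(buf, pos)
    sc_offset, pos = get_varint(buf, pos)
    if flags & HAS_SAVE:
        save, pos = get_varint(buf, pos)
    return {"name": name, "package": package, "id": ident,
            "sc_offset": sc_offset, "save_sc_offset": save,
            "uninterned": bool(flags & UNINTERNED),
            "env_live": bool(flags & ENV_LIVE)}, pos
```

A variable named `X` with an ID, no package and no save location takes seven bytes in this sketch, most of which is the three-byte SC-Offset.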

But for a whizzy breakpoint facility, we would need a good source=>code map. Dumping a complete code=>source map might be as good a way as any to represent this, due to the one-to-many relationship between source and code locations.

We might be able to get away with just storing the source locations for the beginnings of blocks and maintaining a mapping from code ranges to blocks. This would be fine both for the profiler and for the “where am I running now” indication. Users might also be convinced that it was most interesting to break at block starts, but I don’t really know how easily people could develop an understanding of basic blocks.

It could also be a bit tricky to map an arbitrary user-designated source location to some “closest” source location actually in the debug info. This problem probably exists to some degree even with a full source map, since some forms will never appear as the source of any node. It seems you might have to negotiate with the user. The user would select something with the mouse, and then you would highlight some source form whose path shares a common prefix with the user’s path (i.e. one is a prefix of the other). If the user isn’t happy with the result, they could try something else. In some cases, the designated path might be a prefix of several paths. This ambiguity might be resolved by picking the shortest path or letting the user choose.

At the primitive level, I guess what this means is that the structure of source locations (i.e. source paths) must be known, and the source=>code operation should return a list of <source,code> pairs, rather than just a list of code locations. This allows the debugger to resolve the ambiguity however it wants.

I guess the formal definition of which source paths we would return is:

All source paths in the debug info that have a maximal common prefix with the specified path. i.e. if several paths have the complete specified path as a prefix, we return them all. Otherwise, all paths with an equally large common prefix are returned: if the path with the most in common matches only the first three elements, then we return all paths that match in the first three elements. As a degenerate case (which probably shouldn’t happen), if there is no path with anything in common, then we return *all* of the paths.
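This rule is short enough to sketch directly. The Python below assumes source paths are tuples of subform indices ordered from the root form inward, and that the debug info is available as a list of <source path, code location> pairs; both the representation and the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading elements that paths A and B share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def closest_locations(pairs, target):
    """Return every (path, code) pair whose path has a maximal common
    prefix with TARGET. If nothing has anything in common, the best
    length is 0 and all pairs are returned."""
    if not pairs:
        return []
    best = max(common_prefix_len(path, target) for path, _ in pairs)
    return [(path, code) for path, code in pairs
            if common_prefix_len(path, target) == best]


# Hypothetical debug info: (source path, PC) pairs.
known = [((1, 2, 3), 100), ((1, 2, 4), 104), ((1, 5), 120), ((2,), 130)]
```

Asking for `(1, 2)` returns both paths that extend it, asking for `(1, 2, 3, 0)` returns only the `(1, 2, 3)` entry, and asking for `(9,)` falls into the degenerate case and returns everything. Because the result keeps the source path alongside each code location, the debugger is free to resolve any remaining ambiguity itself.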

In the DEBUG-SOURCE structure we may ultimately want a vector of the start positions of each source form, since that would make it easier for the debugger to locate the source. It could just open the file, FILE-POSITION to the form, do a READ, then loop down the source path. Of course, it could read each form starting from the beginning, but that might be too slow.

Do XEPs really need Debug-Functions? The only time that we will commonly end up in the debugger on an XEP is when an argument type check fails. But I suppose it would be nice to be able to print the arguments passed...

Note that assembler-level code motion such as pipeline reorganization can cause problems with our PC maps. The assembler needs to know that debug info markers are different from real labels anyway, so I suppose it could inhibit motion across debug markers conditional on policy. It does not seem worthwhile to remember the node for each individual instruction.

For tracing block-compiled calls:

    Info about return value passing locations?
    Info about where all the returns are?

We definitely need the return-value passing locations for debug-return. The question is what the interface should be. We don’t really want to have a visible debug-function-return-locations operation, since there are various value passing conventions, and we want to paper over the differences.

There should probably be a compiler option to initialize the stack frame to a special uninitialized object (some random immediate type). This would aid debugging, and would also help with GC problems. For the latter reason especially, this should be locally-turn-onable (off of policy? the new debug-info quality?).

What about the interface between the evaluator and the debugger? (i.e. what happens on an error, etc.) Compiler error handling should be integrated with run-time error handling. Ideally the error messages should look the same. Practically, in some cases the run-time errors will have less information. But the error should look the same to the debugger (or at least similar).