

21 Packing

File: pack

#|

Add lifetime/pack support for pre-packed save TNs.

Fix GTN/VMR conversion to use pre-packed save TNs for old-cont and return-PC. (Will this prevent the preference from the passing location to the save location from ever being honored?)

We will need to make packing of passing locations smarter before we will be able to target the passing location on the stack in a tail call (when that is where the callee wants it.) Currently, we will almost always pack the passing location in a register without considering whether that is really a good idea. Maybe we should consider schemes that explicitly understand the parallel assignment semantics, and try to do the assignment with a minimum number of temporaries. We only need assignment temps for TNs that appear both as an actual argument value and as a formal parameter of the called function. This only happens in self-recursive functions.
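
As an illustration of that last claim, here is a minimal sketch (plain symbols stand in for TNs, and the helper name is hypothetical) of picking out the values that need assignment temporaries:

     ;; Return the actual argument values that may need temporaries before the
     ;; parallel assignment FORMALS <- ACTUALS is done: those that are
     ;; themselves formal parameters about to be overwritten.
     (defun args-needing-temps (actuals formals)
       (remove-if-not (lambda (actual) (member actual formals)) actuals))

     ;; In a self-recursive call (F B A 1) to a function with formals (A B C),
     ;; only B and A need temporaries; the constant 1 can be assigned directly:
     ;; (args-needing-temps '(b a 1) '(a b c))  =>  (B A)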

Could be a problem with lifetime analysis, though. The write by a move-arg VOP would look like a write in the current env, when it really isn’t. If this is a problem, then we might want to make the result TN be an info arg rather than a real operand. But this would only be a problem in recursive calls, anyway. [This would prevent targeting, but targeting across passing locations rarely seems to work anyway.] [### But the :ENVIRONMENT TN mechanism would get confused. Maybe put env explicitly in TN, and have it only always-live in that env, and normal in other envs (or blocks it is written in.) This would allow targeting into environment TNs.

I guess we would also want the env/PC save TNs normal in the return block so that we can target them. We could do this by considering env TNs normal in read blocks with no successors.

ENV TNs would be treated totally normally in non-env blocks, so we don’t have to worry about lifetime analysis getting confused by variable initializations. Do some kind of TN costing to determine when it is more trouble than it is worth to allocate TNs in registers.

Change pack ordering to be less pessimal. Pack TNs as they are seen in the LTN map in DFO, which at least in non-block compilations has an effect something like packing main trace TNs first, since control analysis tries to put the good code first. This could also reduce spilling, since it makes it less likely we will clog all registers with global TNs.
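
A sketch of that ordering; the list of blocks in DFO and the function from a block to the TNs in its LTN map are assumed inputs, not the real interfaces:

     ;; Collect TNs in the order they are first seen in the LTN maps, walking
     ;; the blocks in depth-first order.  BLOCK-LTN-MAP is an assumed function
     ;; from a block to a sequence of the TNs in that block's LTN map.
     (defun tns-in-ltn-dfo-order (blocks-in-dfo block-ltn-map)
       (let ((seen (make-hash-table :test #'eq))
             (result ()))
         (dolist (block blocks-in-dfo (nreverse result))
           (map nil (lambda (tn)
                      (unless (gethash tn seen)
                        (setf (gethash tn seen) t)
                        (push tn result)))
                (funcall block-ltn-map block)))))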

If we pack a TN with a specified save location on the stack, pack in the specified location.

Allow old-cont and return-pc to be kept in registers by adding a new “keep around” kind of TN. These are kind of like environment live, but are only always-live in blocks that they weren’t referenced in. Lifetime analysis does a post-pass adding always-live conflicts for each “keep around” TN to those blocks with no conflict for that TN. The distinction between always-live and keep-around allows us to successfully target old-cont and return-pc to passing locations. MAKE-KEEP-AROUND-TN (ptype), PRE-PACK-SAVE-TN (tn scn offset). Environment needs a KEEP-AROUND-TNS slot so that conflict analysis can find them (no special casing is needed after that; they can be made with :NORMAL kind). VMR-component needs PRE-PACKED-SAVE-TNS so that conflict analysis or somebody can copy conflict info from the saved TN.
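
A sketch of the proposed post-pass; the structures here are illustrative stand-ins for the real conflict data structures, not their actual definitions:

     ;; Illustrative stand-ins for the real conflict structures.
     (defstruct global-conflict
       tn      ; the TN this conflict entry is for
       kind)   ; :always-live, :read, :write, ...

     (defstruct ir2-block
       (global-conflicts ()))   ; list of GLOBAL-CONFLICT entries for the block

     ;; For each keep-around TN, add an always-live conflict to every block
     ;; that has no conflict entry at all for that TN.
     (defun add-keep-around-conflicts (blocks keep-around-tns)
       (dolist (block blocks)
         (dolist (tn keep-around-tns)
           (unless (find tn (ir2-block-global-conflicts block)
                         :key #'global-conflict-tn)
             (push (make-global-conflict :tn tn :kind :always-live)
                   (ir2-block-global-conflicts block))))))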

Note that having block granularity in the conflict information doesn’t mean that a localized packing scheme would have to do all moves at block boundaries (which would clash with the desire to have saving done as part of this mechanism.) All that it means is that if we want to do a move within the block, we would need to allocate both locations throughout that block (or something).

Load TN pack:

A location is out for load TN packing if:

– The location has a TN live in it after the VOP for a result, or before the VOP for an argument, or

– The location is used earlier in the TN-Ref list (after) the saved result’s ref or later in the TN-Ref list (before) the loaded argument’s ref.

To pack load TNs, we advance the live-tns to the interesting VOP, then repeatedly scan the vop-refs to find vop-local conflicts for each needed load TN. We insert move VOPs and change over the TN-Ref-TNs as we go so the TN-Refs will reflect conflicts with already packed load-TNs.

If we fail to pack a load-TN in the desired SC, then we scan the Live-TNs for the SB, looking for a TN that can be packed in an unbounded SB. This TN must then be repacked in the unbounded SB. It is important that load-TNs are never packed in unbounded SBs, since that would invalidate the conflicts info, preventing us from repacking TNs in unbounded SBs. We can’t repack in a finite SB, since there might have been load TNs packed in that SB which aren’t represented in the original conflict structures.

Is it permissible to “restrict” an operand to an unbounded SC? It is not impossible to satisfy as long as a finite SC is also allowed, but in practice no restriction at all would probably work just as well.

We assume all locations can be used when an SC is based on an unbounded SB.

]

TN-Refs are convenient structures to build the target graph out of. If we allocated space in every TN-Ref, then there would certainly be enough to represent arbitrary target graphs. Would it be enough to allocate a single Target slot? If there is a target path through a given VOP, then the Target of the write ref would be the read, and vice-versa. To find all the TNs that target us, we look at the TN for the target of all our write refs.

We separately chain together the read refs and the write refs for a TN, allowing easy determination of things such as whether a TN has only a single definition or has no reads. It would also allow easier traversal of the target graph.
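
A sketch of what such a TN-Ref and TN might look like, with a single Target slot and separate read/write chains; the slot names are assumptions for illustration:

     ;; Illustrative TN-Ref and TN layouts; slot names are assumptions.
     (defstruct tn-ref
       tn         ; the TN being referenced
       write-p    ; true for a write (result), false for a read (argument)
       target     ; the TN-Ref on the other side of the VOP, when targeted
       next-ref)  ; next ref in this TN's read chain or write chain

     (defstruct tn
       reads      ; head of the chain of read TN-Refs
       writes)    ; head of the chain of write TN-Refs

     ;; The TNs that target TN are found by looking at the TN of the target
     ;; of each of TN's write refs.
     (defun tns-targeting (tn)
       (let ((result ()))
         (do ((ref (tn-writes tn) (tn-ref-next-ref ref)))
             ((null ref) result)
           (let ((target (tn-ref-target ref)))
             (when target
               (pushnew (tn-ref-tn target) result))))))

With the refs chained this way, a TN with a single definition is simply one whose write chain contains exactly one ref.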

Represent per-location conflicts as vectors, indexed by block number, of per-block conflict info. To test whether a TN conflicts on a location, we would then have to iterate over the TN’s global-conflicts, using the block number and LTN number to check for a conflict in that block. But since most TNs are local, this test actually isn’t much more expensive than indexing into a bit-vector by GTN numbers.

The big win of this scheme is that it is much cheaper to add conflicts into the conflict set for a location, since we never need to actually compute the conflict set in a list-like representation (which requires iterating over the LTN conflicts vectors and unioning in the always-live TNs). Instead, we just iterate over the global-conflicts for the TN, using BIT-IOR to combine the conflict set with the bit-vector for that block in that location, or marking that block/location combination as being always-live if the conflict is always-live.

Generating the conflict set is inherently more costly, since although we believe the conflict set size to be roughly constant, it can easily contain tens of elements. We would have to generate these moderately large lists for all TNs, including local TNs. In contrast, the proposed scheme does work proportional to the number of blocks the TN is live in, which is small on average (1 for local TNs). This win exists independently of the win of not having to iterate over LTN conflict vectors.
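
A sketch of this representation and of the two operations described above (merging a TN’s conflicts into a location, and testing one block/LTN pair); the names and the fixed LTN vector size are assumptions:

     (defconstant +ltn-limit+ 64)  ; assumed fixed size of the LTN conflict vectors

     ;; Per-location conflicts: a vector indexed by block number whose entries
     ;; are either a bit-vector indexed by LTN number or :ALWAYS-LIVE.
     (defun make-location-conflicts (n-blocks)
       (let ((vec (make-array n-blocks)))
         (dotimes (i n-blocks vec)
           (setf (aref vec i)
                 (make-array +ltn-limit+ :element-type 'bit :initial-element 0)))))

     ;; Merge a TN's conflicts into a location.  TN-GLOBAL-CONFLICTS is assumed
     ;; to be a list of (block-number . info) pairs, where INFO is either the
     ;; TN's LTN conflicts bit-vector for that block or :ALWAYS-LIVE.
     (defun add-tn-conflicts (location-conflicts tn-global-conflicts)
       (dolist (conflict tn-global-conflicts location-conflicts)
         (destructuring-bind (block-number . info) conflict
           (let ((entry (aref location-conflicts block-number)))
             (cond ((eq info :always-live)
                    (setf (aref location-conflicts block-number) :always-live))
                   ((not (eq entry :always-live))
                    (bit-ior entry info entry)))))))

     ;; Test whether a TN that is live in BLOCK-NUMBER with local TN number
     ;; LTN-NUMBER conflicts with what is already packed in this location.
     ;; The caller iterates this over the TN's global-conflicts.
     (defun location-conflicts-p (location-conflicts block-number ltn-number)
       (let ((entry (aref location-conflicts block-number)))
         (or (eq entry :always-live)
             (plusp (sbit entry ltn-number)))))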

[### Note that since we never do bitwise iteration over the LTN conflict vectors, part of the motivation for keeping these a small fixed size has been removed. But it would still be useful to keep the size fixed so that we can easily recycle the bit-vectors, and so that we could potentially have maximally tense special primitives for doing clear and bit-ior on these vectors.]

This scheme is somewhat more space-intensive than having a per-location bit-vector. Each vector entry would be something like 150 bits rather than one bit, but this is mitigated by the number of blocks being 5-10x smaller than the number of TNs. This seems like an acceptable overhead, a small fraction of the total VMR representation.

The space overhead could also be reduced by using something equivalent to a two-dimensional bit array, indexed first by LTN numbers, and then block numbers (instead of using a simple-vector of separate bit-vectors.) This would eliminate space wastage due to bit-vector overheads, which might be 50% or more, and would also make efficient zeroing of the vectors more straightforward. We would then want efficient operations for OR’ing LTN conflict vectors with rows in the array.

This representation also opens a whole new range of allocation algorithms: ones that allocate TNs in different locations within different portions of the program. This is because we can now represent a location being used to hold a certain TN within an arbitrary subset of the blocks the TN is referenced in.

Pack goals:

Pack should:

Subject to resource constraints:

– Minimize use costs

  – “Register allocation”: Allocate as many values as possible in scarce “good” locations, attempting to minimize the aggregate use cost for the entire program.

  – “Save optimization”: Don’t allocate values in registers when the save/restore costs exceed the expected gain for keeping the value in a register. (Similar to “opening costs” in RAOC.) [Really just a case of representation selection.]

– Minimize preference costs: Eliminate as many moves as possible.

“Register allocation” is basically an attempt to eliminate moves between registers and memory. “Save optimization” counterbalances “register allocation” to prevent it from becoming a pessimization, since saves can introduce register/memory moves.

Preference optimization reduces the number of moves within an SC. Doing a good job of honoring preferences is important to the success of the compiler, since we have assumed in many places that moves will usually be optimized away.

The scarcity-oriented aspect of “register allocation” is handled by a greedy algorithm in pack. We try to pack the “most important” TNs first, under the theory that earlier packing is more likely to succeed due to fewer constraints.

The drawback of greedy algorithms is their inability to look ahead. Packing a TN may mess up later “register allocation” by precluding packing of TNs that are individually “less important,” but more important in aggregate. Packing a TN may also prevent preferences from being honored.
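
A schematic of the greedy loop, with the cost, conflict, and bookkeeping machinery abstracted into assumed helper functions:

     ;; Schematic greedy packing.  TN-COST, SC-LOCATIONS, CONFLICTS-P and
     ;; NOTE-CONFLICTS are assumed helper functions standing in for the real
     ;; cost and conflict machinery.  Returns the TNs that could not be packed.
     (defun greedy-pack (tns tn-cost sc-locations conflicts-p note-conflicts)
       (let ((failed ()))
         (dolist (tn (sort (copy-list tns) #'> :key tn-cost) (nreverse failed))
           (let ((location (find-if (lambda (loc)
                                      (not (funcall conflicts-p tn loc)))
                                    (funcall sc-locations tn))))
             (if location
                 ;; Record the TN's conflicts in the chosen location so that
                 ;; later, less important TNs see them.
                 (funcall note-conflicts tn location)
                 (push tn failed))))))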

Initial packing:

Pack all TNs restricted to a finite SC first, before packing any other TNs.

One might suppose that Pack would have to treat TNs in different environments differently, but this is not the case. Pack simply assigns TNs to locations so that no two conflicting TNs are in the same location. In the process of implementing call semantics in conflict analysis, we cause TNs in different environments not to conflict. In the case of passing TNs, cross environment conflicts do exist, but this reflects reality, since the passing TNs are live in both the caller and the callee. Environment semantics has already been implemented at this point.

This means that Pack can pack all TNs simultaneously, using one data structure to represent the conflicts for each location. So we have only one conflict set per SB location, rather than separating this information by environment.

Load TN packing:

We create load TNs as needed in a post-pass to the initial packing. After TNs are packed, it may be that some references to a TN will require it to be in a SC other than the one it was packed in. We create load-TNs and pack them on the fly during this post-pass.

What we do is have an optional SC restriction associated with TN-refs. If we pack the TN in an SC which is different from the required SC for the reference, then we create a load TN for each such reference, and pack it into the required SC.

In many cases we will be able to pack the load TN with no hassle, but in general we may need to spill a TN that has already been packed. We choose a TN that isn’t in use by the offending VOP, and then spill that TN onto the stack for the duration of that VOP. If the VOP is a conditional, then we must insert a new block interposed before the branch target so that the TN value is restored regardless of which branch is taken.

Instead of remembering lifetime information from conflict analysis, we rederive it. We scan each block backward while keeping track of which locations have live TNs in them. When we find a reference that needs a load TN packed, we try to pack it in an unused location. If we can’t, we unpack the currently live TN with the lowest cost and force it into an unbounded SC.
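
A sketch of the location-choice step of that scan; LIVE is assumed to be a hash-table from location to the TN currently live there, and TN-COST and UNPACK stand in for the real cost and unpacking machinery:

     ;; Choose a location for a load TN during the backward scan.  Prefer a
     ;; location with nothing currently live in it; otherwise unpack the live
     ;; TN with the lowest cost (forcing it into an unbounded SC) and take its
     ;; location.
     (defun choose-load-tn-location (candidate-locations live tn-cost unpack)
       (or (find-if (lambda (loc) (null (gethash loc live)))
                    candidate-locations)
           (let* ((victim-loc
                    (first (sort (copy-list candidate-locations) #'<
                                 :key (lambda (loc)
                                        (funcall tn-cost (gethash loc live))))))
                  (victim (gethash victim-loc live)))
             (funcall unpack victim)   ; repack VICTIM in an unbounded SC
             (remhash victim-loc live)
             victim-loc)))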

The per-location and per-TN conflict information used by pack doesn’t need to be updated when we pack a load TN, since we are done using those data structures.

We also don’t need to create any TN-Refs for load TNs. [??? How do we keep track of load-tn lifetimes? It isn’t really that hard, I guess. We just remember which load TNs we created at each VOP, killing them when we pass the loading (or saving) step. This suggests we could flush the Refs thread if we were willing to sacrifice some flexibility in explicit temporary lifetimes. Flushing the Refs would make creating the VMR representation easier.]

The lifetime analysis done during load-TN packing doubles as a consistency check. If we see a read of a TN packed in a location which has a different TN currently live, then there is a packing bug. If any of the TNs recorded as being live at the block beginning are packed in a scarce SB, but aren’t currently in that location, then we also have a problem.

The conflict structure for load TNs is fairly simple: the load TNs for arguments and results all conflict with each other, and don’t conflict with much else. We just try packing in targeted locations before trying at random.

