Documentation: Reading disassembly output

The Common Lisp function DISASSEMBLE produces an implementation-specific description of the internal representation of a function. Its output is very useful when you want to improve the performance of your code, since it gives you an idea of what the machine will do.

Disassembly of byte-compiled code

For example, if you use CMUCL's byte-code compiler, the DISASSEMBLE function will show you the bytecodes generated for your function:

  CL-USER> (defun foo (x y)
              (logand (logior x y) (logxor x y)))
  CL-USER> (let ((c::*byte-compile* t)) (compile 'foo))
  Byte Compiling lambda (x y):
  CL-USER> (disassemble 'foo)
   0: 00    Entry point, frame-size=0
   1: 20    push-const #<FDEFINITION object for kernel:two-arg-and>
   2: 21    push-const #<FDEFINITION object for kernel:two-arg-ior>
   3: 11    push-arg 1
   4: 10    push-arg 0
   5: 8A    named-call, 2 args
   6: 22    push-const #<FDEFINITION object for kernel:two-arg-xor>
   7: 11    push-arg 1
   8: 10    push-arg 0
   9: 8A    named-call, 2 args
  10: 9A    named-tail-call, 2 args
  11: 00    push-local 0
  12: 00    push-local 0
  13: 00    push-local 0
  14: 00    push-local 0
  15: 00    push-local 0

This is fairly easy to interpret, once you know that the byte-code interpreter is a stack machine. The references of the internal TWO-ARG-AND and TWO-ARG-IOR functions are pushed onto the stack, the arguments Y then X are pushed onto the stack, the TWO-ARG-IOR call is executed (leaving the result on the stack), the reference of the function TWO-ARG-XOR is pushed onto the stack, the arguments are pushed again, the TWO-ARG-XOR is executed, and finally the TWO-ARG-AND is executed in a tail-call.

Disassembly of native-compiled code

You are more likely to be interested in disassembling functions that have been compiled to native code. Understanding disassembly of native code is easier if you understand what certain common sequences of instructions are doing.

The following function illustrates CMUCL's ability to use unboxed arithmetic on values of type (unsigned-byte 32) and (signed-byte 32). It can also do similar things with floating point values. This significantly improves performance by reducing consing, and by allowing the compiler to emit instructions that operate directly on machine word sized values.

This optimization is easiest to implement inside a single function. It is more difficult to apply when making a function call, since it is necessary to ensure that the called function is expecting to receive unboxed arguments, rather than the normal calling convention. CMUCL is able to perform this type of optimization across function call boundaries when functions are declared to be inline, with locally defined functions (using FLET or LABELS), and when block compilation is used. See the CMUCL User's Manual for more information on this.

  (defun foo (x y)
    (declare (optimize (speed 3) (space 0) (safety 0) (debug 0))
             (type (unsigned-byte 32) x y))
    (logand (logior x y) (logxor x y)))

  USER> (compile 'foo)
  Compiling LAMBDA (X Y):

  In: LAMBDA (X Y)
    #'(LAMBDA (X Y)
        (DECLARE (OPTIMIZE # # # #) (TYPE # X Y))
        (BLOCK FOO (LOGAND # #)))
  Note: Doing unsigned word to integer coercion (cost 20) to "<return value>".
  Compilation unit finished.
    1 note

The x86 dissassembly of the function follows below. The first line is the entry point of the function, at address 0x481E0770. This is followed by a standard function prologue, which checks the number of arguments and their types, and unboxes them (the SAR operations at labels L0 and L2). Starting at label L3 are the logical operations (OR, XOR, AND), which operate directly on the 32 bit values. Since we must box/tag the return value for functions that follow the normal calling convention, we test whether any of the upper 3 bits of the result are set (the TEST ECX, 3758096384 operation, where the magic number is just #xE0000000). If that's true, we create a boxed bignum representation (sequence at label L5). Otherwise we can just tag the result as a fixnum with a two-bit 0 tag (and a leading 0 sign bit, of course), using LEA EDX, [ECX*4].

481E0770:       .ENTRY FOO()                 ; FUNCTION
     788:       POP   DWORD PTR [EBP-8]
     78B:       LEA   ESP, [EBP-32]
     78E:       MOV   EAX, EDX
     790:       TEST  AL, 3
     792:       JEQ   L0
     794:       MOV   ECX, [EAX-3]
     797:       JMP   L1
     799: L0:   SAR   EAX, 2
     79C:       MOV   ECX, EAX
     79E: L1:   MOV   EAX, EDI
     7A0:       TEST  AL, 3
     7A2:       JEQ   L2
     7A4:       MOV   EAX, [EAX-3]
     7A7:       JMP   L3
     7A9: L2:   SAR   EAX, 2
     7AC: L3:   MOV   EDX, ECX               ; No-arg-parsing entry point
     7AE:       OR    EDX, EAX
     7B0:       MOV   EBX, ECX
     7B2:       XOR   EBX, EAX
     7B4:       MOV   ECX, EDX
     7B6:       AND   ECX, EBX
     7B8:       TEST  ECX, 3758096384
     7BE:       JNE   L5
     7C0:       LEA   EDX, [ECX*4]
     7C7: L4:   MOV   ECX, [EBP-8]
     7CA:       MOV   EAX, [EBP-4]
     7CD:       ADD   ECX, 2
     7D0:       MOV   ESP, EBP
     7D2:       MOV   EBP, EAX
     7D4:       JMP   ECX
     7D6:       NOP
     7D7:       NOP
     7D8: L5:   JNS   L6
     7DA:       MOV   EDX, 522
     7DF:       JMP   L7
     7E1: L6:   MOV   EDX, 266
     7E6: L7:   MOV   BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     7ED:       MOV   BYTE PTR [#x280001BC], 4 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     7F4:       MOV   EAX, 16
     7F9:       ADD   EAX, [#x8069724]       ; current_region_free_pointer
     7FF:       CMP   EAX, [#x80696F8]       ; current_region_end_addr
     805:       JBE   L8
     807:       CALL  #x80536F8              ; alloc_overflow_eax
     80C: L8:   XCHG  EAX, [#x8069724]       ; current_region_free_pointer
     812:       MOV   [EAX], EDX
     814:       LEA   EDX, [EAX+7]
     817:       MOV   [EDX-3], ECX
     81A:       MOV   BYTE PTR [#x280001BC], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-ATOMIC*
     821:       CMP   BYTE PTR [#x280001D4], 0 ; COMMON-LISP::*PSEUDO-ATOMIC-INTERRUPTED*
     828:       JEQ   L9
     82A:       BREAK 9                      ; Pending interrupt trap
     82C: L9:   JMP   L4

The compiler can generate fast calls to the no-arg-parsing entry point, that bypass the checking of the number of arguments and their types, assuming that code has been compiled with low safety. This is best explained in the section Block Compilation in the CMU User's Manual.

The code at label L4 is the normal function epilogue. This basically restores the frame information saved in the function prologue, which is the code before label L0. The prologue basically allocates some space on the stack and adjust the frame pointer accordingly.

The code at label L5 checks first whether the most-significant bit is set. If it is, we have to alloc two words for the bignum, since a full 32 bits, plust the now required sign bit will not fit inside a single machine word. If we only have 31 bits worth of integer, we can fit it and the leading sign bit inside a single machine word. Since we also need a word for the header, we thus allocate either 2 or 4 words of storage, (the heap is 8 byte aligned).

The code at labels L5 to L6 setting the header word needed for boxing up the result when the result is a bignum.. The header tag is computed in EDX which is either 20A or 10A depending on whether the most significant bit is set or not. The final result is a bignum and bignums use a twos-complement representation so two wrods are needed if the most significant bit of the result is 1. This is all denoted by the 0A part which says the result is a bignum. The other part indicates how many words are needed.

The code at labels L7 to L8 deals with allocating space for the the boxed result. if there's enough space between the current_region_free_pointer and current_region_end_addr to hold 16 more bytes, then the free pointer is incremented. Otherwise we need to call alloc_overflow_eax to add more space.

Finally, the code after L8 fills in the allocated space with the header word value and the actual 32-bit result.

While this example doesn't show it, CMUCL can handle unboxed values on the stack, so even when intermediate or argument values spill from registers to the stack, they can remain unboxed. On x86 platforms, this is handled using a conversative garbage collector. On other platforms, this is done by using separate unboxed stacks, on platforms that segregate unboxed and boxed values in stacks and the register file.

Acknowledgment: This description is based on a USENET article <m2fzxapiug.fsf@ibook.bln.pmsf.net> on comp.lang.lisp, dated 2002-08-20, by Pierre Mai.