The First Lisp Compiler

By Asher Olsen

Contents

  1. Introduction
  2. The source
  3. Fundamentals of the HLC
  4. Using the HLC
  5. Pass one
    1. Tail call elimination
    2. Special and common variables
    3. Other transformations
  6. Pass two
    1. The listing
    2. Locating values
    3. Function prologues and epilogues and linkage
    4. Compiling an expression
      1. Compiling atoms and variables
      2. Compiling assignments
      3. Compiling function calls
      4. Compiling prog forms
      5. Compiling conditionals

Introduction

This page examines the earliest surviving Lisp compiler, written by Timothy Hart and Michael Levin around 1961. To avoid repetition of phrases like “the compiler”, we will use the abbreviation “HLC” (for “Hart–Levin Compiler”) to refer to it. The HLC was introduced in M.I.T. AIM-039 and documented more fully in the LISP 1.5 Programmer's Manual by McCarthy et al.

For information about the timeline of Lisp compilers, I refer you to the question “What was the first Lisp compiler” on the Retrocomputing StackExchange site; the answer there gives an excellent overview. We will be focusing on implementation details. It should be mentioned, however, that although the title of this page is “The First Lisp Compiler”, the HLC is really the second Lisp compiler. A more correct title would be “The First Good Lisp Compiler” or “The First Lisp Compiler Whose Source Code is Available”.

In order to understand the HLC, a decent amount of prerequisite knowledge is required, about the machine LISP 1.5 ran on and about LISP 1.5 itself. If you aren't already familiar, I recommend reading the LISP 1.5 Programmer's Manual linked above and checking out Jack Harper's site on the IBM 7090. You might also be interested in a GitHub issue I wrote a little while ago.

The source

The source code of the HLC comes from an archive of CTSS source code, namely lines 278–592 of file ctss/lispset.job. Reading the code in its original form is quite difficult, since there is virtually no indentation or meaningful spacing. Occasionally atoms will start on one line/card and end on the next! And of course all-uppercase text is not known for being especially readable. Therefore I have prepared a formatted and cross-referenced version of the code for easy viewing.

Even though the code looks pretty, I still find it hard to follow. All destructuring is done with compositions of car and cdr. (I'm not criticizing the authors here, since it's not like LISP 1.5 gave them anything better to work with. But it feels like a step down when compared to Common Lisp, where we have fancy lambda lists and destructuring-bind.) An additional obscurity lies in the choices of function and variable names. For example, the name pi2 provides no clue as to its purpose (which is to construct a setq form). I think this is an assembly language convention.

To run the code, you could try simulating the LISP 1.5 system, or you can use my translation to Common Lisp. Note: This translation does the bare minimum to get it running. I make no attempt to use any high-level Common Lisp constructs; the aim was to leave the code as unchanged as possible except to add indentation. To compile the compiler, uncomment the lines at the bottom.

The source code of an earlier version of the HLC also exists, in a document titled “LISP LIBRARY NOVEMBER 1963”.

Fundamentals of the HLC

Here is how the implementers describe the HLC (from AIM-039):

… It is written entirely in LISP and is the first compiler that has ever compiled itself by being executed interpretively.

Another early effort that might also claim this honor is Donald Knuth's RUNCIBLE; Knuth described an algorithm to translate RUNCIBLE expressions into assembly language, and implemented it in RUNCIBLE itself.

The purpose of the LISP compiler is to replace S-expression definitions of functions with efficient machine language subroutines. A subroutine can be expected to run about 40 times as fast as the interpreter can execute the same function from its S-expression definition. Subroutines typically take 70–80 per cent of the storage required by their corresponding S-expressions.

Such a drastic speed-up was possible because the LISP 1.5 interpreter was not especially efficient.

The HLC was notable in several respects, as the rest of this page should make clear.

Using the HLC

The main entry point to the HLC is the compile function, which takes a list of function names and compiles their definitions. In LISP 1.5, function definitions were stored on the property list of the function's name, under one of the indicators EXPR (for interpreted definitions), FEXPR (interpreted and accepting a variable number of unevaluated arguments), SUBR (implemented in machine code), or FSUBR (implemented in machine code and accepting a variable number of unevaluated arguments). The HLC replaces an EXPR or FEXPR property with a SUBR or FSUBR property.

Code acceptable to the interpreter is not necessarily acceptable to the HLC. For managing the values of variables, the interpreter used a simple association list; it had no concept of scope. On the other hand, compiled code avoids manipulation of an environment by storing values on the push-down list (AKA the stack). Thus the HLC effectively supported lexical scoping. The HLC still allowed variables with indefinite scope, but they had to be explicitly declared as such by using the special function. If a special variable was also meant to be shared between compiled and interpreted functions, it had to be declared in a different way, with the common function. (Declarations could be reversed by the unspecial and uncommon functions.) We will see later how the special/common variable mechanism worked internally.
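
In practice, declaring variables and compiling were separate steps. The following sketch shows roughly what that looked like at the evalquote top level; the function and variable names here are made up for illustration.

SPECIAL ((LIMIT))            ; LIMIT may be referred to freely by compiled code
COMMON ((RESULTS))           ; RESULTS is also shared with interpreted code
COMPILE ((COUNTDOWN TALLY))  ; replace the EXPR definitions of these functions with SUBRs
UNSPECIAL ((LIMIT))          ; declarations can be undone afterwards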

The HLC generated code in the form of a symbolic listing and handed it to lap, the Lisp Assembly Program, which converted the listing into machine code and “installed” it in the computer's memory. The HLC also printed out the listing.

Pass one

Pass one of the HLC is encapsulated within the function passone. The goal of the first pass is to transform a Lisp expression into another Lisp expression that is potentially easier for the compiler to deal with. In some cases, the output of passone is not a valid Lisp expression, but rather a kind of annotation for the second pass to act on.

The main “worker” in pass one is paform, which is given any form that is to be evaluated. For instance, in a normal function call paform is mapped over all the arguments. Caution must be exercised with operators that don't follow ordinary evaluation rules: It would be quite incorrect to hand paform the subforms of a cond expression.

Tail call elimination

The first transformation applied converts a specific case of recursion into iteration, i.e., it performs (limited) tail call elimination. This is done by the progiter subroutine. Suppose that we are defining a function named f. Then progiter checks to see whether f's definition is of the form

(cond ((predicate) (f …))
      …).

In other words, it looks for a cond expression that has at least one clause whose consequent expression is a recursive call to the function being defined. If such an expression is found, the progiter1 function is called upon to do the actual transformation.

Four arguments are given to progiter1:

  1. g1, a fresh, uninterned symbol created by gensym. This will be used as a prog tag.
  2. g2, ditto.
  3. vs, the parameters of the function being defined.
  4. gs, a list of fresh symbols, one for each parameter.

The goal is to create a prog form, replacing recursive calls with go statements and replacing other cond clauses with return statements. Here is what we might get when member is run through progiter (the generated symbols use Common Lisp syntax):

(λ (u v)
  (prog (#:g480 #:g481)
   #:g478 (cond ((null v) (return nil))
                ((equal (car v) u) (return t))
                (t (go #:g482)))
   #:g482 (setq #:g480 u)
          (setq #:g481 (cdr v))
          (go #:g479)
   #:g479 (setq u #:g480)
          (setq v #:g481)
          (go #:g478))).
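
For reference, the input definition of member here is presumably the familiar recursive one (reconstructed from the output above, not copied from the source):

(λ (u v)
  (cond ((null v) nil)
        ((equal (car v) u) t)
        (t (member u (cdr v)))))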

I confess that I am unsure why the tag #:g479 (and the go thereto) is necessary. However, the extra variables for the parameters are essential to avoid problems with indefinite scope. For example, let's say that we have a function g defined as follows.

(λ (x y)
  (cond ((eq (car y) x) nil)
        (t (g (h x) (cdr y)))))

If y is declared special, and h modifies y, then the obvious expansion that directly sets x and y would be incorrect, since y would be changed before (cdr y) is evaluated. (In Common Lisp, psetq or psetf could be used to achieve the same effect.)
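
For comparison, here is a hand-written Common Lisp loop for the same function that uses psetq to get the required parallel assignment. This is only an illustration of the idea, not what the HLC produces, and it assumes h is defined elsewhere.

(defun g (x y)
  (prog ()
   loop (cond ((eq (car y) x) (return nil)))
        (psetq x (h x)      ; both new values are computed
               y (cdr y))   ; before either variable is assigned
        (go loop)))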

Special and common variables

To make the job of the second pass easier, pass one determines and marks all special and common variables. Specifically, the following changes are made:

  • A reference to a special variable x will be turned into (special x). Thus the form (setq x y), where x and y have been declared special, becomes (setq (special x) (special y)). Note that this code is not valid LISP 1.5; the compiler will deal with it “specially”.

  • A reference to a common variable c will be turned into the expression (eval (quote c) $alist). The effect is to invoke the interpreter to compute the value of c. Of course, this won't work with setq, so…

  • A setq of a common variable c becomes a call to setc, with the variable's name quoted, as in (setc (quote c) whatever). This setc function is obscure; it is not defined in the source code of the M.I.T. interpreter. It is not to be confused with the cset function, which is well documented. There is a definition of setc in the reference manual for a LISP 1.5 interpreter on the IBM 360, but it is not compatible with the setc here.

  • Any form that creates a binding of special and/or common variables is radically altered so that the second pass will generate code to save the current values before the form is executed and to restore them afterwards. Let's look at an example first. Assume that x and y have been declared common, and that z has been declared special.

    (lambda (x y z)
      (cond ((eq x y)
             (prog (z)
               (setq z x)
               (return t)))
            (t nil)))

    Here there are two forms that bind—the overall lambda expression (though arguably not a “form” in this context) and the prog. Pass one, via pa4, produces

    (λ (x y z)
      (prog (#:g479)
        (combind (quote (x y)) (list x y))
        (specbind (quote (z)))
        (setq #:g479
              (cond ((eq (eval (quote x) $alist)
                         (eval (quote y) $alist))
                     (prog (#:g480 z)
                       (specbind (quote (z)))
                       (setq #:g480
                             (prog ()
                               (setq (special z) (eval (quote x) $alist))
                               (return (quote *t*))
                               (return (quote nil))))
                       (specrstr (quote (z)))
                       (return #:g480)))
                    ((quote *t*) (quote nil))))
        (specrstr (quote (z)))
        (comrstr (quote 2))
        (return #:g479))).

    (Note that this code showcases some pass one transformations that we have not encountered yet; see below. Also, generated symbols are expressed in Common Lisp syntax.)

  • The functions specbind, specrstr, combind, and comrstr are machine-coded routines that handle binding special and common variables. No code for them exists in the M.I.T. interpreter source code available online, unfortunately. However, corresponding routines can be found in other implementations, namely an interpreter on the IBM 360 and the (compiler-only) Lisp on the AN/FSQ-32.

    More specifically, specbind takes a list of symbols whose current special bindings should be saved, and specrstr takes a list of symbols whose previous special bindings should be restored. The combind function appears to be similar, although it takes a second argument, which is a list of values to which the common variables should be bound. Finally, the comrstr function takes the number of common variables bound by combind and restores their previous bindings.

Other transformations

Here is the rest of what paform does.

  • Self-evaluating forms (numbers, nil, and *t*) are quoted.
  • The variables t and f are changed to their APVALs *t* and nil (quoted).
  • A call to not changes into a call to null.
  • A call to set changes into a call to setc.
  • A csetq form is changed into a cset form, with the argument quoted.
  • The FSUBRs select and conc are “expanded”—the former into a cond, the latter into a series of append calls.
  • The subforms of list, cond, and setq forms are paform'd as necessary.
  • A prog form is treated for special/common binding and has its non-atomic subforms paform'd. The expression (return (quote nil)) is added at the end, so that every prog has an explicit final return.
  • Calls on and and or are left alone, except that paform is applied to their subforms.
  • Other FSUBRs, and all FEXPRs, are expanded to reflect the actual calling convention, where the underlying function is passed a list of the arguments and the current environment.
  • “Funarg” expressions using function are converted into a call to a mysterious func function/operator. This does not appear to be documented anywhere, in any Lisp implementation, and is not defined in the M.I.T. interpreter. It is passed the name of a function and the current environment; if function is used with a λ expression, the expression is compiled and a name is generated for it.
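
To get a feel for these rewrites, here is a small form and what pass one ought to turn it into, worked out by hand from the rules above rather than taken from actual passone output (assuming x has not been declared special or common):

(cond ((not (eq x 3)) f)
      (t (list x 4)))
;; should become, roughly,
(cond ((null (eq x (quote 3))) (quote nil))
      ((quote *t*) (list x (quote 4))))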

Pass two

In the second pass, the Lisp code for a function, as processed by the first pass, is translated into assembly instructions to be executed by the IBM 704/709/7090/7094 (henceforth referred to as simply “the 7090”). As Bernard S. Greenberg says, in “The Multics MACLISP Compiler”,

The contract of a lisp compiler is to produce a machine-language program which, when executed, will have the same effect on the lisp environment, and return the same value, as having applied eval to the forms it was given to compile.

While the first pass is a grab-bag of assorted, special-purpose functions, the second pass is a bit more structured. Before we look at the translations of Lisp constructs, we need to be familiar with some variables and functions used throughout pass two.

The listing

The generated code is held in the listing variable, which is a list of LAP instructions. New instructions are added by passing a list to attach. Note that attach puts its argument at the front of listing; after compilation has concluded, listing is reversed.

Locating values

Local variables (parameters and prog variables) and compiler-generated temporaries are stored on the Lisp system's push-down list or PDL. (Nowadays we would call it “the stack”.) In order to load the value of such an entity, the compiler must know its position on the PDL. This is accomplished using a variable called stomap (“storage map”), and a LAP symbol defined in each function's assembly output, *N.

In compiled code, IR1 always holds the address of the highest PDL cell used by this function. The instruction for loading a variable's value into AC normally looks like (in LAP format)

(CLA (1 *N) 1).

In LAP, the notation “(1 *N)” means to add the constant 1 to the value of the symbol *N. For now it suffices to say that the value of *N is the negative of the total number of PDL cells needed for the function. Because of the way effective addresses are calculated on the 7090, the above instruction ends up loading the second cell from the top of the PDL, which is the first cell used for local storage. (When we discuss function prologues, we will uncover the true meaning of *N.)

The stomap special variable maps variable names to lists of the form ((m *N) 1), where m is an integer. For example, the function

(λ (a b c)
  (cons a (cons c b)))

results in the stomap

((#:g1964 . ((4 *N) 1))
 (c . ((3 *N) 1))
 (b . ((2 *N) 1))
 (a . ((1 *N) 1))
 (nil . ((0 *N) 1)))

The #:g1964 variable was created by the compiler to hold the result of the inner cons; the entry for nil is always present and never used.

When the compiler wishes to generate an instruction that refers to a local quantity, it calls locate, which searches in stomap for a given name and returns the entry's cdr. (It also handles “locating” quoted values, but this is trivial thanks to LAP.) For example, to generate the CLA above, we could call

(attach (list (cons (quote cla) (locate (quote a)))))

Many functions bind stomap, so it can gain and then lose entries; the variable length holds the greatest size of stomap.

Besides locate, only two functions access stomap directly:

  • phase2 initializes it to (quote ((nil (0 *n) 1))) and also sets length to 0.

  • store is a routine that adds an entry for a variable whose name is given as the first argument. It takes a second argument, which, when true, causes a STO instruction to be emitted. This function takes care of updating length.

Finally, the utility routine lac should be mentioned. It has the simple task of generating a CLA instruction to load a variable of a given name into AC (unless it is already in the simulated accumulator; see below).
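
To tie these pieces together, here is a rough Common Lisp sketch of the routines just described, assuming the association-list shape of stomap shown above. The real HLC versions also handle quoted constants and other details glossed over here, and the attach stand-in is only a guess at its exact behavior.

(defvar stomap '((nil (0 *n) 1)))   ; as initialized by phase2
(defvar ac nil)                     ; models what the accumulator will hold
(defvar listing nil)                ; the instructions generated so far

(defun attach (code)
  ;; Stand-in: put the new batch of instructions at the front of the listing.
  (setq listing (append code listing)))

(defun locate (name)
  ;; Return the PDL location recorded for NAME, e.g. ((1 *n) 1).
  (cdr (assoc name stomap)))

(defun lac (name)
  ;; Emit a CLA to load NAME into AC, unless it is already there.
  (or (eq name ac)
      (attach (list (cons 'cla (locate name))))))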

Function prologues and epilogues and linkage

Every compiled function begins with the following two instructions:

(TSX *MOVE 1 ⟨number of arguments⟩)
(TNX (E ⟨name of this function⟩) 1 *MN)

The effect is to set up the PDL, so that it contains the function's arguments and has room for any other local variables needed. Although simple enough in concept, the exact mechanism at play is rather annoyingly complex.

First of all, the second instruction isn't really an instruction at all. It's more like an argument to the *MOVE routine. The 7090 did not have a hardware stack, and passing arguments by using additional “instructions” after a TSX was one way to avoid the limitation of keeping everything in the two general-purpose registers. (This sort of thing is nearly unheard of today. If you don't know how it would have been done on the 7090, I recommend reading Jack Harper's page about subroutine linkage on the machine; otherwise, the ensuing discussion will probably be confusing.)

To understand what *MOVE does, we have to know a bit about the LISP 1.5 system. The PDL grows upwards, from a location stored in the decrement part of a TXI instruction labeled CPPI. When a compiled function is called, at least four cells of the PDL are used.

Cell          Contents
pdl[0]:       A TXL instruction with the return address in its decrement part, and the location of the present function in its address part.
pdl[1]:       The contents of AC (normally the first argument).
pdl[2]:       The contents of MQ (normally the second argument).
pdl[3]:       First cell of local storage; possibly a third argument.
pdl[*MN]:     The last cell of local storage.
pdl[*MN + 1]: An STR instruction with *MN in its decrement part.

As you can see, *MN is the total number of cells used for local storage, including the saved AC and MQ. Note that these registers are saved no matter what, even if the function takes fewer than two arguments. The value of *N is the negation of *MN.

The return address needs to be stored so that the function can return, naturally; the name is stored in order to provide a backtrace in error messages. The STR instruction is used when unwinding the stack during such a backtrace or when something fails in an errorset form.

Calling *MOVE creates a PDL entry in the above format. It also leaves a pointer to the top of the PDL in IR1, as stated above.

At the end of every compiled function is an instruction to load the return value into the accumulator, followed by

(TXI *RETURN 1 *MN)

The *RETURN routine restores the decrement part of CPPI to its previous value, puts the return address in IR4, and then transfers to it.

But the complexities don't end there. The code generated for a function call contains an STR instruction. On the 7090, STR causes the computer to transfer unconditionally to location 2, in which a TTR instruction causes a further transfer to the LINK routine. This routine invokes apply if the called function is an EXPR or an FEXPR. If the called function is a SUBR or an FSUBR, however, the routine modifies the STR to be a TSX instruction that jumps directly to the function's address.

Compiling an expression

We now turn to the code generated for the various expressions in Lisp. The most important function is comval, which, as its name suggests, compiles an expression that produces a value. (It's the most important because all Lisp expressions produce values.) The instructions generated by comval leave the result of the expression in AC.

The name parameter of comval holds a symbol that's supposed to be the name for the result. Except for the first call, when it's nil, this symbol is always created by gensym. Other functions use name as well.

During compilation, the second pass models the contents of AC during execution, using the special variable ac. Once comval has finished, it sets ac to name, since comval puts the value in AC.
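
Continuing the sketch from the previous section, the overall shape of comval might look something like the following. This is an impression of the structure, not the HLC's code, and it only covers the two simplest cases.

(defun comval (exp name)
  (cond ((atom exp) (lac exp))                       ; a variable reference
        ((eq (car exp) 'quote)
         (attach (list (list 'cla exp))))            ; a constant: (CLA (QUOTE ...))
        (t (error "form not handled in this sketch: ~s" exp)))
  (setq ac name))                                    ; record what AC now holds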

Compiling atoms and variables

Atoms and variables take no effort: the output is a single CLA instruction, produced by lac; LAP takes care of the dirty work.

Compiling assignments

The code generated for the expression

(setq x ⟨expression⟩)

is

code for ⟨expression⟩
(STO ⟨location of x⟩),

where ⟨location of x⟩ is the result of calling locate on x.
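
As a reconstructed illustration (not actual compiler output), take the stomap from the (λ (a b c) …) example earlier, where a, b, and c sit at offsets 1, 2, and 3. Assuming cons is called like any other two-argument function, (setq c (cons a b)) should come out as something like

(CLA (1 *N) 1)       ; load a, the first argument, into AC
(LDQ (2 *N) 1)       ; load b, the second argument, into MQ
(STR (E cons) 7 2)   ; call cons; its result is left in AC
(STO (3 *N) 1)       ; store that result into c's cell on the PDL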

Compiling function calls

The instruction that calls a normal compiled or interpreted function f, generated by call, is

(STR (E f) 7 n),

where n is the number of arguments. Before the STR, the compiler needs to set up the arguments, if the function requires them. The evaluation of the arguments is handled by comlis, which applies comval to the non-atomic non-constant subexpressions in the call. Each such subexpression is replaced by a temporary name, which is put on the stomap. Thus all arguments are symbols or constants, when call gets hold of them.

The first argument is passed in AC; call checks ac to avoid an unnecessary CLA instruction. The second argument is passed in MQ.

All other arguments are stored in locations $ALIST + 3, $ALIST + 4, etc. The process for each argument a at position i in the argument list has three cases:

  1. If a is nil, the output is

    (STZ ($ALIST i)).
  2. If a is in the accumulator, the output is

    (STO ($ALIST i)).
  3. Otherwise, the output is

    (LDQ ⟨location of a⟩)
    (STQ ($ALIST i)).

Because MQ is used in the third case, the second argument must be reloaded afterwards. If the second argument is in AC because of comlis, an XCA instruction is generated. For example,

(f nil (g 2))

results in

(CLA (quote 2)) ; set up argument of g
(STR (E g) 7 1) ; call g
; the result of g is now in the accumulator
(XCA) ; but it's the second argument, so move it to MQ
(CLA (quote nil)) ; put first argument of f in AC
(STR (E f) 7 2) ; call f

In the common case where the second argument isn't already in the accumulator, the instruction generated is

(LDQ ⟨result of calling locate on the second argument⟩).

There seems to be an unreachable code path in call. It checks whether the first two arguments are the same, and if that value is in the accumulator it generates instructions to store it in $ALIST + 2 and to copy it to MQ. However, comlis always generates a unique symbol for each subexpression and never lets ac contain a name that wasn't created by gensym. In other words, after calling comlis, ac will hold a freshly generated symbol. Perhaps this is a remnant of an earlier version.

The function list, in compiled code, uses an odd calling convention (like *MOVE), where arguments are passed in instructions following a TSX. More specifically, the code

(list a1 … an)

produces

(TSX *LIST 4)
(16777216 × n a1)
(0 a2)
…
(0 an)

Note that 16777216 is 100000000 in octal. If the list call has no arguments then the compiler generates an instruction to load the constant nil.

The other functions with an odd calling convention are specbind and specrstr, which were inserted into the input in pass one. They both use the same convention:

(specbind/specrstr (quote (s1 … sn)))

compiles to

(TSX specbind/specrstr 4)
(0 (⟨offset of s1 on the PDL⟩ *N) 1 (SPECIAL s1))
(0 (⟨offset of s2 on the PDL⟩ *N) 1 (SPECIAL s2))
…
(STR (⟨offset of sn on the PDL⟩ *N) 1 (SPECIAL sn))

In LISP 1.5, the car of an expression can be non-atomic, as in

((λ (x) ⟨expression1⟩)
 ⟨expression2⟩).

The interpreter will evaluate ⟨expression1⟩ in an environment where x is bound to the result of evaluating ⟨expression2⟩. The compiler basically does the exact same thing, using a helper function named comply (perhaps a portmanteau of “compile” and “apply”). Each argument is compiled; its result is put on the stomap under the name of the corresponding parameter. Then the local function's body is compiled with the new stomap containing entries for the parameters.
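
In Common Lisp terms, such a form is simply an inline binding; for example,

((lambda (x) (cons x x)) (car y))
;; is equivalent to
(let ((x (car y)))
  (cons x x))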

Compiling prog forms

The “program feature” is the least Lispy part of Lisp, so it makes sense that its compilation is straightforward; the function responsible is comprog. First, comprog goes through the form, making a list that pairs tags with generated symbols, to avoid naming conflicts with any inner progs. This list is stored in the special variable golist, which is also used by comcond, a function we will discuss momentarily. The symbols in the golist turn into labels in the LAP output.

Code is generated to set the prog variables to 0 (nil), unless the variable was added in pass one to help save/restore the values of special variables.

The forms in the body of the prog are then comval'd, except that go forms are handled specially. Namely, the statement

(go x)

becomes a simple

(TRA ⟨generated label corresponding to x⟩).

Compiling return expressions is not done by comprog, but by call. A return form has two possible translations, depending on whether the prog in which it occurs is the main body of a function or a subexpression. In the former case, the output is

⟨code to load the return value into AC⟩
(TXI *RETURN 1 *MN).

In the latter case, the output is

⟨code to load the return value into AC⟩
(TRA ⟨a label generated to mark the end of the prog⟩).

The last thing comprog does is attach that generated label to the assembly output, if it exists.
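
To see just the skeleton that comprog produces, consider a trivial prog that exercises only the tag, the go, and the variable initialization. The following is my reconstruction, not actual output: the label g1001 stands in for whatever gensym lands on the golist, y is assumed to occupy the cell at offset 1 with the prog variable x at offset 2, and the use of STZ to zero x's cell is a guess.

(prog (x)
 a (setq x (cdr y))
   (go a))

might compile, in skeleton, to

(STZ (2 *N) 1)                      ; set the prog variable x to 0, i.e. nil
g1001 ⟨code for (setq x (cdr y))⟩   ; the label generated for the tag a
(TRA g1001)                         ; (go a)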

Compiling conditionals

The routines responsible for compiling conditionals and Boolean expressions are comcond, combool, and compact, which have (more or less) the following purposes.

  • combool compiles an and or or form.

  • compact compiles a form used as a test in a conditional.

  • comcond compiles a cond form.

Understanding these routines is complicated somewhat by the fact that combool and compact call each other—but this is reasonable because and and or can be written in terms of cond.

The code generated for the expression

(and e1 e2 e3)

is

⟨code to evaluate e1⟩
(TZE label)
⟨code to evaluate e2⟩
(TZE label)
⟨code to evaluate e3⟩
(TZE label)
(CLA (quote *t*))
(TRA (* 2))
label (CLA (quote nil)).

The code for

(or e1 e2 e3)

is quite similar.

⟨code to evaluate e1⟩
(TNZ label)
⟨code to evaluate e2⟩
(TNZ label)
⟨code to evaluate e3⟩
(TNZ label)
(CLA (quote nil))
(TRA (* 2))
label (CLA (quote *t*))

The label will have been generated by gensym. In real code, some other instructions will follow the translation, so the TRA is meaningful.

In the LISP 1.5 system, the atom nil happens to be stored at location 0. Thus the code can compare the AC with 0 to test for truth.

The translations of and and or are different when they occur in the test (or antecedent) part of a cond clause. In a cond, the TNZ and TZE instructions jump to a label generated by comcond, instead of the one generated in combool (which ends up in the assembly output but is never transferred to). Furthermore, the code doesn't load *t* or nil.

A cond form

(cond (test1 consequent1)
      (test2 consequent2)
      ((not test3) consequent3)
      (t consequent4))

results in output like

⟨code to evaluate test1⟩
(TZE false-label1)
⟨code to evaluate consequent1⟩
(TRA end-conditional)
false-label1 ⟨code to evaluate test2⟩
(TZE false-label2)
⟨code to evaluate consequent2⟩
(TRA end-conditional)
false-label2 ⟨code to evaluate test3⟩
(TNZ false-label3)
⟨code to evaluate consequent3⟩
(TRA end-conditional)
false-label3 ⟨code to evaluate consequent4⟩
end-conditional … .

As you can see, the code for the final clause does not “test” t; instead it simply goes on to consequent4. A call to not is also avoided by using TNZ instead of TZE. Another clever optimization, not shown here, involves go. When a clause's consequent is a go expression, the condition is inverted (by wrapping the test in null) and one of compact's arguments is abused so that a successful test transfers to the go statement's target.

There remains only one loose end: A special case in compact, using a subroutine named ceq, compiles eq expressions into SUB instructions.
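
I have not traced ceq's output in detail, but presumably a test like (eq a b) in a cond clause produces something in the spirit of the following reconstruction (the exact instructions and jump sense are guesses):

⟨code to load a into AC⟩
(SUB ⟨location of b⟩)              ; AC becomes zero exactly when a and b are eq
(TNZ ⟨label of the next clause⟩)   ; a non-zero difference means the test failed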