Introduction
This page examines the earliest surviving Lisp compiler, written by Timothy Hart and Michael Levin around 1961. To avoid repetition of phrases like “the compiler”, we will use the abbreviation “HLC” (for “Hart–Levin Compiler”) to refer to it. The HLC was introduced in M.I.T. AIM-039 and documented more fully in the LISP 1.5 Programmer's Manual by McCarthy et al.
For information about the timeline of Lisp compilers, I refer you to the question “What was the first Lisp compiler” on the Retrocomputing StackExchange site; the answer there gives an excellent overview. We will be focusing on implementation details. It should be mentioned, however, that although the title of this page is “The First Lisp Compiler”, the HLC is really the second Lisp compiler. A more correct title would be “The First Good Lisp Compiler” or “The First Lisp Compiler Whose Source Code is Available”.
In order to understand the HLC, a decent amount of prerequisite knowledge is required about the machine LISP 1.5 ran on and about LISP 1.5 itself. If you aren't already familiar, I recommend reading the LISP 1.5 Programmer's Manual linked above and checking out Jack Harper's site on the IBM 7090. You might also be interested in a GitHub issue I wrote a little while ago.
The source
The source code of the HLC comes from an archive of CTSS source code, namely lines 278–592 of file ctss/lispset.job. Reading the code in its original form is quite difficult, since there is virtually no indentation or meaningful spacing. Occasionally atoms will start on one line/card and end on the next! And of course all-uppercase text is not known for being especially readable. Therefore I have prepared a formatted and cross-referenced version of the code for easy viewing.
Even though the code looks pretty, I still find it hard to follow. All destructuring is done with compositions of car and cdr. (I'm not criticizing the authors here, since it's not like LISP 1.5 gave them anything better to work with. But it feels like a step down when compared to Common Lisp, where we have fancy lambda lists and destructuring-bind.) An additional obscurity lies in the choices of function and variable names. For example, the name pi2 provides no clue as to its purpose (which is to construct a setq form). I think this is an assembly language convention.
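To illustrate the destructuring point with a contrived Common Lisp example of my own (the function names and the shape of the data are invented, not taken from the HLC), compare pulling apart a structure like ((name (arg1 arg2)) body) with car/cdr compositions versus a destructuring lambda list:

    ;; LISP 1.5 style: access by compositions of car and cdr.
    (defun parts-old-style (x)
      (list (caar x)       ; name
            (cadar x)      ; argument list
            (cadr x)))     ; body

    ;; Common Lisp style: let DESTRUCTURING-BIND name the pieces.
    (defun parts-new-style (x)
      (destructuring-bind ((name args) body) x
        (list name args body)))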
To run the code, you could try simulating the LISP 1.5 system, or you can use my translation to Common Lisp. Note: This translation does the bare minimum to get it running. I make no attempt to use any high-level Common Lisp constructs; the aim was to leave the code as unchanged as possible except to add indentation. To compile the compiler, uncomment the lines at the bottom.
The source code of an earlier version of the HLC also exists, in a document titled “LISP LIBRARY NOVEMBER 1963”.
Fundamentals of the HLC
Here is how the implementers describe the HLC (from AIM-039):
… It is written entirely in LISP and is the first compiler that has ever compiled itself by being executed interpretively.
Another early effort that might also claim this honor is Donald Knuth's RUNCIBLE; Knuth described an algorithm to translate RUNCIBLE expressions into assembly language, and implemented it in RUNCIBLE itself.
The purpose of the LISP compiler is to replace S-expression definitions of functions with efficient machine language subroutines. A subroutine can be expected to run about 40 times as fast as the interpreter can execute the same function from its S-expression definition. Subroutines typically take 70–80 per cent of the storage required by their corresponding S-expressions.
Such a drastic speed-up was possible because the LISP 1.5 interpreter was not especially efficient.
The HLC was notable for several aspects:
- It was the first compiler to attempt to transform a tail call into a transfer instruction that doesn't incur the function call overhead. However, it recognized only a certain case of tail calls.
- It created a behavioral distinction between compiled and interpreted Lisp code that would be the norm until Common Lisp in 1984.
- It did no control-flow or data-flow optimizations other than the tail call elimination. In particular, it did not attempt to avoid conversions to and from boxed representations of numeric values; all arithmetic was still done by function calls. No functions were open-coded.
- It operated in two clearly separated passes.
Using the HLC
The main entry point to the HLC is the compile function, which takes a list of function names and compiles their definitions. In LISP 1.5, function definitions were stored on the property list of the function's name, under one of the indicators EXPR (for interpreted definitions), FEXPR (interpreted and accepting a variable number of unevaluated arguments), SUBR (implemented in machine code), or FSUBR (implemented in machine code and accepting a variable number of unevaluated arguments). The HLC replaces an EXPR or FEXPR property with a SUBR or FSUBR property.
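As a rough sketch of that mechanism in Common Lisp terms (modern Lisps do not actually keep definitions this way; the function name and the machine-code placeholder here are invented):

    ;; Before compilation: an interpreted definition under the EXPR indicator.
    (setf (get 'double 'expr) '(lambda (x) (plus x x)))

    ;; Compiling DOUBLE has, in effect, this result: the EXPR property is
    ;; gone and a SUBR property points at the assembled machine code.
    (remprop 'double 'expr)
    (setf (get 'double 'subr) "<pointer to machine code>")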
Code acceptable to the interpreter is not necessarily acceptable to the HLC. For managing the values of variables, the interpreter used a simple association list; it had no concept of scope. On the other hand, compiled code avoids manipulation of an environment by storing values on the push-down list (AKA the stack). Thus the HLC effectively supported lexical scoping. The HLC still allowed variables with indefinite scope, but they must be explicitly declared as such by using the special function. If a special variable was also meant to be shared between compiled and interpreted functions, it must be declared in a different way, with the common function. (Declarations can be reversed by the unspecial and uncommon functions.) We will see later how the special/common variable mechanism worked internally.
The HLC generated code in the form of a symbolic listing and handed it to lap, the Lisp Assembly Program, which converted the listing into machine code and “installed” it in the computer's memory. The HLC also printed out the listing.
Pass one
Pass one of the HLC is encapsulated within the function passone. The goal of the first pass is to transform a Lisp expression into another Lisp expression that is potentially easier for the compiler to deal with. In some cases, the output of passone is not a valid Lisp expression, but rather a kind of annotation for the second pass to act on.
The main “worker” in pass one is paform, which is given any form that is to be evaluated. For instance, in a normal function call paform is mapped over all the arguments. Caution must be exercised with operators that don't follow ordinary evaluation rules: It would be quite incorrect to hand paform the subforms of a cond expression.
Tail call elimination
The first transformation applied converts a specific case of recursion into iteration, i.e., it performs (limited) tail call elimination. This is done by the progiter subroutine. Suppose that we are defining a function named f. Then progiter checks to see whether f's definition is of the form
    (cond (⟨test1⟩ ⟨result1⟩)
          …
          (⟨testk⟩ (f ⟨arg1⟩ … ⟨argn⟩))
          …)
In other words, it looks for a cond expression that has at least one clause whose consequent expression is a recursive call to the function being defined. If such an expression is found, the progiter1 function is called upon to do the actual transformation.
Four arguments are given to progiter1:
- g1, a fresh, uninterned symbol created by gensym. This will be used as a prog tag.
- g2, ditto.
- vs, the parameters of the function being defined.
- gs, a list of fresh symbols, one for each parameter.
The goal is to create a prog form, replacing recursive calls with go statements and replacing other cond clauses with return statements. Here is what we might get when member is run through progiter (the generated symbols use Common Lisp syntax):
    (λ (u v)
      (prog (…)
        …
       #:g479
        …))
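For orientation, here is a hand-written Common Lisp rendition of the kind of prog the transformation produces for member. The variable and tag names are my own, not the HLC's gensyms, and Common Lisp's multi-form cond clauses let it be shaped a little differently from the compiler's actual output:

    (defun iterative-member (u v)
      (prog (temp-u temp-v)
       again
        (cond ((null v) (return nil))
              ((equal u (car v)) (return t))
              (t (setq temp-u u)         ; compute the new argument values first...
                 (setq temp-v (cdr v))
                 (setq u temp-u)         ; ...then assign them to the parameters
                 (setq v temp-v)
                 (go again)))))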
I confess that I am unsure why the tag #:g479 (and the go thereto) is necessary. However, the extra variables for the parameters are essential to avoid problems with indefinite scope. For example, let's say that we have a function g defined as follows.
    (λ (x y)
      (cond (… …)
            (… (g (h …) (cdr y)))))
If y is declared special, and h modifies y, then the obvious expansion that directly sets x and y would be incorrect, since y would be changed before (cdr y) is evaluated. (In Common Lisp, psetq or psetf could be used to achieve the same effect.)
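The hazard is the classic sequential-versus-parallel assignment problem; in Common Lisp terms:

    (let ((x 1) (y 2))
      (setq x y               ; x becomes 2...
            y x)              ; ...and y then sees the NEW x
      (list x y))             ; => (2 2)

    (let ((x 1) (y 2))
      (psetq x y              ; both right-hand sides are evaluated
             y x)             ; before either variable is changed
      (list x y))             ; => (2 1)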
Special and common variables
To make the job of the second pass easier, pass one determines and marks all special and common variables. Specifically, the following changes are made:
- A reference to a special variable x will be turned into (special x). Thus the form (setq x y), where x and y have been declared special, becomes (setq (special x) (special y)). Note that this code is not valid LISP 1.5; the compiler will deal with it “specially”.
- A reference to a common variable c will be turned into the expression (eval (quote c) $alist). The effect is to invoke the interpreter to compute the value of c. Of course, this won't work with setq, so…
- A setq of a common variable c becomes a call to setc, with the variable's name quoted, as in (setc (quote c) whatever). This setc function is obscure; it is not defined in the source code of the M.I.T. interpreter. It is not to be confused with the cset function, which is well documented. There is a definition of setc in the reference manual for a LISP 1.5 interpreter on the IBM 360, but it is not compatible with the setc here.
- Any form that creates a binding of special and/or common variables is radically altered so that the second pass will generate code to save the current values before the form is executed and to restore them afterwards. Let's look at an example first. Assume that x and y have been declared common, and that z has been declared special.
    (lambda (x y z)
      (cond ((eq x y) (prog (z) (setq z x) (return t)))
            (t nil)))

Here there are two forms that bind—the overall lambda expression (though arguably not a “form” in this context) and the prog. Pass one, via pa4, produces

    (λ (x y z)
      (prog (#:g479)
        (combind (quote (x y)) (list x y))
        (specbind (quote (z)))
        (setq #:g479
              (cond ((eq (eval (quote x) $alist) (eval (quote y) $alist))
                     (prog (#:g480 z)
                       (specbind (quote (z)))
                       (setq #:g480
                             (prog ()
                               (setq (special z) (eval (quote x) $alist))
                               (return (quote *t*))
                               (return (quote nil))))
                       (specrstr (quote (z)))
                       (return #:g480)))
                    ((quote *t*) (quote nil))))
        (specrstr (quote (z)))
        (comrstr (quote 2))
        (return #:g479)))

(Note that this code showcases some pass one transformations that we have not encountered yet; see below. Also, generated symbols are expressed in Common Lisp syntax.)
The functions specbind, specrstr, combind, and comrstr are machine-coded routines that handle binding special and common variables. No code for them exists in the M.I.T. interpreter source code available online, unfortunately. However, corresponding routines can be found in other implementations, namely an interpreter on the IBM 360 and the (compiler-only) Lisp on the AN/FSQ-32.
More specifically, specbind takes a list of symbols whose current special bindings should be saved, and specrstr takes a list of symbols whose previous special binding should be restored. The combind function appears to be similar, although it takes a second argument, which is a list of values to which the common variables should be bound. Finally, the comrstr function takes the number of common variables bound by combind and restores their previous binding.
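To pin down the contract, here is a rough Common Lisp model of the special-variable pair (my own sketch; the real routines are machine-coded and keep the saved values on the push-down list, not in a hash table):

    (defvar *saved-values* (make-hash-table))   ; symbol -> stack of old values

    (defun model-specbind (symbols)
      ;; Save the current value of each special variable.
      (dolist (s symbols)
        (push (symbol-value s) (gethash s *saved-values*))))

    (defun model-specrstr (symbols)
      ;; Restore the values saved by the matching model-specbind.
      (dolist (s symbols)
        (setf (symbol-value s) (pop (gethash s *saved-values*)))))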
Other transformations
Here is the rest of what paform does.
- Self-evaluating forms (numbers, nil, and *t*) are quoted.
- The variables t and f are changed to their APVALs *t* and nil (quoted).
- A call to not changes into a call to null.
- A call to set changes into a call to setc.
- A csetq form is changed into a cset form, with the argument quoted.
- The FSUBRs select and conc are “expanded”—the former into a cond, the latter into a series of append calls.
- The subforms of list, cond, and setq forms are paform'd as necessary.
- A prog form is treated for special/common binding and has its non-atomic subforms paform'd. The expression (return (quote nil)) is added at the end, so that every prog has an explicit final return.
- Calls on and and or are left alone, except that paform is applied to their subforms.
- Other FSUBRs, and all FEXPRs, are expanded to reflect the actual calling convention, where the underlying function is passed a list of the arguments and the current environment.
- “Funarg” expressions using function are converted into a call to a mysterious func function/operator. This does not appear to be documented anywhere, in any Lisp implementation, and is not defined in the M.I.T. interpreter. It is passed the name of a function and the current environment; if function is used with a λ expression, the expression is compiled and a name is generated for it.
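To give the flavor of a few of the simpler rewrites in this list, here is a toy fragment of my own (not paform itself), ignoring the operators that need individual treatment:

    (defun toy-paform (form)
      (cond ((numberp form) (list 'quote form))       ; self-evaluating forms get quoted
            ((eq form t) '(quote *t*))                ; t becomes its APVAL, quoted
            ((null form) '(quote nil))                ; likewise nil
            ((atom form) form)                        ; an ordinary variable
            ((eq (car form) 'not)                     ; not turns into null
             (cons 'null (mapcar #'toy-paform (cdr form))))
            (t (cons (car form)                       ; ordinary call: process the arguments
                     (mapcar #'toy-paform (cdr form))))))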
Pass two
In the second pass, the Lisp code for a function, as processed by the first pass, is translated into assembly instructions to be executed by the IBM 704/709/7090/7094 (henceforth referred to as simply “the 7090”). As Bernard S. Greenberg says, in “The Multics MACLISP Compiler”,
The contract of a lisp compiler is to produce a machine-language program which, when executed, will have the same effect on the lisp environment, and return the same value, as having applied eval to the forms it was given to compile.
While the first pass is a grab-bag of assorted, special-purpose functions, the second pass is a bit more structured. Before we look at the translations of Lisp constructs, we need to be familiar with some variables and functions used throughout pass two.
The listing
The generated code is held in the listing variable, which is a list of LAP instructions. New instructions are added by passing a list to attach. Note that attach puts its argument at the front of listing; after compilation has concluded, listing is reversed.
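A minimal model of this arrangement (assumed shape, not the HLC's own definitions):

    (defvar *listing* nil)           ; stands in for the compiler's listing variable

    (defun model-attach (instruction)
      ;; New instructions go on the front, so the finished listing
      ;; is obtained with (reverse *listing*).
      (push instruction *listing*))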
Locating values
Local variables (parameters and prog variables) and compiler-generated temporaries are stored on the Lisp system's push-down list or PDL. (Nowadays we would call it “the stack”.) In order to load the value of such an entity, the compiler must know its position on the PDL. This is accomplished using a variable called stomap (“storage map”) and a LAP symbol, *N, defined in each function's assembly output.
In compiled code, IR1 always holds the address of the highest PDL cell used by this function. The instruction for loading a variable's value into AC normally looks like (in LAP format)
    (CLA (1 *N) 1).
In LAP, the notation “(1 *N)” means to add the constant 1 to the value of the symbol *N. For now it suffices to say that the value of *N is the negative of the total number of PDL cells needed for the function. Because of the way effective addresses are calculated on the 7090, the above instruction ends up loading the second cell from the top of the PDL, which is the first cell used for local storage. (When we discuss function prologues, we will uncover the true meaning of *N.)
The stomap special variable maps variable names to lists of the form ((m *N) 1), where m is an integer. For example, a function along the lines of

    (λ (a b c)
      (cons a (cons b c)))

results in a stomap like

    ((#:g1964 (4 *N) 1)
     (c (3 *N) 1)
     (b (2 *N) 1)
     (a (1 *N) 1)
     (nil (0 *N) 1)).

The #:g1964 variable was created by the compiler to hold the result of the inner cons; the entry for nil is always present and never used.
When the compiler wishes to generate an instruction that refers to a local quantity, it calls locate, which searches stomap for a given name and returns the entry's cdr. (It also handles “locating” quoted values, but this is trivial thanks to LAP.)
For example, to generate the CLA above, we could call locate on a, which returns ((1 *N) 1), and cons the operation code CLA onto the result.
Many functions bind stomap, so it can gain and then lose entries; the variable length holds the greatest size of stomap.
Besides locate, only two functions access stomap directly:

- phase2 initializes it to (quote ((nil (0 *n) 1))) and also sets length to 0.
- store is a routine that adds an entry for a variable whose name is given as the first argument. It takes a second argument, which, when true, causes a STO instruction to be emitted. This function takes care of updating length.
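A toy Common Lisp model of this bookkeeping (my own sketch; the real store also decides whether to emit the STO and maintains length):

    (defvar *stomap* '((nil (0 *n) 1)))     ; as phase2 initializes it

    (defun model-locate (name)
      ;; Return the cdr of NAME's entry, e.g. ((1 *N) 1).
      (cdr (assoc name *stomap*)))

    (defun model-store (name)
      ;; Give NAME the next free cell of local storage on the PDL.
      (push (list name (list (length *stomap*) '*n) 1) *stomap*)
      (model-locate name))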
Finally, the utility routine lac should be mentioned. It has the simple task of generating a CLA instruction to load a variable of a given name into AC (unless it is already in the simulated accumulator; see below).
Function prologues, epilogues, and linkage
Every compiled function begins with the following two instructions:
    (TSX *MOVE 1 ⟨number of arguments⟩)
    (TNX (E ⟨name of this function⟩) 1 *MN)
The effect is to set up the PDL, so that it contains the function's arguments and has room for any other local variables needed. Although simple enough in concept, the exact mechanism at play is rather annoyingly complex.
First of all, the second instruction isn't really an instruction at all. It's more like an argument to the *MOVE routine. The 7090 did not have a hardware stack, and passing arguments by using additional “instructions” after a TSX was one way to avoid the limitation of keeping everything in the two general-purpose registers. (This sort of thing is nearly unheard of today. If you don't know how it would have been done on the 7090, I recommend reading Jack Harper's page about subroutine linkage on the machine; otherwise, the ensuing discussion will probably be confusing.)
To understand what *MOVE does, we have to know a bit about the LISP 1.5 system. The PDL grows upwards, from a location stored in the decrement part of a TXI instruction labeled CPPI. When a compiled function is called, at least four cells of the PDL are used.
| Cell | Contents |
| --- | --- |
| pdl[0] | A TXL instruction with the return address in its decrement part, and the location of the present function in its address part. |
| pdl[1] | The contents of AC (normally the first argument). |
| pdl[2] | The contents of MQ (normally the second argument). |
| pdl[3] | First cell of local storage; possibly a third argument. |
| ⋮ | |
| pdl[*MN] | The last cell of local storage. |
| pdl[*MN + 1] | An STR instruction with *MN in its decrement part. |
As you can see, *MN is the total number of cells used for local storage, including the saved AC and MQ. Note that these registers are saved no matter what, even if the function takes fewer than two arguments. The value of *N is the negation of *MN.
The return address needs to be stored so that the function can return, naturally; the name is stored in order to provide a backtrace in error messages. The STR instruction is used when unwinding the stack during such a backtrace or when something fails in an errorset form.
Calling *MOVE creates a PDL entry in the above format. It also leaves a pointer to the top of the PDL in IR1, as stated above.
At the end of every compiled function is an instruction to load the return value into the accumulator, followed by
    (TXI *RETURN 1 *MN)

The *RETURN routine restores the decrement part of CPPI to its previous value, puts the return address in IR4, and then transfers to it.
But the complexities don't end there. The code generated for a function call contains an STR instruction. On the 7090, STR causes the computer to transfer unconditionally to location 2, in which a TTR instruction causes a further transfer to the LINK routine. This routine invokes apply if the called function is an EXPR or an FEXPR. If the called function is a SUBR or an FSUBR, however, the routine modifies the STR to be a TSX instruction that jumps directly to the function's address.
Compiling an expression
We now turn to the code generated for the various expressions in Lisp. The most important function is comval, which, as its name suggests, compiles an expression that produces a value. (It's the most important because all Lisp expressions produce values.) The instructions generated by comval leave the result of the expression in AC.
The name parameter of comval holds a symbol that's supposed to be the name for the result. Except for the first call, when it's nil, this symbol is always created by gensym. Other functions use name as well.
During compilation, the second pass models the contents of AC during execution, using the special variable ac. Once comval has finished, it sets ac to name, since comval puts the value in AC.
Compiling atoms and variables
Atoms and variables take no effort. The result is a CLA instruction produced by lac, because LAP takes care of the dirty work.
Compiling assignments
The code generated for the expression

    (setq x ⟨expression⟩)

is

    code for ⟨expression⟩
    (STO ⟨location of x⟩),

where ⟨location of x⟩ is the result of calling locate on x.
Compiling function calls
The instruction that calls a normal compiled or interpreted function f, generated by call, is

    (STR (E f) 7 n),
where n is the number of arguments. Before the STR, the compiler needs to set up the arguments, if the function requires them. The evaluation of the arguments is handled by comlis, which applies comval to the non-atomic non-constant subexpressions in the call. Each such subexpression is replaced by a temporary name, which is put on the stomap. Thus all arguments are symbols or constants, when call gets hold of them.
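That flattening step can be modeled like this (my own sketch of the behavior just described; the names flatten-args and compile-subform are invented, and the compilation of each subexpression is left as a callback):

    (defun flatten-args (args compile-subform)
      (mapcar (lambda (arg)
                (if (or (atom arg) (eq (car arg) 'quote))
                    arg                                  ; symbols and constants pass through
                    (let ((temp (gensym)))
                      ;; Emit the code for ARG, naming its result TEMP,
                      ;; then hand the caller the temporary's name.
                      (funcall compile-subform arg temp)
                      temp)))
              args))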
The first argument is passed in AC; call checks ac to avoid an unnecessary CLA instruction. The second argument is passed in MQ.
All other arguments are stored in locations $ALIST + 3, $ALIST + 4, etc. The process for each argument a at position i in the argument list has three cases:
- If a is nil, the output is (STZ ($ALIST i)).
- If a is in the accumulator, the output is (STO ($ALIST i)).
- Otherwise, the output is (LDQ ⟨location of a⟩) (STQ ($ALIST i)).
Because MQ is used in the third case, the second argument must be reloaded afterwards. If the second argument is in AC because of comlis, an XCA instruction is generated. For example,
    (f nil (g 2))
results in
    (CLA (quote 2))      ; set up argument of g
    (STR (E g) 7 1)      ; call g
                         ; the result of g is now in the accumulator
    (XCA)                ; but it's the second argument, so move it to MQ
    (CLA (quote nil))    ; put first argument of f in AC
    (STR (E f) 7 1)      ; call f
In the common case where the second argument isn't already in the accumulator, the instruction generated is
    (LDQ ⟨result of calling locate on the second argument⟩).
There seems to be an unreachable code path in call. It checks whether the first two arguments are the same, and if that value is in the accumulator it generates instructions to store it in $ALIST + 2 and to copy it to MQ. However, comlis always generates a unique symbol for each subexpression and never lets ac contain a name that wasn't created by gensym. In other words, after calling comlis, ac will be a freshly generated symbol. Perhaps this is a remnant of an earlier version.
The function list, in compiled code, uses an odd calling convention (like *MOVE), where arguments are passed in instructions following a TSX. More specifically, the code
    (list a1 … an)
produces
    (TSX *LIST 4)
    (16777216 × n a1)
    (0 a2)
    ⋮
    (0 an)
Note that 16777216 is 100000000 in octal. If the list call has no arguments then the compiler generates an instruction to load the constant nil.
The other functions with odd calling conventions are specbind and specrstr, which were inserted into the input in pass one. They both use the same convention:
    (specbind/specrstr (quote (s1 … sn)))
compiles to
    (TSX specbind/specrstr 4)
    (0 (⟨offset of s1 on the PDL⟩ *N) 1 (SPECIAL s1))
    (0 (⟨offset of s2 on the PDL⟩ *N) 1 (SPECIAL s2))
    ⋮
    (STR (⟨offset of sn on the PDL⟩ *N) 1 (SPECIAL sn))
In LISP 1.5, the car of an expression can be non-atomic, as in
    ((lambda (x) ⟨expression1⟩) ⟨expression2⟩).
The interpreter will evaluate ⟨expression1⟩ in an environment where x is bound to the result of evaluating ⟨expression2⟩. The compiler basically does the exact same thing, using a helper function named comply (perhaps a portmanteau of “compile” and “apply”). Each argument is compiled; its result is put on the stomap under the name of the corresponding parameter. Then the local function's body is compiled with the new stomap containing entries for the parameters.
Compiling prog forms
The “program feature” is the least Lispy part of Lisp, so it makes sense that its compilation is straightforward; the function responsible is comprog. First, comprog goes through the form, making a list that pairs tags with generated symbols, to avoid naming conflicts with any inner progs. This list is stored in the special variable golist, which is also used by comcond, a function we will discuss momentarily. The symbols in the golist turn into labels in the LAP output.
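The tag-renaming step amounts to something like the following sketch (my own assumed shape, not comprog itself; in a prog body, an atomic element is a tag, and the body here is taken to be the form with the variable list already removed):

    (defun model-make-golist (body)
      ;; Pair each tag with a fresh symbol to use as a LAP label.
      (loop for form in body
            when (atom form)
              collect (cons form (gensym "LABEL"))))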
Code is generated to set the prog variables to 0 (nil), unless the variable was added in pass one to help save/restore the values of special variables.
The forms in the body of the prog are then comval'd, except that go forms are handled specially. Namely, the statement
    (go x)
becomes a simple
(TRA ⟨generated label corresponding to x⟩). |
Compiling return expressions is not done by comprog, but by call. A return form has two possible translations, depending on whether the prog in which it occurs is the main body of a function or a subexpression. In the former case, the output is
    ⟨code to load the return value into AC⟩
(TXI *RETURN 1 *MN). |
In the latter case, the output is
    ⟨code to load the return value into AC⟩
(TRA ⟨a label generated to mark the end of the prog⟩). |
The last thing comprog does is attach that generated label to the assembly output, if it exists.
Compiling conditionals
The routines responsible for compiling conditionals and Boolean expressions are comcond, combool, and compact, which have (more or less) the following purposes.
- combool compiles and and or forms.
- compact compiles a form used as a test in a conditional.
- comcond compiles a cond form.
Understanding these routines is complicated somewhat by the fact that combool and compact call each other—but this is reasonable because and and or can be written in terms of cond.
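Concretely, the equivalences being leaned on look like this (my own toy macros, following LISP 1.5's truth-value semantics rather than Common Lisp's value-returning and/or):

    (defmacro and-1.5 (&rest forms)
      ;; (and-1.5 a b) expands to (cond ((null a) nil) ((null b) nil) (t t)).
      `(cond ,@(mapcar (lambda (f) `((null ,f) nil)) forms)
             (t t)))

    (defmacro or-1.5 (&rest forms)
      ;; (or-1.5 a b) expands to (cond (a t) (b t) (t nil)).
      `(cond ,@(mapcar (lambda (f) `(,f t)) forms)
             (t nil)))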
The code generated for the expression
    (and e1 e2 e3)
is
          ⟨code to evaluate e1⟩
          (TZE label)
          ⟨code to evaluate e2⟩
          (TZE label)
          ⟨code to evaluate e3⟩
          (TZE label)
          (CLA (quote *t*))
          (TRA (* 2))
    label (CLA (quote nil)).
The code for
    (or e1 e2 e3)
is quite similar.
          ⟨code to evaluate e1⟩
          (TNZ label)
          ⟨code to evaluate e2⟩
          (TNZ label)
          ⟨code to evaluate e3⟩
          (TNZ label)
          (CLA (quote nil))
          (TRA (* 2))
    label (CLA (quote *t*))
The label will have been generated by gensym. In real code, some other instructions will follow the translation, so the TRA is meaningful.
In the LISP 1.5 system, the atom nil happens to be stored at location 0. Thus the code can compare the AC with 0 to test for truth.
The translations of and and or are different when they occur in the test (or antecedent) part of a cond clause. In a cond, the TNZ and TZE instructions jump to a label generated by comcond, instead of the one generated in combool (which ends up in the assembly output but is never transferred to). Furthermore, the code doesn't load *t* or nil.
A cond form
    (cond (test1 consequent1)
          (test2 consequent2)
          ((not test3) consequent3)
          (t consequent4))
results in output like
                    ⟨code to evaluate test1⟩
                    (TZE false-label1)
                    ⟨code to evaluate consequent1⟩
                    (TRA end-conditional)
    false-label1    ⟨code to evaluate test2⟩
                    (TZE false-label2)
                    ⟨code to evaluate consequent2⟩
                    (TRA end-conditional)
    false-label2    ⟨code to evaluate test3⟩
                    (TNZ false-label3)
                    ⟨code to evaluate consequent3⟩
                    (TRA end-conditional)
    false-label3    ⟨code to evaluate consequent4⟩
    end-conditional … .
As you can see, the code for the final clause does not “test” t; instead it simply goes on to consequent4. A call to not is also avoided by using TNZ instead of TZE. Another clever optimization, not shown here, involves go. When a clause's consequent is a go expression, the condition is inverted (by wrapping the test in null) and one of compact's arguments is abused so that a successful test transfers to the go statement's target.
There remains only one loose end: A special case in compact, using a subroutine named ceq, compiles eq expressions into SUB instructions.