Common lisp best practices for splitting code between files - compilation

I'm moderately new to Common Lisp, but have extensive experience with other "separate compilation" languages (think C/C++/FORTRAN and such).
I know how to do an ASDF system definition. I know how to separate stuff in packages. I'm using SBCL, by the way.
The question is this: what's the best practice for splitting code (large packages) between .lisp files? I mean, in C there are include files, while Lisp works with the current image state. So with multiple files I need to handle dependencies or serial order in the system definition. But without something like forward declarations it's painful.
Simple example of what I want to do: I have, for example, two defstructs that are part of the same bigger data structure (say struct1 is a parent of some set of struct2). Some functions work on one, some work on the other, and some use both.
So I would have: a packages.lisp, a fun1.lisp (with the first defstruct and related functions), a fun2.lisp (with the other defstruct and functions) and a funmix.lisp (with functions that use both). In an ideal world everything is sealed and compiling these in this order would be fine. As most of you know, in practice this almost never happens.
If I need to use struct2 functions from the struct1 ones, I would need to either reorder or add a dependency. But then if there's some kind of back call (that can't be done with a closure), I would have struct1.lisp depending on struct2.lisp and vice versa, which is obviously not valid. So what? I could break the loop by putting the defstructs in a separate file (say, structs.lisp), but what if either struct's functions need to access the common functions in the third file? I would like to avoid style notes.
What's the common way to solve this, i.e. keeping loosely related code in the same file while still being able to interface with the other files? Is the correct solution to seal everything into a compilation unit (a single file)? Or to use a package for every file, with exports?

Lisp dependencies are simple, because in many cases, a Lisp implementation doesn't need to process the definition of something in order to compile its use.
Some exceptions to the rule are:
Macros: macros must be loaded in order to be expanded. There is a compile-time dependency between a file which uses macros and the file which defines them (see the sketch after this list).
Packages: a package foo must be defined in order to use symbols like foo:bar or foo::priv. If foo is defined by a defpackage form in some foo.lisp file, then that file has to be loaded (either in source or compiled form).
Constants: constants defined with defconstant should be seen before their use. Similar remarks apply to inline functions and compiler macros.
Any custom things in a "domain specific language" which enforces definition before use. E.g. if Whizbang Inference Engine needs rules to be defined when uses of the rules are compiled, you have to arrange for that.
For certain diagnostics to be suppressed, like calls to undefined functions, the defining and using files must be treated as a single compilation unit. (See below.)
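For instance, here is a minimal sketch of the macro case (the file names and the with-logging macro are invented for illustration):
;; macros.lisp -- defines a macro
(defmacro with-logging ((tag) &body body)
  `(progn
     (format t "~&[~a] begin~%" ,tag)
     (prog1 (progn ,@body)
       (format t "~&[~a] end~%" ,tag))))

;; user.lisp -- uses the macro; the definition must already be loaded
;; (from source or from a fasl) when this file is compiled
(defun do-work ()
  (with-logging ("work")
    (+ 1 2)))

;; a correct build order:
;; (load (compile-file "macros.lisp"))  ; make WITH-LOGGING known to the compiler
;; (compile-file "user.lisp")           ; now the macro call can be expanded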
All the above remarks also have implications for incremental recompilation.
When there is a dependency like the above between files, so that one is a prerequisite of the other, then whenever the prerequisite is touched, the dependent file must be recompiled.
How to split code into files is going to be influenced by all the usual things: cohesion, coupling and what have you. A Common-Lisp-specific reason to keep certain things together in one file is inlining: a call to a function which is in the same file as the caller may be inlined. If your program supports any in-service upgrade, the granularity of code loading is individual files. If some functions foo and bar should be independently redefinable, don't put them in the same file.
Now about compilation units. Suppose you have a file foo.lisp which defines a function called foo and bar.lisp which calls (foo). If you just compile bar.lisp, you will likely get a warning that an undefined function foo has been called. You could compile foo.lisp first and then load it, and then compile bar.lisp. But that will not work if there is a circular reference between the two: say foo.lisp also calls (bar) which bar.lisp defines.
In Common Lisp, you can defer such warnings to the end of a compilation unit, and what defines a compilation unit isn't a single file, but a dynamic scope established by a macro called with-compilation-unit. Simply put, if we do this:
(with-compilation-unit ()
  (compile-file "foo.lisp")  ;; contains (defun foo () (bar))
  (compile-file "bar.lisp")) ;; contains (defun bar () (foo))
If a compile-file isn't surrounded by with-compilation-unit then there is a compilation unit spanning that file. Otherwise, the outermost nesting of the with-compilation-unit macro determines the scope of what is in the compilation unit.
Warnings about undefined functions (and such) are deferred to the end of the compilation unit. So by putting foo.lisp and bar.lisp compilation into one unit, we suppress the warnings about either foo or bar not being defined and we can compile the two in any order.
Build systems use with-compilation-unit under the hood, as appropriate.
The compilation unit isn't about dependencies but diagnostics. Above, we don't have a compile-time dependency: if we touch foo.lisp, bar.lisp doesn't have to be recompiled, or vice versa.
By and large, Lisp codebases don't have a lot of hard dependencies among the files. Incremental compilation often means that just the affected files that were changed have to be recompiled. The C or C++ problem that everything has to be rebuilt because a core header file was touched is essentially nonexistent.

but what if
No matter how you first organize your code, if you change it significantly you are going to have to refactor. IMO there is no ideal way of grouping dependencies in advance.
As a rule of thumb, it is generally safe to define generic functions first, then types, then the actual methods. For non-generic functions, you can cut circular dependencies by adding forward declarations:
(declaim (ftype function ...))
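For example, here is a sketch with hypothetical names: in struct1.lisp you can declare the signature of a function that is only defined later, in struct2.lisp, and the style note about an undefined function goes away:
;; struct1.lisp -- forward declaration for a function that lives in struct2.lisp
(declaim (ftype (function (t) string) struct2-summary))  ; hypothetical name

(defun struct1-describe (s1)
  ;; no "undefined function" note here: the compiler trusts the FTYPE above
  (format nil "struct1 wrapping ~a" (struct2-summary s1)))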
Having too much circular dependency is a bit of a code smell.
Is the correct solution to seal everything in a compilation unit
Yes, if you group the definitions in the same compilation unit (the same file), the file compiler will be able to silence the style notes until it reaches the end of the file: at that point it knows whether there are still missing references or whether all the cross-references are resolved.
But then if there's some kind of back call (that can't be done with a closure)
If you have a specific example in mind please share, but typically you can define struct1 and its functions in a way that is self-contained; maybe it can accept a map that binds event names to callbacks:
(make-struct-1 :callbacks (list :on-empty one-is-empty
                                :on-full one-is-full))
Similarly, struct2 can accept callbacks too (dependency injection), and the main structure ties them together, possibly using closures.
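A minimal sketch of that idea, with all names invented for illustration:
(defstruct (struct1 (:constructor make-struct-1))
  items
  callbacks)  ; plist mapping event names to functions

(defun struct1-pop (s1)
  (let ((item (pop (struct1-items s1))))
    (when (null (struct1-items s1))
      ;; struct1's code never names a struct2 function directly;
      ;; it only calls whatever was injected under :on-empty
      (let ((callback (getf (struct1-callbacks s1) :on-empty)))
        (when callback (funcall callback s1))))
    item))

;; in the file that knows about both structures:
;; (make-struct-1 :items (list 1 2 3)
;;                :callbacks (list :on-empty
;;                                 (lambda (s1) (refill-from-struct2 s1))))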
Alternatively, you can design your data structures so that they signal conditions, and then in the caller code you intercept them to bind things together.
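Here is a sketch of that alternative, reusing the hypothetical struct1 from above:
(define-condition struct1-empty ()
  ((struct :initarg :struct :reader struct1-empty-struct)))

(defun struct1-pop* (s1)
  (let ((item (pop (struct1-items s1))))
    (when (null (struct1-items s1))
      (signal 'struct1-empty :struct s1))  ; no reference to struct2 here
    item))

;; the caller code that knows both sides binds them together:
;; (handler-bind ((struct1-empty
;;                  (lambda (c)
;;                    (refill-from-struct2 (struct1-empty-struct c)))))
;;   (struct1-pop* s1))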

Related

Confusion in Bjarne's PPP 2nd edition Pg. 316

• The function will be inline; that is, the compiler will try to generate code for the function at each point of call rather than using function-call instructions to use common code. This can be a significant performance advantage for functions, such as month(), that hardly do anything but are used a lot.
• All uses of the class will have to be recompiled whenever we make a change to the body of an inlined function. If the function body is out of the class declaration, recompilation of users is needed only when the class declaration is itself changed. Not recompiling when the body is changed can be a huge advantage in large programs.
• The class definition gets larger. Consequently, it can be harder to find the members among the member function definitions.
All uses of the class will have to be recompiled whenever we make a change to the body of an inlined function. If the function body is out of the class declaration, recompilation of users is needed only when the class declaration is itself changed. Not recompiling when the body is changed can be a huge advantage in large programs.
I don't know what the book is trying to say exactly at this point. What do we mean by "have to be recompiled" and "recompilation is needed only when the class declaration is itself changed"?
I suppose, from the context, that the quoted part discusses the pros & cons of putting member definitions inside the class declaration.
Suppose you have class X. You have to declare it somewhere. In a typical scenario, it will be placed in a header file whose only role will be to hold this declaration. Let's call it x.h.
A class usually has member functions. Now you can choose to either put them inside the header file inside the class declaration or in a separate file (typically: x.cpp).
Solution 1:
// file x.h contains everything
#include <iostream>
class X
{
public:
X() { std::cout << "X() has been hit\n"; }
};
Solution 2:
// file x.h contains only the declaration(s)
class X
{
public:
X();
};
// file x.cpp contains the class member definitions
#include "x.h"
#include <iostream>
X::X() { std::cout << "X() has been hit\n"; }
Whichever solution you use, you surely have some code that uses your class, and typically it is located in a different source file(s), e.g.:
// main.cpp
#include "x.h"
int main()
{
X x;
}
The first thing to notice: the user (here: main.cpp) looks the same whether you choose Solution 1 or 2. This is great. Now, here comes the message Bjarne wants to tell you: consider how changes to the class code will impact the users.
In Solution 1 you've packed everything into the header file. Any change to the class, even one as apparently harmless as adding a new member function, changing the formatting (you know, tabs, spaces, etc.) or adding a comment, will force the compiler to recompile main.cpp. Why? Professional C++ programs are composed of many, many source files, and their compilation is controlled and executed by special utility programs, like cmake, make, and many others. They simply look at the timestamps of the files that make up the program. Any change is a signal to recompile. Header files are never compiled, but all source files (= *.cpp) that include them (even indirectly, via other header files) have to be recompiled. This explains the statement:
All uses of the class will have to be recompiled whenever we make a change to the body of an inlined function.
(just to be sure: all class member functions defined inside the class declaration are considered inline by default). Here, main.cpp is an example of the "uses" mentioned above.
In Solution 2, main.cpp will be recompiled only if x.h has been changed (in any way). If a programmer touches only x.cpp, then main.cpp will not be recompiled, because (a) C++ is designed in such a way as to allow it and (b) professional C++ programs use other programs (mentioned above) that facilitate the efficient compilation of even large C++ programs. To be explicit: they are not compiled using commands like g++ *.cpp that can be found in some introductory C++ textbooks.
One final remark: the inline keyword was introduced essentially to allow Solution 1. Solution 2 is the original C language way. Solution 1 is sometimes used in C++ for better performance (though modern compilers can in many situations do the same job without it) and very often for templates (which are absent in C). Solution 1 is the most common way of programming templates, while Solution 2 is typical for "ordinary" member functions. What Bjarne writes about is extremely important for library designers; I hope now you understand why.

What's the difference between inline and block compilation in SBCL?

Several weeks ago, SBCL 2.0.2 was released and brought the block compilation feature. I have read this article to understand what it is.
I have a question: what's the difference between (declaim (inline some-function)) and block compilation? Is block compilation done automatically by the compiler?
Thanks.
Inline compilation is a specific optimization technique: a function being called is integrated directly into the calling function - usually using its source code - and then compiled.
This means that the function might be inlined not just into one calling function, but into several of them.
Advantage: the overhead of calling a function disappears.
Disadvantage: the code size increases, and the calling function(s) need to be recompiled when the inlined function changes and we want this change to become visible. Macros have the same problem.
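A small sketch (function names invented): the declamation has to be seen before the callers are compiled, and a change to square only reaches sum-of-squares once sum-of-squares itself is recompiled:
(declaim (inline square))
(defun square (x)
  (* x x))

(defun sum-of-squares (a b)
  ;; the compiler may paste SQUARE's code in here, so a later
  ;; redefinition of SQUARE is not picked up until this function
  ;; is recompiled
  (+ (square a) (square b)))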
Block compilation means that a bunch of code gets compiled together with different semantic constraints and that this enables the compiler to do a bunch of new optimizations.
Common Lisp has support in the standard for block compilation of single files: it allows the file compiler to assume that a file is such a block of code.
Example from the Common Lisp standard:
3.2.2.3 Semantic Constraints
A call within a file to a named function that is defined in the same file refers to that function, unless that function has been declared notinline. The consequences are unspecified if functions are redefined individually at run time or multiply defined in the same file.
This allows the compiled code to call a global function directly, without going through the symbol's function cell. Thus late binding is disabled for calls, within this file, to functions defined in this file.
It's not specified how this has to be achieved, but the compiler might just allocate the code somewhere and compile the calls as direct jumps there.
So this part of block compilation is defined in the standard, and some compilers do it.
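As a sketch of what that semantic constraint permits (this is my reading of the standard; the :block-compile keyword is an SBCL 2.0.2+ extension, not standard Common Lisp):
;; pair.lisp
(defun helper (x) (* x 2))
(defun driver (x) (helper x))

;; Compiled as a block, the call to HELPER inside DRIVER may become a
;; direct jump that bypasses the symbol's function cell, so redefining
;; HELPER at run time has unspecified consequences for DRIVER.
;; In SBCL this can be requested per file, e.g.:
;; (compile-file "pair.lisp" :block-compile t)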
Block compilation for multiple files
If the file compiler can use block compilation for one file, then what about multiple files? A few compilers can also tell the file compiler that several files form a block for compilation. CMUCL does that. SBCL was derived and simplified from CMUCL and still lacks this. I think Lucid Common Lisp (which is no longer actively sold) supported something like that, too.
Might be useful to add this to SBCL, too.

Will Go compilers ignore unused functions?

If there is a function from an external package that is not used at all in my project, will the compiler remove the function from the generated machine code?
This question could be asked of any language compiler in general, but I think the behaviour may vary from language to language, so I am interested in knowing what Go compilers do.
I would appreciate any help on understanding this.
The language spec does not mention this anywhere, and from a correctness point of view this is irrelevant.
But know that the current version does remove certain constructs that the compiler can prove are not used and will not change the runtime behaviour of the app.
Quoting from The Go Blog: Smaller Go 1.7 binaries:
The second change is method pruning. Until 1.6, all methods on all used types were kept, even if some of the methods were never called. This is because they might be called through an interface, or called dynamically using the reflect package. Now the compiler discards any unexported methods that do not match an interface. Similarly the linker can discard other exported methods, those that are only accessible through reflection, if the corresponding reflection features are not used anywhere in the program. That change shrinks binaries by 5–20%.
Methods are a "harder" case than functions because methods can be listed and called via reflection (unlike functions), but the Go tools do what they can to remove unused methods too.
You can see examples and proof of removed / unlinked code in this answer:
How to remove unused code at compile time?
Also see other relevant questions:
Splitting client/server code
Call all functions with special prefix or suffix in Golang

How many times does a Common Lisp compiler recompile?

While not all Common Lisp implementations do compilation to machine code, some of them do, including SBCL and CCL.
In C/C++, if the source files don't change, the binary output of a C/C++ compiler will also not change, assuming the underlying system remains the same.
In a Common Lisp compiler, the compilation is not under the user's direct control, unlike in C/C++. My question is: if the Lisp source files haven't changed, under what circumstances will a CL compiler compile the code more than once, and why? If possible, a simple illustrative example would be helpful.
I think that the question is based on some misconceptions. The compiler doesn't compile files, and it's not something that the user has no control over. The compiler is quite readily available through the compile function. The compiler operates on code, not on files. E.g., you can type at the REPL
CL-USER> (compile nil (list 'lambda (list 'x) (list '+ 'x 'x)))
#<FUNCTION (LAMBDA (X)) {100460E24B}>
NIL
NIL
There's no file involved at all. However, there is also a compile-file function, but notice that its description is:
compile-file transforms the contents of the file specified by input-file into implementation-dependent binary data which are placed in the file specified by output-file.
The contents of the file are compiled. Then that compiled file can be loaded. (You can also load uncompiled source files, too.) I think your question might boil down to asking under what circumstances would compile-file generate a file with different contents. I think that's really implementation dependent, and it's not really predictable. I don't know that your characterization of compilers for other languages necessarily holds either:
In C/C++, if the source files don't change, the binary output of a C/C++ compiler will also not change, assuming the underlying system remains the same.
What if the compiler happens to include a timestamp into the output in some data segment? Then you'd get different binary output every time. It's true that some common scripted compilation/build systems (e.g., make and similar) will check whether previous output can be reused based on whether the input files have changed in the meantime. That doesn't really say what the compiler does, though.
The rules are pretty much the same, but in Common Lisp it's not the practice to separate declarations from implementations, so usually you must recompile every dependency to be sure. This is a practical consequence shared by dynamic environments.
Imagining there were such a separation in place, the following are blatant examples (clearly not exhaustive) of changes that require recompiling specific dependent files, as the output may be different:
A changed package definition
A changed macro character or a change in its code
A changed macro
Adding or removing an inline or notinline declaration
A change in a global type or function type declaration
A changed function used in #., defvar, defparameter, defconstant, load-time-value, eql specializer, make-load-form generated code, defmacro et al (e.g. setf expanders)...
A change in the Lisp compiler, or in the base image
I mean, you can see it's not trivial to determine which files need to be recompiled. Sometimes the answer is "all subsequent files", e.g. after changing the " (double-quote) macro character, which might affect every literal string, or when the compiler has evolved in a non-backwards-compatible way. In essence, we end where we started: you can only be sure with a full recompile that does not reuse fasls across compilations. And sometimes that is faster than determining the minimum set of files that need to be recompiled.
In practice, you end up compiling single definitions a lot during development (e.g. with SLIME) and not recompiling files when there's a fasl as recent as or newer than the source file. Many times you reuse files from e.g. Quicklisp. But for testing and deployment, I advise clearing all fasls and recompiling everything.
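For instance, with ASDF you can force such a from-scratch rebuild (the system name is hypothetical):
(asdf:load-system "my-system" :force t)     ; recompile and reload this system
(asdf:load-system "my-system" :force :all)  ; also recompile all of its dependencies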
There have been efforts to automate minimum-dependency compilation with SBCL, but I think it's too slow when you change the interim projects more often than not (it involves a lot of forking, so on Windows it's either infeasible or very slow). However, it may be a time saver for base libraries that rarely change, if at all.
Another approach is to make custom base images with base libraries built-in, i.e. those you always load. It'll save both compilation and load times.

Clean way to separate functions/subroutine declaration from definition in Fortran 90

I am working on a big Fortran 90 code base with a lot of modules. What bothers me is that when I modify the inner code of a function inside a module (without changing its interface), my Makefile (whose dependencies are based on "use") recompiles every file that uses that modified module, and so on recursively.
But when modifying the inner code of a function without touching its input/output, recompiling other files than the modified one is useless, no?
So I would like to separate the function declaration from their definition, like with the .h files in C or C++. What is the clean way to do this? Do I have to use Fortran include/preprocessor #include, or is there a "module/use" way of doing this?
I have tried something like this, but it seems to be rather nonsensical...
main.f90
program prog
use foomod_header
integer :: i
bar=0
i=42
call foosub(i)
end program prog
foomod_header.f90
module foomod_header
integer :: bar
interface
subroutine foosub(i)
integer :: i
end subroutine
end interface
end module foomod_header
foomod.f90
module foomod
use foomod_header
contains
subroutine foosub(i)
integer ::i
print *,i+bar
end subroutine foosub
end module foomod
If submodules aren't an option (and they are ideal for this), then what you can do is make the procedure an external procedure and provide an interface for that procedure in a module. For example:
! Program.f90
PROGRAM p
USE Interfaces
IMPLICIT NONE
...
CALL SomeProcedure(xyz)
END PROGRAM p
! Interfaces.f90
MODULE Interfaces
IMPLICIT NONE
INTERFACE
SUBROUTINE SomeProcedure(some_arg)
USE SomeOtherModule
IMPLICIT NONE
TYPE(SomeType) :: some_arg
END SUBROUTINE SomeProcedure
END INTERFACE
END MODULE Interfaces
! SomeProcedure.f90
SUBROUTINE SomeProcedure(some_arg)
USE SomeOtherModule
IMPLICIT NONE
TYPE(SomeType) :: some_arg
...
END SUBROUTINE SomeProcedure
Some important notes:
There must only ever be one interface definition for a procedure accessible in a scope. Inside a subprogram the interface for the procedure defined by the subprogram is also considered defined - hence inside the subprogram you must not permit an interface block for procedures defined by the subprogram to be accessible. In terms of the example, this means that you must not have a USE Interfaces statement without an only clause inside the SomeProcedure external procedure.
If you do change the arguments or similar of the procedure inside SomeProcedure.f90 you had better make sure that you change the corresponding interface block inside the module!
If you can use F2003, the IMPORT statement can make life easier. Otherwise you might have to have additional modules (such as SomeOtherModule in the example) to share type definitions and the like between the Interfaces module and the external procedure.
If you have private entities or components relevant to the procedure then Fortran's rules on entity and component accessibility may prevent you from using this approach.
Typically some sort of whole program analysis is done at high levels of optimization. That analysis is typically much slower than the actual parsing of the code - splitting out procedures in this manner may not actually shorten build times significantly under these conditions.
Maybe the cleanest solution is to change the build system.
The real dependency introduced by a USE statement is not the source-code file, but the generated .mod file, which acts as a sort of "binary header file". I.e. where makefiles typically contain something like
MyProgram.o: MyModule.f90
what they really should contain is
MyProgram.o: MyModule.mod
MyModule.mod: MyModule.f90
with the creation of the .mod file being done in a way that ensures an unchanged file-system timestamp if the interface hasn't actually changed.
Sadly, compiler support is awkward. Most compilers will overwrite the .mod file anyway, so the build process must detect that the .mod file hasn't changed (e.g. by restoring the old modification time if the contents are unchanged), but at the same time needs to avoid recompiling the source file unnecessarily, which requires updating the modification time of the .mod file.
Additionally, some compilers (Intel, *cough*) add a binary timestamp to the contents of the .mod files, which needs to be manually excluded from the comparison and whose binary position has changed across releases. This adds effort when supporting multiple compilers.
