The terms "unit" and "library unit" are used in many places on the web site, but I failed to find documentation or even definitions of these terms. The only description that I found is in "User's Manual/Supported language/Declarations/(unit|uses)". Also there is an example in "User's Manual/Using the compiler/An example with multiple files". As you can see, very scarce.
If I ever get a response, the next question is how are "units" related to modules described in "User's Manual/Supported language/Modules"? I suppose that "units" somehow relate to compilation, while modules relate to Scheme value names.
Unit is short for "unit of compilation", which is basically a compiled library. If you look at the source code for CHICKEN, you'll notice that each unit from the manual corresponds (roughly) to a source file. Each source file is compiled separately into a .o file, and these units are all linked together into libchicken.so/libchicken.a.
This terminology is not very relevant anymore, except when you're linking statically. Then you need (declare (uses ...)), which refers to the unit name. This is needed, because the toplevel of the particular unit needs to run before the toplevels that depend upon it, so that any definitions are loaded.
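For illustration, a minimal sketch along the lines of the manual's multi-file example (the unit and file names are made up, and the exact csc invocations may vary between CHICKEN versions):

;; hello.scm -- this file becomes the unit "hello"
(declare (unit hello))
(define (say-hello) (print "Hello!"))

;; main.scm -- declares that it uses the unit, so hello's toplevel runs first
(declare (uses hello))
(say-hello)

Each file is compiled separately (e.g. csc -c hello.scm and csc -c main.scm) and the resulting object files are linked together (csc hello.o main.o -o main).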
In modern code you'll typically use only modules, but that means your code won't be statically linkable. We know this is confusing, which is why we're attempting to make static linking with modules easier with CHICKEN 5, and reducing the need to know about units.
Related
I'm moderately new to Common Lisp, but I have extensive experience with other 'separate compilation' languages (think C/C++/FORTRAN and such).
I know how to do an ASDF system definition. I know how to separate stuff in packages. I'm using SBCL, by the way.
The question is this: what's the best practice for splitting code (large packages) between .lisp files? I mean, in C there are include files, while Lisp lives with the current image state. So with multiple files I need to handle dependencies, or serial order, in the system definition. But without something like forward declarations it's painful.
A simple example of what I want to do: I have two defstructs that are part of the same bigger data structure (say, struct1 is a parent of some set of struct2s). Some functions work on one, some on the other, and some use both.
So I would have: a packages.lisp, a fun1.lisp (with the first defstruct and related functions), a fun2.lisp (with the other defstruct and functions) and a funmix.lisp (with functions that use both). In an ideal world everything would be sealed, and compiling these in this order would be fine. As most of you know, in practice this almost never happens.
If I need to use struct2 functions from the struct1 ones, I would need to either reorder or add a dependency. But then, if there's some kind of back call (that can't be done with a closure), I would have fun1.lisp depending on fun2.lisp and vice versa, which is obviously not valid. So what? I could break the loop by putting the defstructs in a separate file (say, structs.lisp), but what if either struct's functions need to access the common functions in the third file? I would like to avoid style notes.
What's the common way to solve this, i.e. to keep loosely related code in the same file but still be able to interface with the other ones? Is the correct solution to seal everything in a compilation unit (a single file)? To use a package for every file, with exports?
Lisp dependencies are simple, because in many cases, a Lisp implementation doesn't need to process the definition of something in order to compile its use.
Some exceptions to the rule are:
Macros: macros must be loaded in order to be expanded. There is a compile-time dependency between a file which uses macros and the file which defines them (see the sketch after this list).
Packages: a package foo must be defined in order to use symbols like foo:bar or foo::priv. If foo is defined by a defpackage form in some foo.lisp file, then that file has to be loaded (either in source or compiled form).
Constants: constants defined with defconstant should be seen before their use. Similar remarks apply to inline functions and compiler macros.
Any custom things in a "domain specific language" which enforces definition before use. E.g. if Whizbang Inference Engine needs rules to be defined when uses of the rules are compiled, you have to arrange for that.
For certain diagnostics to be suppressed, like calls to undefined functions, the defining and using files must be treated as a single compilation unit. (See below.)
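As a small illustration of the macro point above (the file and macro names here are made up): the file defining the macro has to be loaded before the file using it is compiled, because expansion happens at compile time.

;; macros.lisp -- must be LOADED before user.lisp is COMPILED
(defmacro with-logging ((tag) &body body)
  `(progn
     (format t "~&[~a] begin~%" ,tag)
     ,@body))

;; user.lisp -- compiling this expands WITH-LOGGING, so the macro
;; definition above is a compile-time dependency of this file
(defun do-work ()
  (with-logging ("work")
    (+ 1 2)))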
All the above remarks also have implications for incremental recompilation.
When there is a dependency like the above between files, so that one is a prerequisite of the other, the dependent one must be recompiled whenever the prerequisite is touched.
How to split code into files is going to be influenced by all the usual things: cohesion, coupling and what have you. A Common-Lisp-specific reason to keep certain things together in one file is inlining: a call to a function which is in the same file as the caller may be inlined. If your program supports any in-service upgrade, the granularity of code loading is individual files. If some functions foo and bar should be independently redefinable, don't put them in the same file.
Now about compilation units. Suppose you have a file foo.lisp which defines a function called foo and bar.lisp which calls (foo). If you just compile bar.lisp, you will likely get a warning that an undefined function foo has been called. You could compile foo.lisp first and then load it, and then compile bar.lisp. But that will not work if there is a circular reference between the two: say foo.lisp also calls (bar) which bar.lisp defines.
In Common Lisp, you can defer such warnings to the end of a compilation unit, and what defines a compilation unit isn't a single file, but a dynamic scope established by a macro called with-compilation-unit. Simply put, if we do this:
(with-compilation-unit ()
  (compile-file "foo.lisp")  ;; contains (defun foo () (bar))
  (compile-file "bar.lisp")) ;; contains (defun bar () (foo))
If a compile-file isn't surrounded by with-compilation-unit then there is a compilation unit spanning that file. Otherwise, the outermost nesting of the with-compilation-unit macro determines the scope of what is in the compilation unit.
Warnings about undefined functions (and such) are deferred to the end of the compilation unit. So by putting foo.lisp and bar.lisp compilation into one unit, we suppress the warnings about either foo or bar not being defined and we can compile the two in any order.
Build systems use with-compilation-unit under the hood, as appropriate.
The compilation unit isn't about dependencies but diagnostics. Above, we don't have a compile time dependency. If we touch foo.lisp, bar.lisp doesn't have to be recompiled or vice versa.
By and large, Lisp codebases don't have a lot of hard dependencies among the files. Incremental compilation often means that just the affected files that were changed have to be recompiled. The C or C++ problem that everything has to be rebuilt because a core header file was touched is essentially nonexistent.
but what if
No matter how you first organize your code, if you change it significantly you are going to have to refactor. IMO there is no ideal way of grouping dependencies in advance.
As a rule of thumb it is generally safe to define generic functions first, then types, then actual methods, for example. For non-generic functions, you can cut circular dependencies by adding forward declarations:
(declaim (ftype function ...))
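For instance, a hedged sketch with made-up names foo and bar: declaiming bar's function type up front lets calls to it compile cleanly before (or without) its definition being seen.

;; forward declaration: BAR is defined later, possibly in another file
(declaim (ftype (function (integer) integer) bar))

(defun foo (x)
  (bar (1+ x)))  ; typically no undefined-function note, thanks to the declaim

(defun bar (n)   ; the real definition can come afterwards
  (* n n))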
Having too much circular dependency is a bit of a code smell.
Is the correct solution to seal everything in a compilation unit
Yes, if you group the definitions in the same compilation unit (the same file), the file compiler will be able to silence the style notes until it reaches the end of the file: at that point it knows whether there are still missing references or whether all the cross-references are resolved.
But then if there's some kind of back call (that can't be done with a closure)
If you have a specific example in mind please share, but typically you can define struct1 and its functions in a way that can be self-contained; maybe it can accept a map that binds event names to callbacks:
(make-struct-1 :callbacks (list :on-empty #'one-is-empty
                                :on-full  #'one-is-full))
Similarly, struct2 can accept callbacks too (Dependency Injection) and the main struct ties them using closures (?).
Alternatively, you can design your data structures so that they signal conditions, and then in the caller code you intercept them to bind things together.
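For instance, a rough sketch of the condition-based variant (buffer-full, note-full, handle-overflow and *some-struct2* are all made-up names): the struct2 code only signals, and the glue lives in the caller.

;; in fun2.lisp: struct2's code signals instead of calling struct1 directly
(define-condition buffer-full ()
  ((buffer :initarg :buffer :reader buffer-full-buffer)))

(defun note-full (struct2)
  (signal 'buffer-full :buffer struct2))

;; in funmix.lisp: the caller ties the two sides together
(handler-bind ((buffer-full (lambda (c)
                              (handle-overflow (buffer-full-buffer c)))))
  (note-full *some-struct2*))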
A bit of context first.
I'm currently working on a modular (embedded) microOS with drivers, ports, partitions and other such objects represented as structures, with distinct "operations" structures containing pointers to what you would call their methods. Nothing fancy there.
I have macros and linker script bits to make it so that all objects of a given type (say, all drivers), though their definitions are scattered across source files, are laid out as in an array somewhere in flash, but in a way that lets the linker (I work with GNU GCC/LD) garbage-collect those that aren't explicitly referenced in the code.
However, after a few years refining the system and increasing its flexibility, I have come to a point where it is too flash-greedy for small to medium microcontrollers. (I work only with 32-bit architectures, nothing too small.) It was to be expected, you might say, but I'm trying to go further and do better, and currently I doubt LD will let me do it.
What I would like to get is that methods/functions which aren't used by the code get garbage collected too. This isn't currently the case, since they are all referred to by the pointer structures of the objects owning them. I would like to avoid using macro configuration switches, and the application developer having to go through several layers of code to determine which functions are currently used and which ones can be safely disabled. I would really, really like the linker's garbage collection to let me automate the whole process.
First I thought I could split each methods structure between sections, so that, say, all the ports' "probe" methods end up in a section called .struct_probe, and the wrapping port_probe() function could reference a zero-length object inside that section, so that all the ports' probe references get linked if and only if the port_probe() function is called somewhere.
But I was wrong in that, for the linker, the "input sections" which are its granularity for garbage collection (since at this point there's no more alignment information inside, and it can't afford to take advantage of a symbol - and its associated object - being removed by reordering the insides of the containing section and shrinking it) aren't identified solely by a section name, but by a section name and a source file. So if I implement what I intended to, none of my methods will get linked into the final executable, and I will be toast.
That's where I am currently, and frankly I'm quite at a loss. I wondered if someone here might have a better idea: either having each method "backward reference" the wrapping function, or some other object which would in turn be referenced by the function and drag all the methods along; or, as the title says, somehow grouping those methods/sections (without gathering all the code in a single file, please) so that referencing one means linking them all.
My gratitude for all eternity is on the line, here. ;)
Since I have spent some time documenting and experimenting with the following lead, even if without success, I would like to share here what I found.
There's a feature of ELF called "group sections", which are used to define section groups, i.e. groups of sections such that if one member section of the group is live (hence linked), all of them are.
I hoped this was the answer to my question. TL;DR: It wasn't, because group sections are meant to group sections inside a module. Actually, the only type of groups currently defined is COMDAT groups, which are by definition exclusive from groups with the same name defined in other modules.
Documentation on that feature and its implementation(s) is scarce to say the least. Currently, the standard's definition of group sections can be found here.
GCC doesn't provide a construct to manipulate section groups (or any kind of section flags/properties, for that matter). The GNU assembler's documentation specifies how to assign a section to a group here.
I've found no mention in any GNU document of LD's handling of group sections. It is mentioned here and here, though.
As a bonus, I've found a way to specify section properties (including grouping) in C code with GCC. This is a dirty hack, so it may well not work anymore by the time you read this.
Apparently, when you write
int bar __attribute__((section("<name>")));
GCC takes what's between the quotes and blindly pastes it as-is into the assembly output:
.section <name>,"aw"
(The actual flags can differ if the name matches one of a few predefined patterns.)
From there, it's all a matter of code injection. If you write
int bar __attribute__((section("<name>,\"awG\",%probbits,<group> //")));
you get
.section <name>,"awG",%progbits,<group> //"aw"
and the job is done. If you wonder why simply declaring the section's attributes in a separate inline assembly statement isn't enough: if you do that, you get an empty grouped section and a populated solitary section with the same name, which won't have any effect at link time. So.
This isn't entirely satisfying, but for lack of a better way, that's what I went for:
It seems the only way to effectively merge sections from several compilation units, from the linker's point of view, is to first link the resulting objects together into one big object, then link the final program using that big object instead of the small ones. Sections that had the same name in the small objects will be merged in the big one.
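As a rough sketch of that two-step link with the GNU toolchain (the file names and the arm-none-eabi- prefix are just examples): ld -r performs a relocatable (partial) link, and in its output same-named input sections from the small objects end up merged.

arm-none-eabi-gcc -c -o drv_uart.o drv_uart.c
arm-none-eabi-gcc -c -o drv_spi.o  drv_spi.c

# relocatable (partial) link: same-named sections are merged into one big object
arm-none-eabi-ld -r -o drivers.o drv_uart.o drv_spi.o

# final link uses the big object; --gc-sections still discards unreferenced sections
arm-none-eabi-gcc -o firmware.elf main.o drivers.o -Wl,--gc-sections -T board.ld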
This is a bit dirty, though, and it also has some drawbacks: it may merge some sections you wouldn't want merged, for garbage collection purposes, and it hides which file each section comes from (though the information remains in the debug sections) if, say, you wanted to split the main sections (.text, .data, .bss...) in the final ELF so as to see how much each file contributes to flash and RAM usage.
While not all Common Lisp implementations do compilation to machine code, some of them do, including SBCL and CCL.
In C/C++, if the source files don't change, the binary output of a C/C++ compiler will also not change, assuming the underlying system remains the same.
In a Common Lisp compiler, the compilation is not under the user's direct control, unlike C/C++. My question is: if the Lisp source files haven't changed, under what circumstances will a CL compiler compile the code more than once, and why? If possible, a simple illustrative example would be helpful.
I think that the question is based on some misconceptions. The compiler doesn't compile files, and it's not something that the user has no control over. The compiler is quite readily available through the compile function. The compiler operates on code, not on files. E.g., you can type at the REPL
CL-USER> (compile nil (list 'lambda (list 'x) (list '+ 'x 'x)))
#<FUNCTION (LAMBDA (X)) {100460E24B}>
NIL
NIL
There's no file involved at all. However, there is also a compile-file function, but notice that its description is:
compile-file transforms the contents of the file specified by
input-file into implementation-dependent binary data which are placed
in the file specified by output-file.
The contents of the file are compiled. Then that compiled file can be loaded. (You can also load uncompiled source files, too.) I think your question might boil down to asking under what circumstances would compile-file generate a file with different contents. I think that's really implementation dependent, and it's not really predictable. I don't know that your characterization of compilers for other languages necessarily holds either:
In C/C++, if the source files don't change, the binary output of a
C/C++ compiler will also not change, assuming the underlying system
remains the same.
What if the compiler happens to include a timestamp into the output in some data segment? Then you'd get different binary output every time. It's true that some common scripted compilation/build systems (e.g., make and similar) will check whether previous output can be reused based on whether the input files have changed in the meantime. That doesn't really say what the compiler does, though.
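Coming back to compile-file and load, here is a minimal file-based sketch (foo.lisp is a made-up file name):

;; compile the source file to a fasl, then load the compiled file;
;; COMPILE-FILE's primary return value is the output pathname
(load (compile-file "foo.lisp"))

;; loading the source directly, without compiling to a file, also works
(load "foo.lisp")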
The rules are pretty much the same, but in Common Lisp it's not common practice to separate declarations from implementation, so usually you must recompile every dependency to be sure. This is a shared practical consequence of dynamic environments.
Imagining there were such a separation in place, the following are blatant examples (clearly not exhaustive) of changes that require recompiling specific dependent files, as the output may be different:
A changed package definition
A changed macro character or a change in its code
A changed macro
Adding or removing an inline or notinline declaration
A change in a global type or function type declaration
A changed function used in #., defvar, defparameter, defconstant, load-time-value, eql specializers, make-load-form generated code, defmacro et al. (e.g. setf expanders)... (see the sketch after this list)
A change in the Lisp compiler, or in the base image
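As one concrete illustration of the #. case (default-buffer-size and the file names are made up): the dependent fasl bakes in a value computed at read time, so changing the helper function requires recompiling the dependent file.

;; config.lisp
(defun default-buffer-size () 4096)

;; app.lisp -- #. evaluates at read time, so the compiled file
;; contains the literal 4096, not a call to DEFAULT-BUFFER-SIZE
(defvar *buffer* (make-array #.(default-buffer-size)))

If default-buffer-size is later changed to return 8192, the existing fasl for app.lisp keeps allocating 4096 elements until app.lisp is recompiled.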
I mean, you can see it's not trivial to determine which files need to be recompiled. Sometimes the answer is "all subsequent files", e.g. if the " (double-quote) macro character was changed, which might affect every literal string, or the compiler evolved in a non-backwards-compatible way. In essence, we end where we started: you can only be sure with a full recompile, not reusing fasls across compilations. And sometimes that's faster than determining the minimum set of files that need to be recompiled.
In practice, you end up compiling single definitions a lot in development (e.g. with SLIME), and not recompiling files when there's a fasl at least as new as the source file. Many times you reuse files from e.g. Quicklisp. But for testing and deployment, I advise clearing all fasls and recompiling everything.
There have been efforts to automate minimum-dependency compilation with SBCL, but I think it's too slow when you change the intermediate projects more often than not (it involves a lot of forking, so on Windows it's either infeasible or very slow). However, it may be a time saver for base libraries that rarely change, if at all.
Another approach is to make custom base images with base libraries built-in, i.e. those you always load. It'll save both compilation and load times.
I have some questions regarding dynamic initialization (i.e. constructors running before main) and DLL link ordering - for both Windows and POSIX.
To make it easier to talk about, I'll define a couple terms:
Load-Time Libraries: libraries which have been "linked" at compile time such that, when the system loads my application, they get loaded in automatically (i.e. the ones put in CMake's target_link_libraries command).
Run-Time Libraries: libraries which I load manually by dlopen or equivalents. For the purposes of this discussion, I'll say that I only ever manually load libraries using dlopen in main, so this should simplify things.
Dynamic Initialization: if you're not familiar with the C++ spec's definition of this, please don't try to answer this question.
Ok, so let's say I have an application (MyAwesomeApp) and it links against a dynamic library (MyLib1), which in turn links against another library (MyLib2). So the dependency tree is:
MyAwesomeApp -> MyLib1 -> MyLib2
For this example, let's say MyLib1 and MyLib2 are both Load-Time Libraries.
What's the initialization order of the above? It is obvious that all static initialization, including linking of exported/imported functions (Windows-only), will occur first... But what happens with Dynamic Initialization? I'd expect the overall ordering:
ALL import/export symbol linking
ALL Static Initialization
ALL of MyLib2's Dynamic Initialization
ALL of MyLib1's Dynamic Initialization
ALL of MyAwesomeApp's Dynamic Initialization
MyAwesomeApp's main() function
But I can't find anything in the specs that mandates this. I DID see something with ELF that hinted at it, but I need to find guarantees in the specs before I can do something I'm trying to do.
Just to make sure my thinking is clear, I'd expect that library loading works very similarly to import in Python in that, if a library hasn't been loaded yet, it'll be loaded fully (including any initialization) before I do anything... and if it has already been loaded, then I'll just link to it.
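To make the scenario concrete, here is a tiny sketch of what I mean by dynamic initializers in each image (file and type names are made up):

// mylib2.cpp (built into MyLib2)
#include <cstdio>
struct Lib2Init { Lib2Init() { std::puts("MyLib2 dynamic init"); } };
static Lib2Init lib2_init;   // dynamic initialization: runs when MyLib2 is initialized

// mylib1.cpp (built into MyLib1, which links against MyLib2)
#include <cstdio>
struct Lib1Init { Lib1Init() { std::puts("MyLib1 dynamic init"); } };
static Lib1Init lib1_init;

// main.cpp (MyAwesomeApp)
#include <cstdio>
struct AppInit { AppInit() { std::puts("app dynamic init"); } };
static AppInit app_init;
int main() { std::puts("main"); }

With the ordering I expect, this would print MyLib2's line, then MyLib1's, then the app's, then "main" - that's the guarantee (or lack of one) I'm asking about.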
To give a more complex example to make sure there isn't another definition of my first example that yields a different response:
MyAwesomeApp depends on MyLib1 & MyLib2
MyLib1 depends on MyLib2
I'd expect the following initialization:
ALL import/export symbol linking
ALL Static Initialization
ALL of MyLib2's Dynamic Initialization
ALL of MyLib1's Dynamic Initialization
ALL of MyAwesomeApp's Dynamic Initialization
MyAwesomeApp's main() function
I'd love any help pointing out specs that say this is how it is. Or, if this is wrong, any spec saying what REALLY happens!
Thanks in advance!
-Christopher
Nothing in the C++ standard mandates how dynamic linking works.
Having said that, Visual Studio ships with the C Runtime (aka CRT) source, and you can see where static initializers get run in dllcrt0.c.
You can also derive the relative ordering of operations if you think about what constraints need to be satisfied to run each stage:
Import/export resolution just needs .dlls.
Static initialization just needs .dlls.
Dynamic initialization requires all imports to be resolved for the .dll.
Steps 1 & 2 do not depend on each other, so they can happen independently. Step 3 requires 1 & 2 for each .dll, so it has to happen after both.
So any specific loading order that satisfies the constraints above will be a valid loading order.
In other words, if you need to care about the specific ordering of specific steps, you probably are doing something dangerous that relies on implementation specific details that will not be preserved across major or minor revisions of the OS. For example, the way the loader lock works for .dlls has changed significantly over the various releases of Windows.
I'm trying to understand the high-level structure of an LLVM program.
I read in the book that "programs are composed of modules, each of which corresponds to a translation unit". Can someone explain the above to me in more detail, and what is the difference between modules and translation units (if any)?
I am also interested to know which part of the code is called when a translation unit starts and completes debugging-information encoding.
Translation unit is a term from the language standard. For example, this is from C (the C99 ISO draft):
5.1 Conceptual models; 5.1.1 Translation environment; 5.1.1.1 Program structure
A C program need not all be translated at the same time. The text of the program is kept
in units called source files, (or preprocessing files) in this International Standard. A
source file together with all the headers and source files included via the preprocessing
directive #include is known as a preprocessing translation unit. After preprocessing, a
preprocessing translation unit is called a translation unit.
So, a translation unit is a single source file (file.c) after preprocessing (all #included *.h files have been inserted, all macros expanded, all comments stripped, and the file is ready for tokenizing).
A translation unit is a unit of compilation, because it doesn't depend on any external resource until the linking step. All headers are contained within the TU.
The term module is not defined in the language standard, but AFAIK it refers to a translation unit at the deeper translation phases.
LLVM describes it as: http://llvm.org/docs/ProgrammersManual.html
The Module class represents the top level structure present in LLVM programs. An LLVM module is effectively either a translation unit of the original program or a combination of several translation units merged by the linker.
The Module class keeps track of a list of Functions, a list of GlobalVariables, and a SymbolTable. Additionally, it contains a few helpful member functions that try to make common operations easy.
About this part of your question:
I am also interested to know which part of the code is called when translation unit starts and completes debugging information encoding?
This depends on how LLVM is used. LLVM itself is a library and can be used in various ways.
For clang/LLVM (a C/C++ compiler built on libclang and LLVM), the translation unit is created after the preprocessing stage. It is then parsed into an AST, translated into LLVM IR, and stored in a Module.
For a tutorial example, here is the creation of Modules: http://llvm.org/releases/2.6/docs/tutorial/JITTutorial1.html
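As a complement (this uses the current C++ API rather than the 2.6-era tutorial linked above, and the module name is made up), creating an empty Module, i.e. the container that stands for one translation unit's worth of IR, looks roughly like this:

#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

int main() {
  llvm::LLVMContext context;
  // a Module is the top-level container: roughly one translation unit
  // (or several merged ones) worth of functions and globals
  auto module = std::make_unique<llvm::Module>("my_tu", context);
  module->print(llvm::outs(), nullptr);  // emit the (empty) module as textual IR
  return 0;
}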