Is it possible to have a hotfix at runtime with executable memory heaps and a distributed system? - runtime

I've been looking over a few tutorials for JIT and allocating heaps of executable memory at runtime. This is mainly a conceptual question, so please correct me if I got something wrong.
If I understand it correctly, a JIT takes advantage of a runtime interpreter/compiler that outputs native or executable code and, if native binary, places it in an executable code heap in memory, which is OS-specific (e.g. VirtualAlloc() for Windows, mmap() for Linux).
Additionally, some languages like Erlang can have a distributed system such that each part is separated from each other, meaning that if one fails, the others can account for such a case in a modular way, meaning that modules can also be switched in and out at will if managed correctly without disturbing overall execution.
With a runtime compiler or some sort of code delivery mechanism, wouldn't it be feasible to load code at runtime arbitrarily to replace modules of code that could be updated?
Example
Say I have a sort(T, T) function that operates on T or T. Now, suppose I have a merge_sort(T,T) function that I have loaded at runtime. If I implement a sort of ledger or register system such that users of the first sort(T,T) can reassign themselves to use the new merge_sort(T,T) function and detect when all users have adjusted themselves, I should then be able to deallocate and delete sort(T,T) from memory.
This basically sounds a lot like a JIT, but the attractive part, to me, was the aspect where you can swap out code arbitrarily at runtime for modules. That way, while a system is not under a full load such that each module is being used, modules could be automated to switch to new code, if needed, and etc. Theoretically, wouldn't this be a way to implement patches such that a user who uses a program should never have to "restart" the program if the program can swap out code silently in the individual modules? I'd imagine much larger distributed systems can make use of this, but what about smaller ones?

Additionally, some languages like Erlang can have a distributed system
such that each part is separated from each other, meaning that if one
fails, the others can account for such a case in a modular way,
meaning that modules can also be switched in and out at will if
managed correctly without disturbing overall execution.
You're describing how to make a fault-tolerant system which is entirely different from replacing code at run-time (known at Dynamic Software Update or DSU). Indeed, in Erlang, you can have one process monitoring other processes and if one fails, it will migrate the work to another process to keep the system running as expected. Note that DSU is not used to implement fault-tolerance. They are different features with different purposes.
Say I have a sort(T, T) function that operates on T or T. Now, suppose
I have a merge_sort(T,T) function that I have loaded at runtime. If I
implement a sort of ledger or register system such that users of the
first sort(T,T) can reassign themselves to use the new merge_sort(T,T)
function and detect when all users have adjusted themselves, I should
then be able to deallocate and delete sort(T,T) from memory.
This is called DSU and is used to be able to do any of the following tasks without the need to take the system down:
Fix one or more bugs in a piece of code.
Patch security holes.
Employ a more efficient code.
Deploy new features.
Therefore, any app or system can use DSU so that it can perform these tasks without requiring a restart.
Erlang enables you to perform DSU in addition to facilitating fault-tolerance as discussed above. For more information, refer to this Erlang write paper.
There are numerous ways to implement DSU. Since you're interested in JIT compilers and assuming that by "JIT compiler" you mean the component that not only compiles IL code but also allocates executable memory and patches function calls with binary code addresses, I'll discuss how to implement DSU in JIT environments. The JIT compiler has to support the following two features:
The ability to obtain or create new binary code at run-time. If you have IL code, no need to allocate executable memory yet since it has to be compiled.
The ability to replace a piece of IL code (which might have already been JITted) or binary code with the new piece of code.
Clearly, with these two features, you can perform DSU on a single function. Swapping a whole module or library requires swapping all the functions and global variables exported by that module.

Related

Hack the JVM to avoid unnecessary bounds checks and casts

There are some languages that support a sufficiently powerful type system that they can prove at compile time that the code does not address an array outside its bounds. My question is that if we were to compile such a language to the JVM, is there some way we could take advantage of that for performance and remove the array bounds checks that occur on every array access?
1) I know that recent JDK supports some array bound check elimination, but since I know at compile time that certain calls are safe, I could remove a lot more safely.
2) Some might think this doesn't affect performance much but it most certainly does, especially in array/computation heavy applications such as scientific computing.
The same question regarding casting. I know something is a certain type, but Java doesn't because its limited type system. Is there some way to just tell the JVM to "trust me" and skip any checks?
I realize there is probably no way to do this as the JVM is generally distributed, could it be reasonable to modify a JVM with this feature? Is this something that has been done?
It's one of the frustrations in compiling a more powerfully typed language to the JVM, it still is hampered by Java's limitations.
In principle this cannot be done in a safe fashion without a proof-carrying code (PCC) infrastructure. PCC would allow you to embed your reasoning of safety in the class file. Your embedded proof is checked at class-loading time. The class is not loaded if there is a flaw in the proof.
If the JVM ever allowed you to drop runtime checks without requiring a formal proof, then, as SecurityMatt put it, it would defeat the original philosophy of Java as a safe platform.
The JVM uses a special form of PCC for type-checking local variables in a method. All local variable typing info is used by the class-loading mechanism to check its correctness, but discarded after that. But that's the only instance of PCC concepts used in the JVM. As far as I know there is no general PCC infrastructure for the JVM.
I once heard one existed for the JavaCard platform which supports a small subset of Java. I am not sure if that can be helpful in your problem though.
One of the key features of Java is that it does not need to "trust" the developer to do bounds checking. This eliminates the "buffer overflow" security vulnerabilities which can lead to attackers being able to execute arbitrary code within your application.
By allowing developers the ability to turn off bounds checking, Java would lose one of its key features - that no matter how wrong the Java developer is, there is not going to be any exploitable buffer overflows within his/her code.
If you would like to use a language where the programmer is trusted to manage their own bounds checking, might I suggest C++. This gives you the ability to allocate arrays with no automatic bounds checking (new int[]) and to allocate arrays with inbuilt bounds checking (std::vector).
Additionally, I strongly suggest that before blaming bounds checking for the speed loss in your application, you perform some BENCHMARKING to determine whether there is somewhere else in your code that might be causing the bottleneck.
You may find that for a compiler target that a bytecode language such as MSIL is more suited to your needs than Java bytecode. MSIL is strongly typed and does not suffer from a number of the inefficiencies that you have found in Java.

Static call graph generation for the Linux kernel

I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones which we can assume are only done through the use of function pointers in the case of the Linux kernel.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
ncc seems to implement this idea, however I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seems to be other tools that are able to do semantic analysis of the source to infer potential pointer values, but AFAICT, none of them are design to be used in a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End and its associated flow analysis machinery. The points-to analysis machinery (and the other analyses) are conservative; yes, you get some bogus points-to and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system with signature corresponding to the casted result. More problematic are casts which produce sort-of-compatible signatures; if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points to analysis will necessarily conservatively choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26 million line system.
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details about the compilations themselves, and in particular the configuration of the Linux kernal you want to generate. So you pretty much have to intercept the compiler calls to get the command line switches, etc.

Distributed array in MPI for parallel numerics

in many distributed computing applications, you maintain a distributed array of objects. Each process manages a set of objects that it may read and write exclusively and furthermore a set of objects that may only read (the content of which is authored by and frequently recerived from other processes).
This is very basic and is likely to have been done a zillion times until times until now - for example, with MPI. Hence I suppose there is something like an open source extension for MPI, which provides the basic capabilities of a distributed array for computing.
Ideally, it would be written in C(++) and mimic the official MPI standard interface style. Does anybody know anything like that? Thank you.
From what I gather from your question, you're looking for a mechanism for allowing a global view (read-only) of the problem space, but each process has ownership (read-write) of a segment of the data.
MPI is simply an API specification for inter-process communication for parallel applications and any implementation of it will work at a level lower than what you are looking for.
It is quite common in HPC applications to perform data decomposition in a way that you mentioned, with MPI used to synchronise shared data to other processes. However each application have different sharing patterns and requirements (some may wish to only exchange halo regions with neighbouring nodes, and perhaps using non-blocking calls to overlap communication other computation) so as to improve performance by making use of knowledge of the problem domain.
The thing is, using MPI to sync data across processes is simple but implementing a layer above it to handle general purpose distribute array synchronisation that is easy to use yet flexible enough to handle different use cases can be rather trickly.
Apologies for taking so long to get to the point, but to answer your question, AFAIK there isn't be an extension to MPI or a library that can efficiently handle all use cases while still being easier to use than simply using MPI. However, it is possible to to work above the level of MPI which maintaining distributed data. For example:
Use the PGAS model to work with your data. You can then use libraries such as Global Arrays (interfaces for C, C++, Fortran, Python) or languages that support PGAS such as UPC or Co-Array Fortran (soon to be included into the Fortran standards). There are also languages designed specifically for this form of parallelism, i,e. Fortress, Chapel, X10
Roll your own. For example, I've worked on a library that uses MPI to do all the dirty work but hides the complexity by providing creating custom data types for the application domain, and exposing APIs such as:
X_Create(MODE, t_X) : instantiate the array, called by all processes with the MODE indicating if the current process will require READ-WRITE or READ-ONLY access
X_Sync_start(t_X) : non-blocking call to initiate synchronisation in the background.
X_Sync_complete(t_X) : data is required. Block if synchronisation has not completed.
... and other calls to delete data as well as perform domain specific tasks that may require MPI calls.
To be honest, in most cases it is often simpler to stick with basic MPI or OpenMP, or if one exists, using a parallel solver written for the application domain. This of course depends on your requirements.
For dense arrays, see Global Arrays and Elemental (Google will find them for you).
For sparse arrays, see PETSc.
I know this is a really short answer, but there is too much documentation of these elsewhere to bother repeating it.

Assembly Analysis Tools

Does anyone have any suggestions for assembly file analysis tools? I'm attempting to analyze ARM/Thumb-2 ASM files generated by LLVM (or alternatively GCC) when passed the -S option. I'm particularly interested in instruction statistics at the basic block level, e.g. memory operation counts, etc. I may wind up rolling my own tool in Python, but was curious to see if there were any existing tools before I started.
Update: I've done a little searching, and found a good resource for disassembly tools / hex editors / etc here, but unfortunately it is mainly focused on x86 assembly, and also doesn't include any actual assembly file analyzers.
What you need is a tool for which you can define an assembly language syntax, and then build custom analyzers. You analyzers might be simple ("how much space does an instruction take?") or complex ("How many cycles will this isntruction take to execute?" [which depends on the preceding sequence of instructions and possibly a sophisticated model of the processor you care about]).
One designed specifically to do that is the New Jersey Machine Toolkit. It is really designed to build code generators and debuggers. I suspect it would be good at "instruction byte count". It isn't clear it is good at more sophisticated analyses. And I believe it insists you follow its syntax style, rather than yours.
One not designed specifically to do that, but good at parsing/analyzing langauges in general is our
DMS Software Reengineering Toolkit.
DMS can be given a grammar description for virtually any context free language (that covers most assembly language syntax) and can then parse a specific instance of that grammar (assembly code) into ASTs for further processing. We've done with with several assembly langauges, including the IBM 370, Motorola's 8 bit CPU line, and a rather peculiar DSP, without trouble.
You can specify an attribute grammar (computation over an AST) to DMS easily. These are great way to encode analyses that need just local information, such as "How big is this instruction?". For more complex analysese, you'll need a processor model that is driven from a series of instructions; passing such a machine model the ASTs for individual instructions would be an easy way to apply a machine model to compute more complex things as "How long does this instruction take?".
Other analyses such as control flow and data flow, are provided in generic form by DMS. You can use an attribute evaluator to collect local facts ("control-next for this instruction is...", "data from this instruction flows to,...") and feed them to the flow analyzers to compute global flow facts ("if I execute this instruction, what other instructions might be executed downstream?"..)
You do have to configure DMS for your particular (assembly) language. It is designed to be configured for tasks like these.
Yes, you can likely code all this in Python; after all, its a Turing machine. But likely not nearly as easily.
An additional benefit: DMS is willing to apply transformations to your code, based on your analyses. So you could implement your optimizer with it, too. After all, you need to connect the analysis indication the optimization is safe, to the actual optimization steps.
I have written many disassemblers, including arm and thumb. Not production quality but for the purposes of learning the assembler. For both the ARM and Thumb the ARM ARM (ARM Architectural Reference Manual) has a nice chart from which you can easily count up data operations from load/store, etc. maybe an hours worth of work, maybe two. At least up front, you would end up with data values being counted though.
The other poster may be right, as with the chart I am talking about it should be very simple to write a program to examine the ASCII looking for ldr, str, add, etc. No need to parse everything if you are interested in memory operations counts, etc. Of course the downside is that you are likely not going to be able to examine loops. One function may have a load and store, another may have a load and store but have it wrapped by a loop, causing many more memory operations once executed.
Not knowing what you really are interested in, my guess is you might want to simulate the code and count these sorts of things. I wrote a thumb simulator (thumbulator) that attempts to do just that. (and I have used it to compare llvm execution vs gcc execution when it comes to number of instructions executed, fetches, memory operations, etc) The problem may be that it is thumb only, no ARM no Thumb2. Thumb2 could be added easier than ARM. There exists an armulator from arm, which is in the gdb sources among other places. I cant remember now if it executes thumb2. My understanding is that when arm was using it would accurately tell you these sorts of statistics.
You can plug your statistics into LLVM code generator, it's quite flexible and it is already collecting some stats, which could be used as an example.

how to minimize a programming language compile time?

I was thinking more about the programming language i am designing. and i was wondering, what are ways i could minimize its compile time?
Your main problem today is I/O. Your CPU is many times faster than main memory and memory is about 1000 times faster than accessing the hard disk.
So unless you do extensive optimizations to the source code, the CPU will spend most of the time waiting for data to be read or written.
Try these rules:
Design your compiler to work in several, independent steps. The goal is to be able to run each step in a different thread so you can utilize multi-core CPUs. It will also help to parallelize the whole compile process (i.e. compile more than one file at the same time)
It will also allow you to load many source files in advance and preprocess them so the actual compile step can work faster.
Try to allow to compile files independently. For example, create a "missing symbol pool" for the project. Missing symbols should not cause compile failures as such. If you find a missing symbol somewhere, remove it from the pool. When all files have been compiled, check that the pool is empty.
Create a cache with important information. For example: File X uses symbols from file Y. This way, you can skip compiling file Z (which doesn't reference anything in Y) when Y changes. If you want to go one step further, put all symbols which are defined anywhere in a pool. If a file changes in such a way that symbols are added/removed, you will know immediately which files are affected (without even opening them).
Compile in the background. Start a compiler process which checks the project directory for changes and compile them as soon as the user saves the file. This way, you will only have to compile a few files each time instead of everything. In the long run, you will compile much more but for the user, turnover times will be much shorter (= time user has to wait until she can run the compiled result after a change).
Use a "Just in time" compiler (i.e. compile a file when it is used, for example in an import statement). Projects are then distributed in source form and compiled when run for the first time. Python does this. To make this perform, you can precompile the library during the installation of your compiler.
Don't use header files. Keep all information in a single place and generate header files from the source if you have to. Maybe keep the header files just in memory and never save them to disk.
what are ways i could minimize its compile time?
No compilation (interpreted language)
Delayed (just in time) compilation
Incremental compilation
Precompiled header files
I've implemented a compiler myself, and ended up having to look at this once people started batch feeding it hundreds of source files. I was quite suprised what I found out.
It turns out that the most important thing you can optimize is not your grammar. It's not your lexical analyzer or your parser either. Instead, the most important thing in terms of speed is the code that reads in your source files from disk. I/O's to disk are slow. Really slow. You can pretty much measure your compiler's speed by the number of disk I/Os it performs.
So it turns out that the absolute best thing you can do to speed up a compiler is to read the entire file into memory in one big I/O, do all your lexing, parsing, etc. from RAM, and then write out the result to disk in one big I/O.
I talked with one of the head guys maintaining Gnat (GCC's Ada compiler) about this, and he told me that he actually used to put everything he could onto RAM disks so that even his file I/O was really just RAM reads and writes.
In most languages (pretty well everything other than C++), compiling individual compilation units is quite fast.
Binding/linking is often what's slow - the linker has to reference the whole program rather than just a single unit.
C++ suffers as - unless you use the pImpl idiom - it requires the implementation details of every object and all inline functions to compile client code.
Java (source to bytecode) suffers because the grammar doesn't differentiate objects and classes - you have to load the Foo class to see if Foo.Bar.Baz is the Baz field of object referenced by the Bar static field of the Foo class, or a static field of the Foo.Bar class. You can make the change in the source of the Foo class between the two, and not change the source of the client code, but still have to recompile the client code, as the bytecode differentiates between the two forms even though the syntax doesn't. AFAIK Python bytecode doesn't differentiate between the two - modules are true members of their parents.
C++ and C suffer if you include more headers than are required, as the preprocessor has to process each header many times, and the compiler compile them. Minimizing header size and complexity helps, suggesting better modularity would improve compilation time. It's not always possible to cache header compilation, as what definitions are present when the header is preprocessed can alter its semantics, and even syntax.
C suffers if you use the preprocessor a lot, but the actual compilation is fast; much of C code uses typedef struct _X* X_ptr to hide implementation better than C++ does - a C header can easily consist of typedefs and function declarations, giving better encapsulation.
So I'd suggest making your language hide implementation details from client code, and if you are an OO language with both instance members and namespaces, make the syntax for accessing the two unambiguous. Allow true modules, so client code only has to be aware of the interface rather than implementation details. Don't allow preprocessor macros or other variation mechanism to alter the semantics of referenced modules.
Here are some performance tricks that we've learned by measuring compilation speed and what affects it:
Write a two-pass compiler: characters to IR, IR to code. (It's easier to write a three-pass compiler that goes characters -> AST -> IR -> code, but it's not as fast.)
As a corollary, don't have an optimizer; it's hard to write a fast optimizer.
Consider generating bytecode instead of native machine code. The virtual machine for Lua is a good model.
Try a linear-scan register allocator or the simple register allocator that Fraser and Hanson used in lcc.
In a simple compiler, lexical analysis is often the greatest performance bottleneck. If you are writing C or C++ code, use re2c. If you're using another language (which you will find much more pleasant), read the paper aboug re2c and apply the lessons learned.
Generate code using maximal munch, or possibly iburg.
Surprisingly, the GNU assembler is a bottleneck in many compilers. If you can generate binary directly, do so. Or check out the New Jersey Machine-Code Toolkit.
As noted above, design your language to avoid anything like #include. Either use no interface files or precompile your interface files. This tactic dramatically reduces the burdern on the lexer, which as I said is often the biggest bottleneck.
Here's a shot..
Use incremental compilation if your toolchain supports it.
(make, visual studio, etc).
For example, in GCC/make, if you have many files to compile, but only make changes in one file, then only that one file is compiled.
Eiffel had an idea of different states of frozen, and recompiling didn't necessarily mean that the whole class was recompiled.
How much can you break up the compliable modules, and how much do you care to keep track of them?
Make the grammar simple and unambiguous, and therefore quick and easy to parse.
Place strong restrictions on file inclusion.
Allow compilation without full information whenever possible (eg. predeclaration in C and C++).
One-pass compilation, if possible.
One thing surprisingly missing in answers so far: make you you're doing a context free grammar, etc. Have a good hard look at languages designed by Wirth such as Pascal & Modula-2. You don't have to reimplement Pascal, but the grammar design is custom made for fast compiling. Then see if you can find any old articles about the tricks Anders pulled implementing Turbo Pascal. Hint: table driven.
it depends on what language/platform you're programming for. for .NET development, minimise the number of projects that you have in your solution.
In the old days you could get dramatic speedups by setting up a RAM drive and compiling there. Don't know if this still holds true, though.
In C++ you could use distributed compilation with tools like Incredibuild
A simple one: make sure the compiler can natively take advantage of multi-core CPUs.
Make sure that everything can be compiled the fist time you try to compile it. E.g. ban forward references.
Use a context free grammar so that you can find the correct parse tree without a symbol table.
Make sure that the semantics can be deduced from the syntax so you can construct the correct AST directly rather than by mucking with a parse tree and symbol table.
How serious a compiler is this?
Unless the syntax is pretty convoluted, the parser should be able to run no more than 10-100 times slower than just indexing through the input file characters.
Similarly, code generation should be limited by output formatting.
You shouldn't be hitting any performance issues unless you're doing a big, serious compiler, capable of handling mega-line apps with lots of header files.
Then you need to worry about precompiled headers, optimization passes, and linking.
I haven't seen much work done for minimizing the compile time. But some ideas do come to mind:
Keep the grammar simple. Convoluted grammar will increase your compile time.
Try making use of parallelism, either using multicore GPU or CPU.
Benchmark a modern compiler and see what are the bottlenecks and what you can do in you compiler/language to avoid them.
Unless you are writing a highly specialized language, compile time is not really an issue..
Make a build system that doesn't suck!
There's a huge amount of programs out there with maybe 3 source files that take under a second to compile, but before you get that far you'd have to sit through an automake script that takes about 2 minutes checking things like the size of an int. And if you go to compile something else a minute later, it makes you sit through almost exactly the same set of tests.
So unless your compiler is doing awful things to the user like changing the size of its ints or changing basic function implementations between runs, just dump that info out to a file and let them get it in a second instead of 2 minutes.

Resources