Runtime bytecode generation performance for larger methods and classes - performance

There are many libraries for runtime bytecode generation, such as ASM, Javassist, CGLib, and BCEL, to name a few. All of these tools are capable of manipulating Java bytecode dynamically, and are different from tools like the javac compiler.
I understand that there are some good reasons to generate bytecode and load it into a ClassLoader at runtime. My question is whether there are any performance issues or concerns with these tools when generating bytecode for Java methods or classes that could be very large.
One scenario might be an application which keeps running for a long time and the generated bytecode would be trivial but continuous (it would keep generating bytecode and/or classes and load/unload them into a classloader continuously).
There is a similar question here, but none of the answers explain any questions about performance. May I have some links to academic articles regarding this issue?

In the real world it won't really matter which framework you use, unless you are planning to generate millions of new methods and load them at run time, which would be a bad idea to begin with.

Generating a class at runtime is nothing fancier than filling a byte array with contents. At the point where the JVM is told to interpret these contents as a Java class, it doesn’t differ from the way a precompiled class loaded from the hard drive is added to the runtime environment.
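To make the "filling a byte array" point concrete, here is a minimal sketch that assembles a tiny class file by hand and passes it to a ClassLoader. The class name Gen, its method answer() and the ByteArrayLoader helper are made up for this illustration, and a real generator such as ASM would do the constant-pool bookkeeping for you. It assumes a JVM that accepts class-file version 52 (Java 8 or newer).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class HandRolledClassDemo {

    /** Bytes for: public class Gen { public static int answer() { return 42; } } */
    static byte[] buildClassBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);                            // magic number
            out.writeShort(0);                                   // minor version
            out.writeShort(52);                                  // major version (Java 8)
            out.writeShort(8);                                   // constant pool count = entries + 1
            out.writeByte(1); out.writeUTF("Gen");               // #1 Utf8
            out.writeByte(7); out.writeShort(1);                 // #2 Class -> #1
            out.writeByte(1); out.writeUTF("java/lang/Object");  // #3 Utf8
            out.writeByte(7); out.writeShort(3);                 // #4 Class -> #3
            out.writeByte(1); out.writeUTF("answer");            // #5 Utf8 (method name)
            out.writeByte(1); out.writeUTF("()I");               // #6 Utf8 (descriptor)
            out.writeByte(1); out.writeUTF("Code");              // #7 Utf8 (attribute name)
            out.writeShort(0x0021);                              // ACC_PUBLIC | ACC_SUPER
            out.writeShort(2);                                   // this_class
            out.writeShort(4);                                   // super_class
            out.writeShort(0);                                   // interfaces count
            out.writeShort(0);                                   // fields count
            out.writeShort(1);                                   // methods count
            out.writeShort(0x0009);                              // ACC_PUBLIC | ACC_STATIC
            out.writeShort(5);                                   // method name index
            out.writeShort(6);                                   // method descriptor index
            out.writeShort(1);                                   // method attributes count
            out.writeShort(7);                                   // attribute name: "Code"
            out.writeInt(15);                                    // attribute length in bytes
            out.writeShort(1);                                   // max_stack
            out.writeShort(0);                                   // max_locals
            out.writeInt(3);                                     // code length
            out.writeByte(0x10); out.writeByte(42);              // bipush 42
            out.writeByte(0xAC);                                 // ireturn
            out.writeShort(0);                                   // exception table length
            out.writeShort(0);                                   // Code attributes count
            out.writeShort(0);                                   // class attributes count
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Exposes the protected defineClass so a byte array can be turned into a Class. */
    static final class ByteArrayLoader extends ClassLoader {
        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    /** Defines Gen from the byte array and calls Gen.answer() reflectively. */
    static int generateAndCall() {
        try {
            Class<?> gen = new ByteArrayLoader().define("Gen", buildClassBytes());
            return (Integer) gen.getMethod("answer").invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling HandRolledClassDemo.generateAndCall() defines Gen and invokes its method. From defineClass onward, the JVM handles these bytes as it would the contents of a .class file read from disk.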
Since filling a byte array is trivial, the performance depends on the rules which determine the contents. Parsing source code and validating its correctness is an expensive task. On the other hand, generating code according to hardcoded rules, e.g. fulfilling an interface by calling a single specified method (the way lambda instantiation works), usually works much faster than loading the equivalent code from the hard drive. Having such rules is the typical use case for runtime bytecode generation.
But before thinking about performance you should ask yourself why you are thinking about dynamic bytecode generation at all. In most real-life scenarios the answer to this question already contains the answer to the question of whether performance is relevant at all, or why it is expected to be improved by generating code.
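As a rough illustration of generating a class from a hardcoded rule, the JDK's own java.lang.reflect.Proxy creates a class at runtime that fulfils a given interface and routes every call to one handler. This is a stand-in, not how lambdas are actually instantiated, and because the dispatch goes through reflection it says nothing about the speed of dedicated generated bytecode. The Transformer interface and the twice method are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

final class RuleBasedProxyDemo {

    /** The interface to fulfil (hypothetical, for illustration only). */
    public interface Transformer {
        int apply(int x);
    }

    /** The single specified method that every interface call is routed to. */
    static int twice(int x) {
        return 2 * x;
    }

    /** Asks the JDK to generate a Transformer implementation at runtime. */
    static Transformer generate() {
        InvocationHandler rule = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) {
                if (method.getName().equals("apply")) {
                    return twice((Integer) args[0]);
                }
                // equals/hashCode/toString also arrive here; not handled in this sketch
                throw new UnsupportedOperationException(method.getName());
            }
        };
        return (Transformer) Proxy.newProxyInstance(
                Transformer.class.getClassLoader(),
                new Class<?>[] { Transformer.class },
                rule);
    }
}
```

The object returned by generate() is an instance of a class that did not exist before the call, yet no source code was parsed to create it.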

I think ASM is the strongest choice for two reasons. First, it's up to date with all of the JVM features, and second, its Visitor Pattern API is very efficient. This second point addresses your performance concerns, I think.

Related

Will using Sorbet's ruby type checker have an impact on a ruby app's performance?

Maybe a newbie question, but if you never ask you'll never know.
Can using Stripe's Sorbet (https://sorbet.org/) on a RoR app potentially improve the app's performance?
(performance meaning response times, not robustness / runtime error rate)
I did some reading on dynamically typed languages (particularly JavaScript in this case) and found out that if we keep sending some function (foo, for example) the same type of objects, the engine does some optimising work on that function, so that when it is invoked again with the same types, the interpreting work is quicker.
I thought maybe the Ruby interpreter does similar work, which could mean that type-checking may increase interpreting speed.
> I thought maybe the Ruby interpreter does similar work, which could mean that type-checking may increase interpreting speed.
It doesn't yet, but one could potentially build this one day.
The goal of Sorbet was to build a type system for people, as opposed to a type system for computers (the compiler). It can introduce some performance overhead, but as Stripe runs it in production, we keep it in check. Internally, we page ourselves if overhead exceeds 7% of CPU time.
> I did some reading on dynamically typed languages (particularly JavaScript in this case) and found out that if we keep sending some function (foo, for example) the same type of objects, the engine does some optimising work on that function, so that when it is invoked again with the same types, the interpreting work is quicker.
Yes, this can be done. What you're describing is a common optimization in Just-In-Time (JIT) compilers. The technique you seem to refer to uses run-time profiling, and it is a common alternative that achieves this result in the absence of a type system. It's also worth noting that well-built JITs can do it more frequently than a type system, as a type system encodes what could happen, while profiling and JITs can optimize for what actually happens in practice.
That said, building a JIT is frequently much more work than building an online compiler; thus, depending on the amount of investment one wants to put into speeding up Ruby, either building a JIT or using types can prove better under different real-world constraints.
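The profiling idea can be sketched as a monomorphic inline cache: a call site remembers the last receiver type it saw and skips the full lookup while the type stays the same. The sketch is in Java purely for illustration. METHOD_TABLE, CallSite and the counter are invented names, and neither the Ruby interpreter nor any real JIT is claimed to be implemented this way.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class InlineCacheSketch {

    /** Slow path: a full lookup of the implementation for a receiver type. */
    private static final Map<Class<?>, Function<Object, String>> METHOD_TABLE = new HashMap<>();
    static {
        METHOD_TABLE.put(Integer.class, o -> "int:" + o);
        METHOD_TABLE.put(String.class, o -> "str:" + o);
    }

    /** One call site with a one-entry cache keyed on the receiver's type. */
    static final class CallSite {
        private Class<?> cachedType;
        private Function<Object, String> cachedTarget;
        int slowLookups;   // how often the guard failed and the table was consulted

        String invoke(Object receiver) {
            Class<?> type = receiver.getClass();
            if (type != cachedType) {
                // Guard failed: do the expensive lookup and remember the result.
                slowLookups++;
                Function<Object, String> target = METHOD_TABLE.get(type);
                if (target == null) {
                    throw new IllegalArgumentException("no method for " + type.getName());
                }
                cachedTarget = target;
                cachedType = type;
            }
            // Fast path: same type as last time, so the cached target is reused.
            return cachedTarget.apply(receiver);
        }
    }
}
```

A site that only ever sees integers pays for one lookup. A type system could promise that up front, whereas the cache discovers it from what actually happens at runtime.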
> I thought maybe the Ruby interpreter does similar work, which could mean that type-checking may increase interpreting speed.
Summarizing the previous paragraphs, the Sorbet type system does not currently speed up Ruby, but it doesn't slow it down much either.
Type systems could indeed be used to speed up languages, but they aren't your only tool, with profiling and JIT compilation being the major competitor.
The optimizations you are talking about apply more to the JIT that is being worked on for the Ruby runtime.
In general, Sorbet aims at type safety by introducing type interfaces or method signatures. They enable static type checks that are applied before deploying the application in order to get rid of "type errors".
Sorbet comes with a runtime component that can enforce type checks at runtime in your running application, but those are going to decrease the application's performance, as they wrap method calls in order to check for correct types: https://sorbet.org/docs/runtime#runtime-checked-sig-s

Is There a Way of Providing asm.js or WebAssembly Code to V8 Turbofan?

After looking into the recently announced support for WebAssembly, it occurs to me that it would dramatically increase its utility if there were some way to:
1. Have TurboFan, the successor to the V8 JIT Crankshaft optimizer, output all the assembly code it generates along with the static type signatures and execution profile of that generated code.
2. Permit the programmer to provide his own asm.js/WebAssembly code for specific static type signatures that override the optimizer.
Is there some way to do this already?
There is some indication that there may be, in the following passage from this article:
Under the hood, the WebAssembly implementation in V8 is designed to
reuse much of the existing JavaScript virtual machine infrastructure,
specifically the TurboFan compiler. A specialized WebAssembly decoder
validates modules by checking types, local variable indices, function
references, return values, and control flow structure in a single
pass. The decoder produces a TurboFan graph which is processed by
various optimization passes and finally turned into machine code by
the same backend which generates machine code for optimized JavaScript
and asm.js. In the next few months, the team will concentrate on
improving the startup time of the V8 implementation through compiler
tuning, parallelism, and compilation policy improvements.
To expand on the idea for a more general audience:
Typical top-down optimization involves high level programming and then execution profiling to identify which pieces of code require more effort. This is true whether the optimization is automated code generation or manual coding of optimized code. In the case of dynamically typed languages you'll frequently want to go beyond just optimizing dynamically-typed algorithms and provide code specialized for specific static types. This is, in fact, what the V8 JIT optimizer does automatically. If humans want to manually provide some particularly 'hot' specialized cases, they'd need to inform the automated optimizer, somehow, that they have already done the work so the automated optimizer can incorporate the manually optimized code rather than automatically generating suboptimal code.
No, that's not possible, and it's highly unlikely that it ever will be, given that it would probably require piercing all sorts of abstraction barriers within the system. The complexity would be enormous, and the effect on maintainability and security would probably be severe.
The web interface to WebAssembly modules (through the Wasm object) provides a clean and simple way to interface between JS and Wasm. In the future, ES6 modules might simplify interop further. It's not obvious what advantage a complicated mechanism like you propose would have over that.
For 1. you can play with the following flags:
trace_turbo: trace generated TurboFan IR
trace_turbo_graph: trace generated TurboFan graphs
trace_turbo_cfg_file: trace turbo cfg graph (for C1 visualizer) to a given file name
trace_turbo_types: trace TurboFan's types
trace_turbo_scheduler: trace TurboFan's scheduler
trace_turbo_reduction: trace TurboFan's various reducers
trace_turbo_jt: trace TurboFan's jump threading
trace_turbo_ceq: trace TurboFan's control equivalence
turbo_stats: print TurboFan statistics
They may change in future versions of V8 and aren't a stable API.
TurboFan is pretty complicated in that it consumes information from the baseline JIT / the interpreter, and may get to that information after deopt. The compiler isn't always a straight pipeline from JS / wasm to assembly. Inlining and a bunch of other things affect what happens.
For 2.: write wasm code or valid asm.js in the first place.
We've discussed performing a bunch of different types of dynamic tracing, caching traces (and allowing injection of traces for testing), but that's probably not something we'd expose considering that there's already a way to give the compiler precise type information!

Hack the JVM to avoid unnecessary bounds checks and casts

There are some languages that support a sufficiently powerful type system that they can prove at compile time that the code does not address an array outside its bounds. My question is that if we were to compile such a language to the JVM, is there some way we could take advantage of that for performance and remove the array bounds checks that occur on every array access?
1) I know that recent JDKs support some array bounds check elimination, but since I know at compile time that certain calls are safe, I could safely remove a lot more.
2) Some might think this doesn't affect performance much but it most certainly does, especially in array/computation heavy applications such as scientific computing.
The same question applies to casting. I know something is a certain type, but Java doesn't because of its limited type system. Is there some way to just tell the JVM to "trust me" and skip any checks?
I realize there is probably no way to do this with the JVM as it is generally distributed. Could it be reasonable to modify a JVM to add this feature? Is this something that has been done?
It's one of the frustrations in compiling a more powerfully typed language to the JVM: it is still hampered by Java's limitations.
In principle this cannot be done in a safe fashion without a proof-carrying code (PCC) infrastructure. PCC would allow you to embed your reasoning of safety in the class file. Your embedded proof is checked at class-loading time. The class is not loaded if there is a flaw in the proof.
If the JVM ever allowed you to drop runtime checks without requiring a formal proof, then, as SecurityMatt put it, it would defeat the original philosophy of Java as a safe platform.
The JVM uses a special form of PCC for type-checking local variables in a method. All local variable typing info is used by the class-loading mechanism to check its correctness, but discarded after that. But that's the only instance of PCC concepts used in the JVM. As far as I know there is no general PCC infrastructure for the JVM.
I once heard one existed for the JavaCard platform which supports a small subset of Java. I am not sure if that can be helpful in your problem though.
One of the key features of Java is that it does not need to "trust" the developer to do bounds checking. This eliminates the "buffer overflow" security vulnerabilities which can lead to attackers being able to execute arbitrary code within your application.
By allowing developers the ability to turn off bounds checking, Java would lose one of its key features: that no matter how wrong the Java developer is, there are not going to be any exploitable buffer overflows within his/her code.
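The checks being discussed are easy to observe. In this small sketch (class and method names are made up), an out-of-range array read and a value of the wrong type smuggled in through an unchecked cast are both rejected at runtime with an exception, rather than silently reading whatever happens to be in memory.

```java
import java.util.ArrayList;
import java.util.List;

final class RuntimeChecksDemo {

    /** Reads one element past the end of an array. */
    static String readPastEnd() {
        int[] data = new int[4];
        try {
            int index = data.length;              // deliberately out of range
            return "read " + data[index];
        } catch (ArrayIndexOutOfBoundsException e) {
            return "rejected";                    // the bounds check fired
        }
    }

    /** Puts an Integer into a List<String> behind the compiler's back, then reads it. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static String readWrongType() {
        List<String> strings = new ArrayList<>();
        ((List) strings).add(Integer.valueOf(7)); // raw type bypasses the static check
        try {
            String s = strings.get(0);            // compiler-inserted cast to String
            return "read " + s;
        } catch (ClassCastException e) {
            return "rejected";                    // the cast check fired
        }
    }
}
```

Both methods return "rejected". These are the checks that a "trust me" switch would have to remove.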
If you would like to use a language where the programmer is trusted to manage their own bounds checking, might I suggest C++. This gives you the ability to allocate arrays with no automatic bounds checking (new int[]) and to allocate arrays with inbuilt bounds checking (std::vector).
Additionally, I strongly suggest that before blaming bounds checking for the speed loss in your application, you perform some BENCHMARKING to determine whether there is somewhere else in your code that might be causing the bottleneck.
You may find that, as a compiler target, a bytecode language such as MSIL is better suited to your needs than Java bytecode. MSIL is strongly typed and does not suffer from a number of the inefficiencies that you have found in Java.

What levels should static analyzers analyze?

I've noticed that some static analyzers operate on source code, while others operate on bytecode (e.g., FindBugs). I'm sure there are even some that work on object code.
My question is a simple one: what are the advantages and disadvantages of writing different kinds of static analyzers for different levels of analysis?
Under "static analyzers" I'm including linters, bug finders, and even full-blown verifiers.
And by levels of analysis I would include source code, high-level IRs, low-level IRs, bytecode, object code, and compiler plugins that have access to all phases.
These different facets can influence the level at which an analyzer may decide to work:
1. Designing a static analyzer is a lot of work. It would be a shame not to factor this work for several languages compiled to the same bytecode, especially when the bytecode retains most of the structure of the source program: Java (FindBugs), .NET (various tools related to Code Contracts). In some cases, the common target language was made up for the purpose of analysis although the compilation scheme wasn't following this path.
2. Related to 1, you may hope that your static analyzer will be a little less costly to write if it works on a normalized version of the program with a minimum number of constructs. When authoring static analyzers, having to write the treatment for repeat until when you have already written while do is a bother. You may structure your analyzer so that several functions are shared for these two cases, but the care-free way to handle this is to translate one to the other, or to translate the source to an intermediate language that only has one of them.
3. On the other hand, as already pointed out in Flash Sheridan's answer, source code contains the most information. For instance, in languages with fuzzy semantics, bugs at the source level may be removed by compilation. C and C++ have numerous "undefined behaviors" where the compiler is allowed to do anything, including generating a program that works accidentally. Fine, you might think, if the bug is not in the executable it's not a problematic bug. But if you ever recompile the program for another architecture or with the next version of the compiler, the bug may appear again. This is one reason for not doing the analysis after any phase that might potentially remove bugs.
4. Some properties can only be checked with reasonable precision on compiled code. That includes absence of compiler-introduced bugs, as pointed out again by Flash Sheridan, but also worst-case execution time. Similarly, many languages do not let you know what floating-point code does precisely unless you look at the assembly generated by the compiler (this is because existing hardware does not make it convenient for them to guarantee more). The choice is then to write an imprecise source-level analyzer that takes into account all possibilities, or to analyze precisely one particular compilation of a floating-point program, as long as it is understood that it is that precise assembly code that will be executed.
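The "translate one to the other" idea from the second point above can be sketched as a tiny normalization pass. The AST classes here are hypothetical, and the rewrite (repeat S until C becomes S; while (!C) S) duplicates the loop body, so it assumes a toy language without break or continue.

```java
final class DesugarSketch {

    interface Stmt {}

    static final class Basic implements Stmt {
        final String text;
        Basic(String text) { this.text = text; }
    }

    static final class Seq implements Stmt {
        final Stmt first, second;
        Seq(Stmt first, Stmt second) { this.first = first; this.second = second; }
    }

    static final class While implements Stmt {
        final String cond;
        final Stmt body;
        While(String cond, Stmt body) { this.cond = cond; this.body = body; }
    }

    static final class RepeatUntil implements Stmt {
        final Stmt body;
        final String cond;
        RepeatUntil(Stmt body, String cond) { this.body = body; this.cond = cond; }
    }

    /** Rewrites every "repeat S until C" into "S; while (!C) S". */
    static Stmt desugar(Stmt s) {
        if (s instanceof RepeatUntil) {
            RepeatUntil r = (RepeatUntil) s;
            Stmt body = desugar(r.body);
            return new Seq(body, new While("!(" + r.cond + ")", body));
        }
        if (s instanceof Seq) {
            Seq q = (Seq) s;
            return new Seq(desugar(q.first), desugar(q.second));
        }
        if (s instanceof While) {
            While w = (While) s;
            return new While(w.cond, desugar(w.body));
        }
        return s; // Basic statements are already in the core language
    }

    /** A stand-in "analysis" that only needs cases for the core constructs. */
    static int countLoops(Stmt s) {
        if (s instanceof While) return 1 + countLoops(((While) s).body);
        if (s instanceof Seq) return countLoops(((Seq) s).first) + countLoops(((Seq) s).second);
        if (s instanceof Basic) return 0;
        throw new IllegalArgumentException("not desugared: " + s.getClass().getSimpleName());
    }

    /** Pretty-printer, used here only to show the result of the rewrite. */
    static String render(Stmt s) {
        if (s instanceof Basic) return ((Basic) s).text + ";";
        if (s instanceof Seq) return render(((Seq) s).first) + " " + render(((Seq) s).second);
        if (s instanceof While) {
            While w = (While) s;
            return "while (" + w.cond + ") { " + render(w.body) + " }";
        }
        RepeatUntil r = (RepeatUntil) s;
        return "repeat { " + render(r.body) + " } until (" + r.cond + ")";
    }
}
```

The analysis itself (countLoops) never mentions RepeatUntil. Running desugar first is what keeps it that small, at the cost of analyzing a program that no longer looks exactly like the source.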
Source code analysis is the most generally useful, of course; sometimes heuristics even need to analyze comments or formatting. But you’re right that even object code analysis can be necessary, e.g., to detect bugs introduced by GCC misfeatures. Thomas Reps, head of GrammaTech and a Wisconsin professor, gave a good talk on this at Stanford a couple of years ago: http://pages.cs.wisc.edu/~reps/#TOPLAS-WYSINWYX.

Questions about possible Java (or other memory-managed language) optimizations

From what I have read, Java (usually) seems to be compiled to not very (if at all?) optimised Java bytecode, leaving it to the JIT to optimise. Is this true? And if it is, has there been any exploration (possibly in alternative implementations) of getting the compiler to optimise the code so the JIT has less work to do (is this possible)?
Also, many people seem to have a dislike for native code generation (sometimes referred to as ahead-of-time compilation) for Java (and many other high-level memory-managed languages), for many reasons such as loss of portability (etc.), but also partially because (at least for those languages that have a just-in-time compiler) the thinking goes that ahead-of-time compilation to machine code will miss the possible optimisations that can be done by a JIT compiler and therefore may be slower in the long run.
This leads me to wonder whether anyone has ever tried to implement http://en.wikipedia.org/wiki/Profile-guided_optimization (compiling to a binary + some extras, then running the program and analysing the runtime information of the test run to generate a hopefully more optimised binary for real-world usage) for Java (or other memory-managed languages), and how this would compare to JIT code? Anyone have a clue?
Personally, I think the big difference is not between JIT compiling and AOT compiling, but between class-compilation and whole-program optimization.
When you run javac, it only looks at a single .java file, compiling it into a single .class file. All the interface implementations and virtual methods and overrides are checked for validity but left unresolved (because it's impossible to know the true method invocation targets without analyzing the whole program).
The JVM uses "runtime loading and linking" to assemble all of your classes into a coherent program (and any class in your program can invoke specialized behavior to change the default loading/linking behavior).
But then, at runtime, the JVM can remove the vast majority of virtual methods. It can inline all of your getters and setters, turning them into raw fields. And when those raw fields are inlined, it can perform constant-propagation to further optimize the code. (At runtime, there's no such thing as a private field.) And if there's only one thread running, the JVM can eliminate all synchronization primitives.
To make a long story short, there are a lot of optimizations that aren't possible without analyzing the whole program, and the best time for doing whole program analysis is at runtime.
Profile-guided optimization has some caveats, one of them mentioned even in the Wiki article you linked. Its results are valid:
for the given samples, representing how your code is actually used by the user or other code.
for the given platform (CPU, memory + other hardware, OS, whatever).
From the performance point of view there are quite big differences even among platforms that are usually considered (more or less) the same (e.g. compare a single core, old Athlon with 512M with a 6 core Intel with 8G, running on Linux, but with very different kernel versions).
for the given JVM and its config.
If any of these change, then your profiling results (and the optimizations based on them) are not necessarily valid any more. Most likely some of the optimizations will still have a beneficial effect, but some of them may turn out suboptimal (or even degrade performance).
As was mentioned, JIT JVMs do something very similar to profiling, but they do it on the fly. It's also called 'hotspot', because it constantly monitors the executed code, looks for hot spots that are executed frequently and tries to optimize only those parts. At this point it is able to exploit more knowledge about the code (knowing its context, how it is used by other classes, etc.) so, as mentioned by you and the other answers, it can do better optimizations than a static compiler. It continues monitoring, and if needed it does another round of optimization later, this time trying even harder (looking for more, and more expensive, optimizations).
Working on the real-life data (usage statistics + platform + config), it can avoid the caveats mentioned before.
The price is some additional time it needs to spend on "profiling" + JIT-ing. Most of the time it's spent quite well.
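The "monitor, then optimize only the hot parts" behaviour can be mimicked with a toy counter. This is only a sketch: the threshold is an arbitrary assumption, the "optimized" version is supplied by hand instead of being compiled, and a real JVM works from far richer profile data.

```java
import java.util.function.IntUnaryOperator;

final class HotSpotCounterSketch {

    /** Hypothetical threshold; real JVMs use different, tunable values. */
    static final int THRESHOLD = 1_000;

    private final IntUnaryOperator baseline;   // stands in for interpreted code
    private final IntUnaryOperator optimized;  // stands in for JIT-compiled code
    private int invocations;
    private boolean promoted;

    HotSpotCounterSketch(IntUnaryOperator baseline, IntUnaryOperator optimized) {
        this.baseline = baseline;
        this.optimized = optimized;
    }

    /** Counts calls and switches to the optimized version once the method is "hot". */
    int call(int x) {
        if (!promoted && ++invocations >= THRESHOLD) {
            promoted = true;                   // the "profiling" cost ends here
        }
        return (promoted ? optimized : baseline).applyAsInt(x);
    }

    boolean isPromoted() {
        return promoted;
    }

    /** Sum of 0..n-1: a loop as the baseline, the closed form as the "optimized" code. */
    static HotSpotCounterSketch sumBelow() {
        return new HotSpotCounterSketch(
                n -> { int s = 0; for (int i = 0; i < n; i++) s += i; return s; },
                n -> n * (n - 1) / 2);
    }
}
```

Code that is called only a few times never pays for optimization, while code that crosses the threshold keeps the benefit for the rest of the run.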
I guess a profile-guided optimizer could still compete with it (or even beat it), but only in some special cases, if you can avoid the caveats:
you are quite sure that your samples represent the real life scenario well and they won't change too much during execution.
you know your target platform quite precisely and can do the profiling on it.
and of course you know/control the JVM and its config.
It will happen rarely and I guess in general JIT will give you better results, but I have no evidence for it.
Another possibility for getting value from profile-guided optimization is if you target a JVM that can't do JIT optimization (I think most small devices have such a JVM).
BTW one disadvantage mentioned in other answers would be quite easy to avoid: if static/profile guided optimization is slow (which is probably the case) then do it only for releases (or RCs going to testers) or during nightly builds (where time does not matter so much).
I think the much bigger problem would be to have good sample test cases. Creating and maintaining them is usually not easy and takes a lot of time. Especially if you want to be able to execute them automatically, which would be quite essential in this case.
The official Java HotSpot compiler does "adaptive optimisation" at runtime, which is essentially the same as the profile-guided optimisation you mentioned. This has been a feature of at least this particular Java implementation for a long time.
The trade-off to performing more static analysis or optimisation passes up-front at compile time is essentially the (ever-diminishing) returns you get from this extra effort against the time it takes for the compiler to run. A compiler like MLton (for Standard ML) is a whole-program optimising compiler with a lot of static checks. It produces very good code, but becomes very, very slow on medium-to-large programs, even on a fast system.
So the Java approach seems to be to use JIT and adaptive optimisation as much as possible, with the initial compilation pass just producing an acceptable valid binary. The absolute opposite end is an approach like that of MLKit, which does a lot of static inference of regions and memory behaviour.

Resources