Overhead when calling a component function vs inline code - ColdFusion - performance

I've been diagnosing a performance issue with generating a CSV containing around 50,000 lines and I've narrowed it down to a single function that is used once per line.
After a lot of messing about, I've discovered that there is an overhead in using the function, rather than placing the logic directly in the loop - my question is: Why?!
The function in question is very simple, it accepts a string argument and passes that to a switch/case block containing around 15 options - returning the resulting string.
I've put a bunch of timers all over the place and discovered that a lot (not all) of the time this function call takes between 0 and 200 ms to run... however if I put the exact same code inline, it sits at 0 on every iteration.
All this points to a fundamental issue in my understanding of object instantiation and I'd appreciate some clarification.
I was always under the impression that if I instantiate a component at the top of a page, or indeed if I instantiate it in a persistent scope like Application or Session, then it would be placed into memory and subsequent calls to functions within that component would be lightning fast.
It seems however, that there is an overhead to calling these functions and while we're only talking a few milliseconds, when you have to do that 50,000 times it quickly adds up.
Furthermore, it seems that doing this consumes resources. I'm not particularly well versed in the way the JVM uses memory; I've read up on it and played with settings and such, but it's an overwhelming topic, especially for those of us with no Java development experience. It seems that when calling the method rather than running the code inline, sometimes the ColdFusion service just collapses and the request never ends. Other times it does complete, although far too slowly. This suggests that the request can complete only when the server has the resources to handle it, and thus that the method call itself is consuming memory... (?)
If the calling of a method does indeed have an overhead attached, I have a big problem. It's not really feasible to move all of this code inline (while the function in question is simple, there are plenty of other functions that I will need to make use of), and doing so goes against everything I believe as a developer!
So, any help would be appreciated.
Just for clarity and because I'm sure someone will ask for it, here's the code in question:
EDIT: As suggested, I've changed the code to use a struct lookup rather than CFSwitch - below is amended code for reference, however there's also a test app in pastebin links at the bottom.
Inside the init method:
<cfset Variables.VehicleCategories = {
    'T1' : 'Beetle'
    , 'T1C' : 'Beetle Cabrio'
    , 'T2' : 'Type 2 Split'
    , 'T2B' : 'Type 2 Bay'
    , 'T25' : 'Type 25'
    , 'Ghia' : 'Karmann Ghia'
    , 'T3' : 'Type 3'
    , 'G1' : 'MK1 Golf'
    , 'G1C' : 'MK1 Golf Cabriolet'
    , 'CADDY' : 'MK1 Caddy'
    , 'G2' : 'MK2 Golf'
    , 'SC1' : 'MK1/2 Scirocco'
    , 'T4' : 'T4'
    , 'CO' : 'Corrado'
    , 'MISC' : 'MISC'
} />
Function being called:
<cffunction name="getCategory" returntype="string" output="false">
    <cfargument name="vehicleID" required="true" type="string" hint="Vehicle type" />
    <cfscript>
        if (structKeyExists(Variables.VehicleCategories, Arguments.VehicleID)) {
            return Variables.VehicleCategories[Arguments.VehicleID];
        }
        else {
            return 'Base SKUs';
        }
    </cfscript>
</cffunction>
As requested, I've created a test application to replicate this issue:
http://pastebin.com/KE2kUwEf - Application.cfc
http://pastebin.com/X8ZjL7D7 - TestCom.cfc (Place in 'com' folder outside webroot)
http://pastebin.com/n8hBLrfd - index.cfm

A function call will always be slower than inline code, in any language. That's why C++ has the inline keyword, and why in JVM land the JIT optimizer will inline functions for you if it deems it worthwhile.
Now, ColdFusion is yet another layer on top of the JVM, so a function in CF is not a function in the JVM, and things don't translate 1:1 from the JIT optimizer's standpoint. A CFML function is actually compiled down to its own Java class. Also, scopes like Arguments and Local (Java hashtables) are created on every invocation. Those take time and memory, and therefore add overhead.
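To make that scope overhead concrete, here is a minimal Java sketch of roughly what every CFML function invocation entails; the class and method names are invented for illustration, and the classes ColdFusion actually generates are far more involved:
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of what invoking a CFML UDF roughly entails.
public class GetCategoryUDF {
    @SuppressWarnings("unchecked")
    public Object invoke(Object vehicleID, Map<String, Object> variablesScope) {
        // A fresh Arguments scope is allocated on every single call...
        Map<String, Object> argumentsScope = new HashMap<>();
        argumentsScope.put("vehicleID", vehicleID);
        // ...and so is a fresh Local scope, even if it stays empty.
        Map<String, Object> localScope = new HashMap<>();

        Map<String, Object> categories =
                (Map<String, Object>) variablesScope.get("VehicleCategories");
        Object match = categories.get(argumentsScope.get("vehicleID"));
        return (match != null) ? match : "Base SKUs";
        // Both scope maps become garbage here; at 50,000 calls that is
        // 100,000 short-lived objects for the collector to deal with.
    }
}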
...if I instantiate it in a persistent scope like Application or Session, then it would be placed into memory and subsequent calls to functions within that component would be lightning fast
It'd be faster than instantiating a new instance for sure, but it's not going to be "lightning fast" especially when you call it in a tight loop.
In conclusion, inline the function, and if it's still not fast enough, locate the slowest part of the code and write it in Java.
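As a hedged sketch of that last suggestion, the whole lookup could live in a small Java class loaded from the classpath (the class name and structure below are assumptions for illustration, not code from the question):
import java.util.HashMap;
import java.util.Map;

// Hypothetical Java port of the hot-path lookup.
public final class VehicleCategories {
    private static final Map<String, String> CATEGORIES = new HashMap<>();
    static {
        CATEGORIES.put("T1", "Beetle");
        CATEGORIES.put("T1C", "Beetle Cabrio");
        CATEGORIES.put("T2", "Type 2 Split");
        // ... remaining mappings as in the init method above ...
        CATEGORIES.put("MISC", "MISC");
    }

    private VehicleCategories() {}

    public static String getCategory(String vehicleID) {
        // One hash lookup, no per-call scope allocation.
        String match = CATEGORIES.get(vehicleID);
        return (match != null) ? match : "Base SKUs";
    }
}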

Just a side note here: since Railo uses inner classes instead of completely independent classes, it is faster if you write in a style with many small functions. In my experiments, both engines perform similarly with basic inline code, but Adobe ColdFusion lends itself to large god functions if you need to eke out performance under load. Because the JVM cannot inline ColdFusion functions during compilation, you'll never get the benefit of the compiler being smart with your code.
This is especially important if you've built an application that uses a ton of explicit getters/setters and your traffic grows from small volume to high volume. All those little functions will bring you to your knees, versus having fewer large "god" functions.
Slowest to fastest, from one basic test we ran of 100,000 iterations (a hypothetical harness for this kind of measurement is sketched after the list):
Adobe ColdFusion, many small functions (~200x slower than native Java)
Railo, many small functions (~60x slower)
ColdFusion / Railo, all code inline in one giant function (~10x slower)
Native Java class (fastest)
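For reference, here is a minimal sketch of the kind of harness that produces numbers like these. It is an assumption, not the test code we actually ran, and a naive loop like this is only indicative: the JIT can inline and partially optimize away toy workloads, so a serious comparison would use a benchmarking harness such as JMH.
public class CallOverheadBench {
    private static String lookup(String key) {
        return "T1".equals(key) ? "Beetle" : "Base SKUs";
    }

    public static void main(String[] args) {
        final int iterations = 100000;
        String sink = "";

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = lookup("T1"); // result obtained via a method call
        }
        long viaCall = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = "T1".equals("T1") ? "Beetle" : "Base SKUs"; // same logic inline
        }
        long inline = System.nanoTime() - start;

        // Print sink so the JIT cannot discard the loop bodies entirely.
        System.out.println("via call: " + viaCall + " ns, inline: " + inline
                + " ns, last result: " + sink);
    }
}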

Related

Performance implications of function calls in PSM1 Modules

I have a function that does a find/replace on text files, and it has worked well for some time. Until I needed to process a 12 million line file.
My initial code used Get-Content and Write-Content, and with the massive file it was going to take hours to process, not to mention the memory implications of loading 12 million lines into RAM.
So, I wrote a little test script to compare that approach vs Stream Reader/Writer. And Streaming looked like it was going to be a massive performance improvement, dropping processing to 30 seconds. I then added a .Replace() on each line, and total processing time only went up to maybe a minute. All good. So then I went to implement it in my real code, and performance has tanked again. That code is a PS1 that loads a number of PSM1 files. The function to do the find replace is in one of those PSM1 files, and that code calls functions in another PSM1. The test script was everything in a single small PS1.
Given that my test script didn't use a function call at all, I tested that first, so there is a function in the PS1 that gets called 12 million times from the loop in the same PS1. No real performance impact.
So, my thought then was that calling a function in one PSM1 that then calls a function in another PSM1 (12 million times) might be the issue. So I made a dummy function (which just returns the passed string, as if no replacement was needed) in the same PSM1 as the loop. And that is orders of magnitude slower.
I have not tested this with everything in the PS1, mostly because these functions are needed in three different scripts with very different argument requirements, so implementing it with Modules really made a lot of sense logistically, and changing that would be a massive undertaking.
That said, is there a known performance hit when calling a function that lives in a module? I was under the impression that once the modules are loaded, it's basically the same as if it was all in a single PS1, but perhaps not? FWIW, I am not using namespaces; all of my functions just have a prefix on the noun side of the name to avoid conflicts.
I also can't easily post a minimal working example, since the single-file version doesn't exhibit the behavior. If there is no obvious answer, I guess my next step is to implement the test script with some modules, but that's not really apples to apples either, since my real modules are rather large.
To add a little context: when the function (in a PSM1) does not call a function and simply sets $writeLine = $originalLine, total time is 15 seconds.
When doing an actual find and replace inline (no call to a function), like $writeLine = $originalLine.Replace($replace, $with), total processing time is 16 seconds.
When calling a function in the same PSM1 that just returns the original string, total time is 17 minutes.
But again, when it's all in a PS1 file with no modules, calling a function has minimal impact. So it certainly seems like calling a function in a PSM1, even from a function in that same PSM1, has a massive performance overhead.
And more context:
I moved the replace function in the test script into a Module. No appreciable change. So I moved the main code, including the loop, into a function in that module, and called it from the main script. Again, no real change. Both took around 15 seconds.
So, it's not something innate in modules. That then raises the question: what could I be doing in my other modules that would trigger this behavior? These modules are 3,000-10,000 lines of code, so there is a lot going on. Hopefully someone has some insight into best practices with modules to mitigate this. And hopefully it's not "Don't use big modules". ;)
Final update:
It seems it IS a function of how big the module is. I deleted all the other functions in the module that contains the loop, and performance is fine: 17 seconds. So, basically, even as of PS5.0 the implementation of modules is pretty useless for anything large. Rather disconcerting. I wonder if the same would be true if all the functions were in a single file, and PowerShell performance in large files with lots of functions is just bad? Anyone have any experience down this road?

First call to Julia is slow

This question deals with the first-load performance of Julia.
I am running a Julia program from the command line. The program is intended to be an API, where the user of the API doesn't have to initialize internal objects.
module Internal
type X
    c::T1
    d::T2
    ...
end
function do_something(a::X, arg::Any)
    # some function
end
export do_something
end
API.jl
using Internal
const i_of_X = X()
function wrapper_do_something(args::Any)
    do_something(i_of_X, args)
end
Now, this API.jl is exposed to third-party users so that they don't have to bother with instantiating the internal objects. However, API.jl is not a module and hence cannot be precompiled. As there are many functions in API.jl, the first load takes a "very" long time.
Is there a way to improve the performance? I tried wrapping API.jl in a module too, but I don't know if wrapping const-initialized variables in a module is the way to go. I also get a segmentation fault when doing so (some of the consts are database connections and database collections, along with other complex objects).
I am using v0.5 on OSX
[EDIT]
I did wrap API.jl in a module but there is no performance improvement.
I dug deeper, and a big performance hit comes from the first call to the linear regression function (GLM-module-based OLS, lm(y~X, df)). The df has only 2 columns and 3 rows, so it's not a run-time issue but compilation slowness.
The other big hit comes from calling a highly overloaded function. The overloaded function fetches data from the database and can accept a variety of input formats.
Is there a way to speed these up? Is there a way to fully precompile the julia program?
For a little more background, the API-based program is called once via the command line, and any persistent first-compilation advantages are lost when the command line closes the Julia process.
$julia run_api_based_main_func.jl
One hacky way to keep the compilation benefits is to somehow copy/paste the code into an already active Julia process. Is this doable/recommended? (I am desperate to make it fast; waiting 15-20 s for a 2 s analysis doesn't seem right.)
It is OK to wrap const values in a module. They can be exported, as need be.
As Fengyang said, wrapping independent components of a larger design in modules is helpful and will help in this situation. When there is a great deal going on inside a module, the precompile time that accompanies each initial function call can add up. There is a way to avoid that -- precompile the contents before using the module:
__precompile__(true)
module ModuleName
# ...
end # module ModuleName
Please Note (from the online help):
__precompile__() should not be used in a module unless all of its dependencies are also using __precompile__(). Failure to do so can result in a runtime error when loading the module.

What's the most efficient way to ignore code in lua?

I have a chunk of Lua code that I'd like to be able to (selectively) ignore. I don't have the option of not reading it in, and sometimes I'd like it to be processed, sometimes not, so I can't just comment it out (that is, there's a whole bunch of blocks of code and I have the option of reading either none of them or all of them).
I came up with two ways to implement this (there may well be more; I'm very much a beginner): either enclose the code in a function and then call or not call the function (and once I'm sure I'm past the point where I would call the function, I can set it to nil to free up the memory), or enclose the code in an if ... end block. The former has a slight advantage in that there are several of these blocks and it makes it easier for one block to load another even if the main program didn't request it, but the latter seems the more efficient. However, not knowing much, I don't know if the efficiency saving is worth it.
So how much more efficient is:
if false then
    -- a few hundred lines
end
than
throwaway = function ()
    -- a few hundred lines
end
throwaway = nil -- to ensure that both methods leave me in the same state after garbage collection
?
If it depends a lot on the lua implementation, how big would the "few hundred lines" need to be to reliably spot the difference, and what sort of stuff should it include to best test (the main use of the blocks is to define a load of possibly useful functions)?
Lua's not smart enough to dump the code for the function, so you're not going to save any memory.
In terms of speed, you're talking about a difference of nanoseconds, which happens once per program execution. It's harming your efficiency to worry about this; it has virtually no relevance to actual performance. Write the code that you feel expresses your intent most clearly, without trying to be clever. If you run into performance issues, they will be a million miles away from this decision.
If you want to save memory, which is understandable on a mobile platform, you could put your conditional code in its own module and never load it at all if not needed (if your framework supports it; e.g. MOAI does, Corona doesn't).
If there is really a lot of unused code, you can define it as a collection of strings and loadstring() them when needed. Storing functions as strings will reduce the initial compile time; however, for most functions the string representation probably takes up more memory than its compiled form, and what you save on compilation is probably not significant below a few thousand lines... Just saying.
If you put this code in a table, you could compile it transparently through a metatable for minimal performance impact on repeated calls.
Example code
local code_uncompiled = {
    f = [=[
        local x, y = ...;
        return x + y;
    ]=]
}
code = setmetatable({}, {
    __index = function(self, k)
        -- compile on first access, then cache the compiled function
        self[k] = assert(loadstring(code_uncompiled[k]));
        return self[k];
    end
});
local ff = code.f; -- the code of f gets compiled here
ff = code.f; -- no compilation here
for i = 1, 1000 do
    print( ff(2*i, -i) ); -- no compilation here either
    print( code.f(2*i, -i) ); -- no compile either, but a table access (slower)
end
The beauty of it is that this compiles as needed, and you don't really have to waste another thought on it; it's just like storing a function in a table, and it allows for a lot of flexibility.
Another advantage of this solution is that when the amount of dynamically loaded code gets out of hand, you could transparently change it to load code from external files on demand through the __index function of the metatable. Also, you can mix compiled and uncompiled code by populating the "code" table with "real" functions.
Try the one that makes the code more legible to you first. If it runs fast enough on your target machine, use that.
If it doesn't run fast enough, try the other one.
Lua can ignore multiple lines with a block comment:
function dostuff()
    blabla
    faaaaa
    --[[
    ignore this
    and this
    maybe this
    this as well
    ]]--
end

What is the design rationale behind HandleScope?

V8 requires a HandleScope to be declared in order to clean up any Local handles that were created within scope. I understand that HandleScope will dereference these handles for garbage collection, but I'm interested in why each Local class doesn't do the dereferencing themselves like most internal ref_ptr type helpers.
My thought is that HandleScope can do it more efficiently by dumping a large number of handles all at once rather than one by one as they would in a ref_ptr type scoped class.
Here is how I understand the documentation and the handles-inl.h source code. I, too, might be completely wrong since I'm not a V8 developer and documentation is scarce.
The garbage collector will, at times, move stuff from one memory location to another and, during one such sweep, also check which objects are still reachable and which are not. In contrast to reference-counting types like std::shared_ptr, this is able to detect and collect cyclic data structures. For all of this to work, V8 has to have a good idea about what objects are reachable.
On the other hand, objects are created and deleted quite a lot during the internals of some computation. You don't want too much overhead for each such operation. The way to achieve this is by creating a stack of handles. Each object listed in that stack is available from some handle in some C++ computation. In addition to this, there are persistent handles, which presumably take more work to set up and which can survive beyond C++ computations.
Having a stack of references requires that you use it in a stack-like way. There is no "invalid" mark in that stack: all the objects from the bottom to the top of the stack are valid object references. The way to ensure this is the HandleScope. It keeps things hierarchical. With reference-counted pointers you can do something like this:
#include <memory>
using std::shared_ptr;

struct Object { explicit Object(int) {} };

shared_ptr<Object> f() {
    shared_ptr<Object> a(new Object(1)); // object 1 never escapes f()
    shared_ptr<Object> b(new Object(2)); // object 2 escapes via the return value
    return b;
}

void g() {
    shared_ptr<Object> c = f(); // by now object 1 is already destroyed
} // object 2 is destroyed here
Here object 1 is created first, then object 2 is created, then the function returns and object 1 is destroyed, and only later is object 2 destroyed. The key point is that there is a span of time during which object 1 is invalid but object 2 is still valid. That is the out-of-order lifetime that HandleScope's stack discipline avoids.
Some other GC implementations examine the C stack and look for pointers they find there. This has a good chance of false positives, since stuff which is in fact data could be misinterpreted as a pointer. For reachability this might seem rather harmless, but when rewriting pointers since you're moving objects, this can be fatal. It has a number of other drawbacks, and relies a lot on how the low level implementation of the language actually works. V8 avoids that by keeping the handle stack separate from the function call stack, while at the same time ensuring that they are sufficiently aligned to guarantee the mentioned hierarchy requirements.
To offer yet another comparison: an object referenced by just one shared_ptr becomes collectible (and actually will be collected) once its C++ block scope ends. An object referenced by a v8::Handle becomes collectible when leaving the nearest enclosing scope which contains a HandleScope object. So programmers have more control over the granularity of stack operations. In a tight loop where performance is important, it might be useful to maintain just a single HandleScope for the whole computation, so that you won't have to access the handle stack data structure so often. On the other hand, doing so will keep all the objects around for the whole duration of the computation, which would be very bad indeed if this were a loop iterating over many values, since all of them would be kept around till the end. But the programmer has full control, and can arrange things in the most appropriate way.
Personally, I'd make sure to construct a HandleScope:
At the beginning of every function which might be called from outside your code. This ensures that your code will clean up after itself.
In the body of every loop which might see more than three or so iterations, so that you only keep variables from the current iteration.
Around every block of code which is followed by some callback invocation, since this ensures that your stuff can get cleaned if the callback requires more memory.
Whenever I feel that something might produce considerable amounts of intermediate data which should get cleaned (or at least become collectible) as soon as possible.
In general I'd not create a HandleScope for every internal function if I can be sure that every other function calling this will already have set up a HandleScope. But that's probably a matter of taste.
Disclaimer: this may not be an official answer, more of a conjecture on my part, but the V8 documentation is hardly useful on this topic, so I may be proven wrong.
From my understanding, from developing various V8-backed applications, it's a means of handling the difference between the C++ and JavaScript environments.
Imagine the following sequence, in which a self-dereferencing pointer could break the system:
JavaScript calls a C++-wrapped V8 function: let's say helloWorld()
The C++ function creates a v8::Handle with the value "hello world =x"
C++ returns the value to the V8 virtual machine
The C++ function does its usual cleanup of resources, including dereferencing of handles
Another C++ function / process overwrites the freed memory space
V8 reads the handle: the data is no longer the same, "hell!#(#..."
And that's just the surface of the complicated inconsistencies between the two. Hence, to tackle the various issues of connecting the JavaScript VM (virtual machine) to the C++ interfacing code, I believe the development team decided to simplify the issue via the following:
All variable handles are to be stored in "buckets", aka HandleScopes, to be built / compiled / run / destroyed by their respective C++ code when needed.
Additionally, all function handles are to refer only to C++ static functions (I know this is irritating), which ensures the "existence" of the function call regardless of constructors / destructors.
Think of it from a development point of view: it marks a very strong distinction between the JavaScript VM development team and the C++ integration team (the Chrome dev team?), allowing both sides to work without interfering with one another.
Lastly, it could also be for the sake of simplicity in emulating multiple VMs, as V8 was originally meant for Google Chrome. A simple HandleScope creation and destruction whenever we open / close a tab makes for much easier GC management, especially in cases where you have many VMs running (each tab in Chrome).

How far should I go to avoid internal getters/setters within a class

I have more of a "how much is too much" question. I have a Java class that defines several getters/setters for use by external classes (about 30 altogether). However, the Java class itself requires the use of these variables as well in some cases.
I understand the concept of using member fields instead of the getter methods within a class, but the getters in this case perform a function (unmasking an integer to be specific) to create the value to be returned.
So from a performance and memory reduction perspective, for the few calls within the class that need those values, I'm curious if I should...
a. Just call the getter
b. Do the unmasking wherever I need the values throughout the class, just like the getter
c. Create variables to hold those values, load these up by calling all the getters on startup, and use those within the class (30 or so integers may not be a serious memory risk, but I would also need to add to my code to keep those updated if a user sets new values...since the value is updated and masked).
Any thoughts are appreciated!
A. Just call the getter.
From a performance and memory-reduction perspective, there is really little to no impact from constantly re-using the same functions. That's what code reuse is all about.
From a high-level execution/performance view, we do something like this:
code: myGetter()
program: push the program state (very few cycles)
program: jump to myGetter (1 clock cycle)
program: execute myGetter (don't know your code, but probably very few cycles)
program: save the result (1 clock cycle)
program: pop the program state (very few cycles)
program: continue to the next line of code (1 clock cycle)
In performance, the golden rule of thumb is to spend your time optimizing what really makes a difference. For all general purposes, disk I/O takes up the most time and resources.
Hope this helps!
a) Call the getter - as you pointed out it's the right and clean way in your case.
b) and c) would be premature optimization and most likely do more harm than good (unless you REALLY know that this particular spot will be a hot spot in your code AND your JIT-compiler will not be able to optimize it for you).
If you really hit performance problems at some point, then profile the application and optimize only hot spots manually.
Don't Repeat Yourself is the guiding principle here. Trying to save a function call by repeating the same unmasking code throughout a class is a recipe for disaster. I would just call the getter within the class.
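To illustrate option (a) concretely, here is a hypothetical Java sketch (the field, mask, and method names are invented) in which the unmasking logic lives in exactly one place and internal code goes through the getter like any external caller:
public class DeviceSettings {
    // Assumed bit layout: speed is stored in bits 8-15 of a packed field.
    private static final int SPEED_MASK = 0xFF;
    private static final int SPEED_SHIFT = 8;

    private int packedFlags;

    // The single source of truth for unmasking the value.
    public int getSpeed() {
        return (packedFlags >> SPEED_SHIFT) & SPEED_MASK;
    }

    public void setSpeed(int speed) {
        packedFlags = (packedFlags & ~(SPEED_MASK << SPEED_SHIFT))
                | ((speed & SPEED_MASK) << SPEED_SHIFT);
    }

    // Internal callers use the getter too (option a). The JIT will almost
    // certainly inline an accessor this small, so nothing is duplicated
    // and nothing measurable is paid for the call.
    public String describe() {
        return "speed=" + getSpeed();
    }
}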
