Is there a way to speed up the library loading in R? - performance

I have an Rscript that loads ggplot2 on its first line.
Loading a library doesn't take much time on its own, but this script may be executed from the command line millions of times, so speed really matters to me.
Is there a way to speed up this loading process?

Don't restart -- keep a persistent R session and just issue requests to it. Something like Rserve can provide this, and FastRWeb, for example, uses it very well -- with millisecond round-trips for chart generation.
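For illustration, a minimal sketch of the idea, assuming the Rserve and RSclient packages are installed (adapt the details to your setup):

## start the server once; packages loaded there stay loaded
library(Rserve)
Rserve(args = "--no-save")

## in each short-lived client invocation, talk to the warm session
library(RSclient)
con <- RS.connect()
RS.eval(con, library(ggplot2))          # effectively a no-op after the first call
RS.eval(con, qplot(1:10, rnorm(10)))    # the work runs in the persistent session
RS.close(con)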

As an addition to @MikeDunlavey's answer:
Actually, both library and require check whether the package is already loaded.
Here are some timings I get with microbenchmark:
> microbenchmark(`!`(exists("qplot")),
                 `!`(existsFunction('qplot')),
                 require('ggplot2'),
                 library('ggplot2'),
                 "package:ggplot2" %in% search())
## results reordered with descending median:
Unit: microseconds
                             expr     min       lq   median       uq     max
3              library("ggplot2") 259.720 262.8700 266.3405 271.7285 448.749
1        !existsFunction("qplot")  79.501  81.8770  83.7870  89.2965 114.182
5              require("ggplot2")  12.556  14.3755  15.5125  16.1325  33.526
4 "package:ggplot2" %in% search()   4.315   5.3225   6.0010   6.5475   9.201
2                !exists("qplot")   3.370   4.4250   5.0300   6.2375  12.165
For comparison, loading for the first time:
> system.time (library (ggplot2))
user system elapsed
0.284 0.016 0.300
(these are seconds!)
In the end, unless you need the factor of roughly 3 (about 10 μs) between require and "package:ggplot2" %in% search(), I'd go with require; otherwise use the %in% search() check.
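In script form, the cheap guard from the timings above looks like this (a minimal sketch):

if (!("package:ggplot2" %in% search())) {
  library(ggplot2)
}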

What Dirk said, plus you can use the exists function to conditionally load a library, as in
if (!exists("some.function.defined.in.the.library")) {
  library(the.library)
}
So if you put that in the script you can run the script more than once in the same R session.

Related

GPROF flat profile empty result

I'm trying to use gprof (I have to use gprof -- no other option is available), but when I generate the flat profile, the result is empty even though everything runs fine.
By the way, the code is in C, so I'm using gcc.
Result:
Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
What can I do to fix this? Thank you.
You can use the gcc -no-pie option as a workaround.
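For reference, a typical compile-and-profile sequence with that workaround (file names are illustrative; -pg enables gprof instrumentation):

gcc -pg -no-pie -o myprog main.c
./myprog                         # writes gmon.out in the current directory
gprof ./myprog gmon.out > analysis.txt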

Is it better to declare a local inside or outside a loop? [duplicate]

This question already has an answer here:
In Lua, should I define a variable every iteration of a loop or before the loop?
(1 answer)
Closed 6 years ago.
I am used to doing this:
do
  local a
  for i=1,1000000 do
    a = <some expression>
    <...> --do something with a
  end
end
instead of
for i=1,1000000 do
  local a = <some expression>
  <...> --do something with a
end
My reasoning is that creating a local variable 1,000,000 times is less efficient than creating it just once and reusing it on each iteration.
My question is: is this true, or is there another technical detail I am missing? I am asking because I don't see anyone doing this, but I am not sure whether that is because the advantage is too small or because it is in fact worse. By better I mean using less memory and running faster.
Like any performance question, measure first.
On a Unix system you can use time:
time lua -e 'local a; for i=1,100000000 do a = i * 3 end'
time lua -e 'for i=1,100000000 do local a = i * 3 end'
the output:
real 0m2.320s
user 0m2.315s
sys 0m0.004s
real 0m2.247s
user 0m2.246s
sys 0m0.000s
The more local version appears to be a small percentage faster in Lua, since it does not initialize a to nil. However, that is no reason to choose it: use the most local scope because it is more readable (this is good style in all languages; see this question asked for C, Java, and C#).
If you are reusing a table instead of creating it in the loop then there is likely a more significant performance difference. In any case, measure and favour readability whenever you can.
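To illustrate the table-reuse point above, a small hedged sketch (the work done with t is a placeholder):

-- new table every iteration: allocates garbage each time
for i = 1, 1000000 do
  local t = { x = i, y = i * 2 }
  -- ... use t ...
end

-- one table reused: fields are overwritten, no per-iteration allocation
local t = {}
for i = 1, 1000000 do
  t.x, t.y = i, i * 2
  -- ... use t ...
end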
I think there's some confusion about the way compilers deal with variables. From a high-level kind of human perspective, it feels natural to think of defining and destroying a variable to have some kind of "cost" associated with it.
Yet that's not necessarily the case for an optimizing compiler. The variables you create in a high-level language are more like temporary "handles" into memory. The compiler looks at those variables, translates them into an intermediate representation (something closer to the machine), and figures out where to store everything, predominantly with the goal of allocating registers (the most immediate form of memory for the CPU to use). Then it translates the IR into machine code, where the idea of a "variable" doesn't even exist, only places to store data (registers, cache, DRAM, disk).
This process includes reusing the same registers for multiple variables provided that they do not interfere with each other (provided that they are not needed simultaneously: not "live" at the same time).
Put another way, with code like:
local a = <some expression>
The resulting assembly could be something like:
load gp_register, <result from expression>
... or it may already have the result from some expression in a register, and the variable ends up disappearing completely (just using that same register for it).
... which means there's no "cost" to the existence of the variable. It just translates directly to a register which is always available. There's no "cost" to "creating a register", as registers are always there.
When you start creating variables at a broader (less local) scope, contrary to what you think, you may actually slow down the code. When you do this superficially, you're kind of fighting against the compiler's register allocation, and making it harder for the compiler to figure out what registers to allocate for what. In that case, the compiler might spill more variables into the stack which is less efficient and actually has a cost attached. A smart compiler may still emit equally-efficient code, but you could actually make things slower. Helping the compiler here often means more local variables used in smaller scopes where you have the best chance for efficiency.
In assembly code, reusing the same registers whenever you can is efficient to avoid stack spills. In high-level languages with variables, it's kind of the opposite. Reducing the scope of variables helps the compiler figure out which registers it can reuse because using a more local scope for variables helps inform the compiler which variables aren't live simultaneously.
Now there are exceptions when you start involving user-defined constructor and destructor logic in languages like C++ where reusing an object might prevent redundant construction and destruction of an object that can be reused. But that doesn't apply in a language like Lua, where all variables are basically plain old data (or handles into garbage-collected data or userdata).
The only case where you might see an improvement using less local variables is if that somehow reduces work for the garbage collector. But that's not going to be the case if you simply re-assign to the same variable. To do that, you would have to reuse whole tables or user data (without re-assigning). Put another way, reusing the same fields of a table without recreating a whole new one might help in some cases, but reusing the variable used to reference the table is very unlikely to help and could actually hinder performance.
All local variables are "created" at compile (load) time and are simply indexes into the function activation record's locals block. Each time you define a local, that block grows by 1. Each time a do..end/lexical block ends, it shrinks back. The peak value is used as the total size:
function ()
  local a   -- current:1, peak:1
  do
    local x -- current:2, peak:2
    local y -- current:3, peak:3
  end
  -- current:1, peak:3
  do
    local z -- current:2, peak:3
  end
end
The above function has 3 local slots (determined at load, not at runtime).
Regarding your case, there is no difference in locals block size, and moreover, luac/5.1 generates equal listings (only indexes change):
$ luac -l -
local a; for i=1,100000000 do a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7fee6b600000)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 1 -1 ; 1
2 [1] LOADK 2 -2 ; 100000000
3 [1] LOADK 3 -1 ; 1
4 [1] FORPREP 1 1 ; to 6
5 [1] MUL 0 4 -3 ; - 3 // [0] is a
6 [1] FORLOOP 1 -2 ; to 5
7 [1] RETURN 0 1
vs
$ luac -l -
for i=1,100000000 do local a = i * 3 end
^D
main <stdin:0,0> (7 instructions, 28 bytes at 0x7f8302d00020)
0+ params, 5 slots, 0 upvalues, 5 locals, 3 constants, 0 functions
1 [1] LOADK 0 -1 ; 1
2 [1] LOADK 1 -2 ; 100000000
3 [1] LOADK 2 -1 ; 1
4 [1] FORPREP 0 1 ; to 6
5 [1] MUL 4 3 -3 ; - 3 // [4] is a
6 [1] FORLOOP 0 -2 ; to 5
7 [1] RETURN 0 1
// [n]-comments are mine.
First, note this: defining the variable inside the loop ensures that after one iteration the next iteration cannot reuse that same stored value. Defining it before the for-loop makes it possible to carry a value through multiple iterations, like any other variable not defined within the loop (see the small sketch below).
Further, to answer your question: yes, it is less efficient, because it re-initializes the variable. If the Lua JIT/compiler has good pattern recognition, it might just reset the variable instead, but I can neither confirm nor deny that.
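A small illustrative sketch of the first point, carrying a value across iterations (purely an example):

-- declared outside the loop, so the previous value survives each iteration
local prev
for i = 1, 10 do
  if prev then print(i - prev) end
  prev = i
end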

Best way to measure elapsed time in Scheme

I have some kind of "main loop" using glut. I'd like to be able to measure how much time it takes to render a frame. The time taken to render a frame might be used for other calculations. Using the function time isn't adequate:
(time (procedure))
I found out that there is a function called current-time. I had to import some package to get it.
(define ct (current-time))
which defines ct as a time object. Unfortunately, I couldn't find any arithmetic packages for dates in Scheme. I saw that in Racket there is something called current-inexact-milliseconds, which is exactly what I'm looking for because it has nanosecond-level precision.
Using the time object, there is a way to convert it to nanoseconds using
(time->nanoseconds ct)
This lets me do something like this
(let ((newTime (current-time)))
  (block)
  (print (- (time->nanoseconds newTime) (time->nanoseconds oldTime)))
  (set! oldTime newTime))
This seems good enough for me, except that for some reason it was printing things like this:
0
10000
0
0
10000
0
10000
I'm rendering things using OpenGL, and I find it hard to believe that some rendering loops take 0 nanoseconds, or that each loop is stable enough to always take the same number of nanoseconds.
Your results are not so surprising once you consider the limited timer resolution of each system. There are limits that depend, in general, on the processor and on the OS; timers cannot count as accurately as we might expect, even though a quartz oscillator can reach and exceed a period of one nanosecond. You are also limited by the accuracy and resolution of the functions you used. I had a look at the documentation of CHICKEN Scheme, but there is nothing similar to Racket's (current-inexact-milliseconds) → real?.
After digging around, I came up with the solution of writing it in C and binding it to Scheme using the bind extension.
(require-extension bind)

(bind-rename "getTime" "current-microseconds")

(bind* #<<EOF
uint64_t getTime();

#ifndef CHICKEN
#include <sys/time.h>
uint64_t getTime() {
    struct timeval tim;
    gettimeofday(&tim, NULL);
    return 1000000 * tim.tv_sec + tim.tv_usec;
}
#endif
EOF
)
Unfortunately this solution isn't the best, because it is CHICKEN-only. It could be implemented as a library, but a library wrapping a single function that doesn't exist in any other Scheme doesn't make much sense.
Since nanosecond precision doesn't actually make much sense here after all, I went with microseconds instead.
Note the trick here: the function to be wrapped is declared above, and the #ifndef CHICKEN guard prevents the include and the definition from being parsed by bind. When the file is compiled by gcc, it is built with the include and the function definition.
CHICKEN has current-milliseconds: http://api.call-cc.org/doc/library/current-milliseconds
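For example, a minimal sketch of timing a block with it (render-frame stands in for your own code):

(let ((start (current-milliseconds)))
  (render-frame)
  (print (- (current-milliseconds) start)))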

Luaj os.time() return milliseconds

os.time() in Luaj returns time in milliseconds, but according to the Lua documentation, it should return time in seconds.
Is this a bug in Luaj?
And can you suggest a workaround that will work with both Luaj (for Java) and real Lua (C/C++)? I have to use the same Lua source for both applications. (I can't simply divide by 1000, as the two return different time scales.)
Example in my Lua file:
local start = os.time()
while (true) do
  print(os.time() - start)
end
In C++, I received this output:
1
1
1
...(1 second passed)
2
2
2
In Java (using Luaj), I got:
1
...(terminated in Eclipse as fast as my finger could)
659
659
659
659
FYI, I tried this on Windows.
Yes, there's a bug in Luaj.
The implementation just returns System.currentTimeMillis() when you call os.time(). It should really return something like (long)(System.currentTimeMillis()/1000.)
It's also worth pointing out that the os.date and os.time handling in luaj is almost completely missing. I would recommend that you assume that they've not been implemented yet.
Lua manual about os.time():
The returned value is a number, whose meaning depends on your system. In POSIX, Windows, and some other systems, this number counts the number of seconds since some given start time (the "epoch"). In other systems, the meaning is not specified, and the number returned by time can be used only as an argument to os.date and os.difftime.
So, any Lua implementation could freely change the meaning of os.time() value.
It appears you've already confirmed that it's a bug in Luaj; as for the workaround, you can replace os.time() with your own version:
if (runningunderluaj) then
  local ostime = os.time
  os.time = function(...) return ostime(...)/1000 end
end
where runningunderluaj can check for some global variable that is only set under luaj. If that's not available, you can probably come up with your own check by comparing the results from calls to os.clock and os.time that measure time difference:
local s = os.clock()
local t = os.time()
while true do
  if os.clock() - s > 0.1 then break end
end
-- (at least) 100ms has passed
local runningunderluaj = os.time() - t > 1
Note: It's possible that os.clock() is "broken" as well. I don't have access to luaj to test this...
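One possible shortcut, if your Luaj setup happens to expose the luajava binding library (an assumption about your environment, not something stock Lua provides):

-- assumes Luaj's standard globals install a `luajava` table; plain Lua has no such global
local runningunderluaj = (luajava ~= nil)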
In luaj-3.0-beta2, this has been fixed to return time in seconds.
This was a bug in all versions of luaj up to and including luaj-3.0-beta1.

Improving Postgres psycopg2 query performance for Python to the same level of Java's JDBC driver

Overview
I'm attempting to improve the performance of our database queries for SQLAlchemy. We're using psycopg2. In our production system, we're choosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.
I think my next step is going to be to end up patching the psycopg2 library to behave like the JDBC driver. If that's the case and someone has already done this, that would be fine, but I am hoping I've still got a settings or refactoring tweak I can do from Python.
Details
I have a simple "SELECT * FROM someLargeDataSetTable" query running. The dataset is GBs in size. A quick performance chart is as follows:
Timing Table
Records | JDBC | SQLAlchemy[1] | SQLAlchemy[2] | Psql
--------------------------------------------------------------------
1 (4kB) | 200ms | 300ms | 250ms | 10ms
10 (8kB) | 200ms | 300ms | 250ms | 10ms
100 (88kB) | 200ms | 300ms | 250ms | 10ms
1,000 (600kB) | 300ms | 300ms | 370ms | 100ms
10,000 (6MB) | 800ms | 830ms | 730ms | 850ms
100,000 (50MB) | 4s | 5s | 4.6s | 8s
1,000,000 (510MB) | 30s | 50s | 50s | 1m32s
10,000,000 (5.1GB) | 4m44s | 7m55s | 6m39s | n/a
--------------------------------------------------------------------
5,000,000 (2.6GB) | 2m30s | 4m45s | 3m52s | 14m22s
--------------------------------------------------------------------
[1] - With the processrow function
[2] - Without the processrow function (direct dump)
I could add more (our data can be as much as terabytes), but I think the changing slope is evident from the data. JDBC simply performs significantly better as the dataset size increases. Some notes...
Timing Table Notes:
The datasizes are approximate, but they should give you an idea of the amount of data.
I'm using the 'time' tool from a Linux bash command line.
The times are the wall clock times (i.e. real).
I'm using Python 2.6.6 and I'm running with python -u
Fetch Size is 10,000
I'm not really worried about the Psql timing, it's there just as a reference point. I may not have properly set fetchsize for it.
I'm also really not worried about the timing below the fetch size as less than 5 seconds is negligible to my application.
Java and Psql appear to take about 1GB of memory resources; Python is more like 100MB (yay!!).
I'm using the cdecimal library.
I noticed a recent article discussing something similar to this. It appears that the JDBC driver design is totally different from the psycopg2 design (which I find rather annoying given the performance difference).
My use-case is basically that I have to run a daily process (with approximately 20,000 different steps... multiple queries) over very large datasets and I have a very specific window of time where I may finish this process. The Java we use is not simply JDBC, it's a "smart" wrapper on top of the JDBC engine... we don't want to use Java and we'd like to stop using the "smart" part of it.
I'm using one of our production system's boxes (database and backend process) to run the query. So this is our best-case timing. We have QA and Dev boxes that run much slower and the extra query time can become significant.
testSqlAlchemy.py
#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
    import cdecimal
    sys.modules["decimal"] = cdecimal
except ImportError, e:
    print >> sys.stderr, "Error: cdecimal didn't load properly."
    raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

def processrow(row, delimiter="|", null="\N"):
    # Render one row as a delimited line, substituting the NULL marker.
    newrow = []
    for x in row:
        if x is None:
            x = null
        newrow.append(str(x))
    return delimiter.join(newrow)

fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass@server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()

with open("test.sql", "r") as queryFD:
    with open("/dev/null", "w") as nullDev:
        query = session.execute(queryFD.read())
        cur = query.cursor
        while cur.statusmessage not in ['FETCH 0', 'CLOSE CURSOR']:
            for row in query.fetchmany(fetchsize):
                print >> nullDev, processrow(row)
After timing, I also ran a cProfile and this is the dump of worst offenders:
Timing Profile (with processrow)
Fri Mar 4 13:49:45 2011 sqlAlchemy.prof
415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 563.924 563.924 {execfile}
1 25.151 25.151 563.924 563.924 testSqlAlchemy.py:2()
1001 0.050 0.000 329.285 0.329 base.py:2679(fetchmany)
1001 5.503 0.005 314.665 0.314 base.py:2804(_fetchmany_impl)
10000003 4.328 0.000 307.843 0.000 base.py:2795(_fetchone_impl)
10011 0.309 0.000 302.743 0.030 base.py:2790(__buffer_rows)
10011 233.620 0.023 302.425 0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
10000000 145.459 0.000 209.147 0.000 testSqlAlchemy.py:13(processrow)
Timing Profile (without processrow)
Fri Mar 4 14:03:06 2011 sqlAlchemy.prof
305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 536.370 536.370 {execfile}
1 29.503 29.503 536.369 536.369 testSqlAlchemy.py:2()
1001 0.066 0.000 333.806 0.333 base.py:2679(fetchmany)
1001 5.444 0.005 318.462 0.318 base.py:2804(_fetchmany_impl)
10000003 4.389 0.000 311.647 0.000 base.py:2795(_fetchone_impl)
10011 0.339 0.000 306.452 0.031 base.py:2790(__buffer_rows)
10011 235.664 0.024 306.102 0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
10000000 32.904 0.000 172.802 0.000 base.py:2246(__repr__)
Final Comments
Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' of the output. The Java we are using currently already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
This is not an out-of-the-box answer; with all client/DB work you may need to do some digging to determine exactly what is amiss.
Back up postgresql.conf, then change:
log_min_duration_statement = 0
log_destination = 'csvlog'        # Valid values are combinations of
logging_collector = on            # Enable capturing of stderr and csvlog
log_directory = 'pg_log'          # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'   # log file name pattern,
debug_print_parse = on
debug_print_rewritten = on
debug_print_plan = on
log_min_messages = info           # (debug1 for all server versions prior to 8.4)
Stop and restart your database server (reload may not pick up the changes).
Reproduce your tests, ensuring that the server time and client times match and that you record the start times, etc.
copy the log file off and import it into an editor of your choice (Excel or another spreadsheet can be useful for advanced manipulation of SQL & plans, etc.)
now examine the timings from the server side and note:
is the sql reported on the server the same in each case
if the same you should have the same timings
is the client generating a cursor rather than passing sql
is one driver doing a lot of casting/converting between character sets or implicit converting of other types such as dates or timestamps.
and so on
The plan data will be included for completeness; this may show whether there are gross differences in the SQL submitted by the clients.
The stuff below is probably aiming above and beyond what you have in mind or what is deemed acceptable in your environment, but I'll put the option on the table just in case.
Is the destination of every SELECT in your test.sql truly a simple |-separated results file?
Is non-portability (Postgres-specificity) acceptable?
Is your backend Postgres 8.2 or newer?
Will the script run on the same host as the database backend, or would it be acceptable to generate the |-separated results file(s) from within the backend (e.g. to a share?)
If the answer to all of the above questions is yes, then you can transform your SELECT ... statements to COPY ( SELECT ... ) TO E'path-to-results-file' WITH DELIMITER '|' NULL E'\\N'.
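For instance (the path and the table name are placeholders), the transformation looks like this:

-- before
SELECT * FROM someLargeDataSetTable;

-- after: runs inside the backend and writes the delimited file directly
COPY (SELECT * FROM someLargeDataSetTable)
  TO E'/path/to/results.txt'
  WITH DELIMITER '|' NULL E'\\N';

If the file has to end up on the client instead, the same idea can be driven from psycopg2's copy_expert; a hedged sketch, assuming connection parameters roughly like the ones in your script:

import psycopg2

conn = psycopg2.connect("dbname=db user=usr password=pass host=server port=5432")
cur = conn.cursor()
with open("results.txt", "w") as out:
    # COPY ... TO STDOUT streams the delimited rows straight into the file object
    cur.copy_expert(
        r"COPY (SELECT * FROM someLargeDataSetTable) TO STDOUT "
        r"WITH DELIMITER '|' NULL E'\\N'",
        out)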
An alternative could be to use ODBC. This assumes that the Python ODBC driver performs well.
PostgreSQL has ODBC drivers for both Windows and Linux.
As someone who programmed mostly in assembler, there is one thing that sticks out as obvious. You are losing time in the overhead, and the overhead is what needs to go.
Rather than using Python, which wraps itself in something else that integrates with something that is a C wrapper around the DB... just write the code in C. I mean, how long can it take? Postgres is not hard to interface with (quite the opposite). C is an easy language. The operations you are performing seem pretty straightforward. You can also use SQL embedded in C; it's just a matter of a pre-compile. No need to translate what you were thinking -- just write it there along with the C and use the supplied ECPG preprocessor (read the Postgres manual, chapter 29 IIRC); see the hedged sketch below.
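A hedged ECPG sketch of that suggestion (the connection target, table, and column names are placeholders; preprocess with ecpg, then compile the generated C with gcc and link against libecpg):

/* dump.pgc -- build with: ecpg dump.pgc && gcc dump.c -lecpg -o dump */
#include <stdio.h>

EXEC SQL INCLUDE sqlca;

int main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;
    char value[256];
    EXEC SQL END DECLARE SECTION;

    EXEC SQL CONNECT TO mydb;

    EXEC SQL DECLARE cur CURSOR FOR SELECT somecolumn FROM someLargeDataSetTable;
    EXEC SQL OPEN cur;

    /* leave the loop when FETCH finds no more rows */
    EXEC SQL WHENEVER NOT FOUND DO BREAK;
    while (1) {
        EXEC SQL FETCH cur INTO :value;
        printf("%s\n", value);
    }

    EXEC SQL CLOSE cur;
    EXEC SQL DISCONNECT ALL;
    return 0;
}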
Take out as much of the in-betweeny interfacey stuff as you can, cut out the middle man and get talking to the database natively. It seems to me that in trying to make the system simpler you are actually making it more complicated than it needs to be. When things are getting really messy, I usually ask myself the question "What bit of the code am I most afraid to touch?" - that usually points me to what needs changing.
Sorry for babbling on, but maybe a step backward and some fresh air will help ;)

Resources