How can I determine why my Racket code runs so slowly?

Just for fun, I wrote a quick Racket command-line script to parse old Unix fortune files. Fortune files are just giant text files, with entries separated by a line containing a single %.
Just as a quick first hack, I wrote the following Racket code:
(define fortunes
  (with-input-from-file "fortunes.txt"
    (λ ()
      (regexp-split #rx"%" (port->string)))))
I thought it would run nearly instantly. Instead, it takes a very long time to run—on the order of a couple of minutes. In comparison, what I think of as equivalent Python:
with open('fortunes.txt') as f:
    fortunes = f.read().split('%')
executes immediately, with equivalent results to the Racket code.
What am I doing wrong here? Yes, there's some obvious low-hanging fruit (things would surely improve if I didn't slurp the whole file into RAM with port->string), but the behavior is so pathologically bad that I feel I must be doing something stupid at a much higher level than that.
Is there a more Racket-like way to do this with equivalently better performance? Is Racket I/O really poor for some operations? Is there some way to profile my code slightly deeper than the naïve profiler in DrRacket so I can figure out what about a given line is causing a problem?
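For what it's worth, Racket ships with a sampling profiler in the profile collection that runs outside DrRacket. A minimal sketch (profile-thunk is the library's standard entry point; whether it digs deeper than DrRacket's profiler will vary):

(require profile)

(profile-thunk
 (λ ()
   (with-input-from-file "fortunes.txt"
     (λ () (regexp-split #rx"%" (port->string))))))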
EDIT: The fortunes file I'm using is FreeBSD's as found at http://fortunes.cat-v.org/freebsd/, which weighs in at about 2 MB. The best runtime for Racket 5.1.3 x64 on OS X Lion was:
real 1m1.479s
user 0m57.400s
sys 0m0.691s
For Python 2.7.1 x64, it was:
real 0m0.057s
user 0m0.029s
sys 0m0.015s
Eli's right that the time is being spent almost entirely in regexp-split (although a full second appears to be spent in port->string), but it's not clear to me that there's a preferred yet equally simple method.

Looks like most of the cost is due to running regexp-split on a string. The fastest alternative I found was splitting a byte string, then converting the results to strings:
(map bytes->string/utf-8
     (call-with-input-file "db"
       (λ (i) (regexp-split #rx#"%" (port->bytes i)))))
With a random fortune DB of ~2MB, your code takes about 35s, and this version takes 33ms.
(I'm not sure yet why it takes so long on a string, but it's definitely way too slow.)
EDIT: We tracked it down to an efficiency bug. Rough description: when Racket does a regexp-match on a string, it converts large parts of the string to a byte string (in UTF-8) for the search; this function is the core one implemented in C. regexp-split uses it repeatedly to find all of the matches, and therefore keeps redoing this conversion. I'm looking at a way to do things better, but until it's fixed, use the above workaround.

This is now fixed in the latest Git HEAD version of Racket, see: github.com/plt/racket/commit/8eefaba. Your example now runs in 0.1 seconds for me.

Related

Even when using the same randomseed in Lua, I get different results?

I have a large, rather complicated procedural content generation Lua project. One thing I want to be able to do, for debugging purposes, is use a fixed random seed so that I can re-run the system & get the same results.
To that end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is, what other ways are there to influence the output of Lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16: monkey patching math.random & math.randomseed has oftentimes (but not always) output the same sequence of random numbers, but still not the same results. So I guess the real question is now: what behavior in Lua is indeterminate and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (This code does NOT use coroutines, so I don't think it's a threading / race condition issue.)
randomseed uses the C srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with srandom or random function in your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
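For example, a minimal logging sketch (the wrapper structure and log-file name are illustrative, not from your project):

-- Wrap the stock generator functions and log every call.
local real_random, real_randomseed = math.random, math.randomseed
local log = io.open("random_trace.log", "w")

math.randomseed = function(seed)
  log:write("randomseed(", tostring(seed), ")\n")
  return real_randomseed(seed)
end

math.random = function(...)
  local result = real_random(...)
  -- debug.traceback() records where each call came from
  log:write("random(", table.concat({...}, ", "), ") -> ",
            result, "\n", debug.traceback(), "\n")
  return result
end

Diffing random_trace.log between two runs should show the first call at which the sequences diverge.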
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output a call to math.random; if this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc.).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (i.e., first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone were enough, you would have already found it, so a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays, testing whether their elements match up & returning / printing the index of the first mismatch if they do not (a sketch follows the checklist below). Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
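A minimal sketch of such a test function, assuming flat arrays of directly comparable values (names are illustrative):

-- Returns nil if the arrays match, otherwise the index of the
-- first mismatch.
local function first_mismatch(a, b)
  for i = 1, math.max(#a, #b) do
    if a[i] ~= b[i] then return i end
  end
  return nil
end

assert(first_mismatch(original, debug01) == nil)
assert(first_mismatch(original, debug02) ~= nil)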
I cannot stress enough the insanity of using an unverified (and thus potentially buggy) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.

About the speed of user-defined versus built-in procedures in Scheme (related to SICP exercise 1.23)

// My question was long, so I've shortened it.
In Scheme, do user-defined procedures take more time than built-in procedures? (Assuming both do the same work.)
// That's my short version; the long version is below.
Exercise 1.23 is the problem (quoted below): why isn't the (next) procedure twice as fast as (+ 1)?
This is my guess.
Reason 1: (next) contains an 'if' (a special form), and evaluating it consumes time.
Reason 2: the extra function call consumes time.
http://community.schemewiki.org/?sicp-ex-1.23 says reason 1 is right.
And I want to know whether reason 2 is also right.
So I rewrote the (next) procedure: I didn't use 'if', and instead checked divisibility by 2 just once before using (next), so that (next) only does + 2. When I remeasured, it was faster than before, but the ratio still wasn't 2. So I rewrote it again, replacing the call to (next) with a bare (+ num 2). Finally the ratio became 2, or almost 2. Thinking about why led me to guess reason 2. I want to know what the correct answer is.
P.S. I'm also curious why some primes are tested (much?) faster than others. It doesn't seem to make sense: if a number n is prime, the process has to check every divisor from 2 up to sqrt(n). But some numbers are tested faster. Do you know why?
Exercise 1.23. The smallest-divisor procedure shown at the start of this section does lots of needless testing: After it checks to see if the number is divisible by 2 there is no point in checking to see if it is divisible by any larger even numbers. This suggests that the values used for test-divisor should not be 2, 3, 4, 5, 6, ..., but rather 2, 3, 5, 7, 9, .... To implement this change, define a procedure next that returns 3 if its input is equal to 2 and otherwise returns its input plus 2. Modify the smallest-divisor procedure to use (next test-divisor) instead of (+ test-divisor 1). With timed-prime-test incorporating this modified version of smallest-divisor, run the test for each of the 12 primes found in exercise 1.22. Since this modification halves the number of test steps, you should expect it to run about twice as fast. Is this expectation confirmed? If not, what is the observed ratio of the speeds of the two algorithms, and how do you explain the fact that it is different from 2?
Where the book is :
http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-11.html#%_sec_1.2.6
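For reference, here is a direct transcription of the procedures involved, following the book's definitions in section 1.2.6 (next is the procedure the exercise asks for):

(define (square x) (* x x))
(define (divides? a b) (= (remainder b a) 0))

; The exercise's next: 3 for input 2, otherwise input + 2.
(define (next n)
  (if (= n 2) 3 (+ n 2)))

(define (smallest-divisor n) (find-divisor n 2))

(define (find-divisor n test-divisor)
  (cond ((> (square test-divisor) n) n)
        ((divides? test-divisor n) test-divisor)
        (else (find-divisor n (next test-divisor)))))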
Your long and short questions are actually addressing two different problems.
Exercise 1.23 is the problem (below): why isn't the (next) procedure twice as fast as (+ 1)?
You provide two possible reasons to explain the relative lack of speed up (and 30% speed gain is already a good achievement):
The apparent use of a function next instead of a simple arithmetic expression (as I understand it from your explanations),
the inclusion of a test in that very next function,
The first reason is an illusion: (+ 1) is a function incrementing its argument, so in both cases there is a function call (although the increment function is certainly a builtin one, a point which is addressed by your other question).
The second reason is indeed relevant: any test in a block of code will introduce a potential redirection in the code flow (a jump from the current executing instruction to some other address in the program), which may incur a delay. Note that this is analogous to a function call, which will also induce an address jump, so both reasons actually resolve to only one potential cause.
Regarding your short question, builtin functions are indeed usually faster, because the compiler is able to apply a special treatment to them in certain cases. This is due to the facts that:
knowing the semantics of the builtins, compiler designers are able to include special rules pertaining to the algebraic properties of these builtins, and for instance fuse successive increments into a single call, or eliminate an increment and decrement called in sequence.
A builtin function call, when not optimised away, will be converted into a native machine-code function call, which may not have to abide by all the calling-convention rules. If your Scheme compiler produces machine code from the source, there might be only a marginal gain, but if it produces so-called bytecode, the gain might be quite substantial, since user-written functions will be translated to that bytecode format and still require some form of interpretation. If you are only using an interpreter, the gain is even more important.
I believe this is highly implementation- and settings-dependent. In many implementations there are different kinds of optimizations, and in some there are none. To get the best performance you may need to compile your code or use settings that reduce debug information / stack traces. Getting the best performance in one implementation can worsen performance in another.
Primitive procedures are usually compiled to native code, and in some implementations, like Ikarus, they're even inlined. (When you do (map car lst), Ikarus changes it to (map (lambda (x) (car x)) lst), since car isn't a full procedure.) Lambdas are supposed to be cheap. Remember that many Scheme implementations convert your code to CPS, and that means one procedure call for each expression in the body of a procedure call. It will never be as fast as machine code, since it needs to load closure variables.
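To illustrate the CPS point with a toy example (not actual Ikarus output): computing (+ (* 2 3) 1) in continuation-passing style turns every intermediate result into an argument to a further procedure call:

; Each step passes its result to a continuation k.
(define (times2-cps x k) (k (* 2 x)))
(define (add1-cps x k) (k (+ x 1)))

; (+ (* 2 3) 1) becomes a chain of calls:
(times2-cps 3 (lambda (v) (add1-cps v (lambda (r) r)))) ; => 7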
To check which of the two explanations applies in your implementation, make next do the same as it originally did, e.g. no if, just incrementing the argument. The difference is then the extra call and nothing else. Then inline next by writing its code directly into your procedure, substituting arguments for the operands. If it's still slower, the culprit is the if. Run the tests several times, preferably with a large enough number of primes that each run takes a minute or so. Use time or similar in both Scheme implementations to get the differences in ms. I use the Unix time command as well to see how the OS reflects on it.
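Concretely, the two timed variants might look like this (a sketch, with find-divisor written out as in the book):

; Variant A: next does the same work as (+ test-divisor 1),
; but through an extra procedure call.
(define (next n) (+ n 1))

(define (find-divisor-a n test-divisor)
  (cond ((> (* test-divisor test-divisor) n) n)
        ((= (remainder n test-divisor) 0) test-divisor)
        (else (find-divisor-a n (next test-divisor)))))

; Variant B: the call inlined away, leaving only the arithmetic.
(define (find-divisor-b n test-divisor)
  (cond ((> (* test-divisor test-divisor) n) n)
        ((= (remainder n test-divisor) 0) test-divisor)
        (else (find-divisor-b n (+ test-divisor 1)))))

If variant A is measurably slower than variant B, the extra call is the cost; if they match, the if in the exercise's next is the culprit.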
You should also test whether you get the same explanation in some other implementation. It's not as if there's a shortage of Scheme implementations out there, so knock yourself out! The differences between them might amaze you. I always use Racket (raco exe source to make an executable) and Ikarus. When doing a large test I include Chicken, Gambit and Chibi.

How long can I expect grep to take on a 10 TB file?

I have a 10 TB file with words from multiple books, and I'm trying to grep for some uncommon strings (no regex). For example:
grep "cappucino" filename
I'm trying to estimate how long this will take. I'm not really looking for whether it's the right approach or not. I'd like to learn more about what really happens under the hood when I call grep.
Please correct me if I'm wrong:
I'm using a mechanical hard drive with roughly 200 MB/s read speed, so it will take roughly 10 million MB / 200 MB/s = 50,000 seconds ≈ 14 hours to finish. Is this an accurate estimate?
The short answer is: no.
The longer answer is: it depends.
The even longer answer is: grep's performance depends on a lot of things:
are you running a fixed-string search (-F, fgrep) or not (see the example after this list) - grep uses the Boyer-Moore algorithm, which by itself can't match regular expressions, so what grep does (or at least used to do) is first pick a fixed string out of your regexp, find it in the text using BM, and only then run the regexp match (I'm not sure whether the current implementation uses an NFA or a DFA, probably a hybrid)
how long your pattern is - BM works faster with longer patterns
how many matches you will have - the fewer the matches, the faster it will be
what your CPU and memory are like - the hard drive helps you only during reading, not during computation
what other options you are using with your grep
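If the pattern really is a fixed string, it can help to say so explicitly, and forcing the C locale often speeds up matching as well (both are standard grep features, though the payoff varies by build):

LC_ALL=C grep -F "cappucino" filename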
14 hours might not even be your lower bound, because Boyer-Moore is smart enough to compute an offset at which the next possible match might occur, so it doesn't need to read in the whole file. This depends on the implementation, though, and is just my speculation. After re-running the test below with a much longer pattern I was able to get down to 0.23 s, and I don't think my disk is that fast; there might be some caching involved instead.
For instance, I'm running on a 500 MB/s SSD (at least that's what the manufacturer says), and grepping a 200 MB file with a very short pattern (a few chars) gives me:
With 808320 hits:
real 0m1.734s
user 0m1.334s
sys 0m0.120s
With 0 hits:
real 0m0.059s
user 0m0.046s
sys 0m0.016s
#Edit: in short, read about Boyer-Moore :-)
#Edit2: well, to check how grep works you should instead check the source code; I described a very general workflow above.

How many lines of machine code are generated by one statement in programming language X?

Reading an article about Lost Programming Skills, the author brings up this chat:
Me: How much horsepower do you need?
SE: I don't know.
Me: Let's see, how many lines of code in your main loop?
SE: 10,000.
Me: what language?
SE: Fortran
Me: ok, that's about 10 lines of machine code per line of Fortran, so
100,000 instructions per loop; how many times does the loop execute per
second?
SE: every 1/20th of a second.
Me: OK, so that's 20 x 100,000 = 2mops (which was faster than anything we had
at the time), maybe we'd better rethink this.
Which makes me wonder, what is the number for modern languages, say Ruby? How does one find out?
I don't think there's an exact number saying "for language X the compiled binary has Y lines per source-code line". But if you still want to find out, maybe you can take a large amount of compiled code and the corresponding source code and work out the average per source-code line.
You can open the binary in a debugger or disassembler, e.g. OllyDbg, to see how many instructions it generates.
In terms of determining how long a piece of code will take to execute, that doesn't even really work for Fortran any more! If you write this in Fortran 90:
SUBROUTINE foo(x, y)
  IMPLICIT NONE
  REAL, DIMENSION(:), INTENT(IN) :: x
  REAL, DIMENSION(:), INTENT(OUT) :: y
  y = EXP(x)
END SUBROUTINE foo
the line that says y = EXP(x) can take arbitrarily long to execute, depending on the size of the arrays x and y. The same goes for any language with vector assignment.
In the chat they were trying to estimate CPU performance.
If you know the CPU performance and the execution time of the loop, you can get the number of CPU instructions per loop, and from that per line.
The calculation in your chat is not precise.
You can do similarly imprecise calculations even for Ruby.
Be aware that it is wrong to say that one Fortran line is always 10 CPU instructions, but as an average for a certain loop it was true.
Estimate the time taken by your loop in Ruby.
Multiply your CPU performance (in operations per second) by the loop time. You will get operations per loop.
Divide the operations per loop by the number of lines in the loop. That is the value for your loop.
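For example, with made-up numbers: if one iteration of the loop takes 1/20th of a second on a CPU sustaining roughly 10^9 simple operations per second, that is about 5 x 10^7 operations per iteration; for a 10,000-line loop body, that works out to roughly 5,000 operations per line.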
For X = "C#" you might want to take a look at Faster Managed Code: Know What Things Cost from Microsoft. It says that modern languages in particular are heavily optimized before actually touching the hardware.

How can I tell if my tail-recursive Scheme function is being optimized correctly

I have a Scheme function whose basic form looks like this:
(define (foo param var)
  (cond ((end-condition) (return-something))
        ((other-end-condit) (return-something-else))
        (else
         (let ((newvar (if some-condition
                           (make-some-updated var)
                           (destructive-update! var))))
           (foo param newvar)))))
I feel like this is pretty clearly something that should be optimized into iteration during compilation, but when I compile it (with Chicken) it still runs incredibly slowly. (If I understand the R5RS spec, http://groups.csail.mit.edu/mac/ftpdir/scheme-reports/r5rs-html.old/r5rs_22.html, this looks like it should work.)
I wrote the exact same algorithm with a while loop in Python, and the interpreted program terminated in seconds. My compiled Scheme takes about 15 minutes, and I am positive the algorithm is the same.
I think this is a case of tail recursion not getting optimized, as I can't think what else it could possibly be, but I can't figure it out. Any ideas? For what it's worth, the var is a hash, and the destructive update merely adds an element, although it also returns the updated hash to be passed in as newvar.
That function is indeed tail-recursive, so you're good there. However, tail recursion just means that stack space won't grow, not that your program is guaranteed to run fast. If you want to see if your program is really running tail-recursively, run it while watching the total memory taken by Chicken (and make sure you aren't allocating memory in make-some-updated, which you might be). If the memory grows, then Chicken isn't compiling your program correctly according to the standard.
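As a quick sanity check, a trivial loop like this one (illustrative, not from the question) should run in flat, constant memory under any compiler that implements proper tail calls:

(define (spin n)
  (if (= n 0)
      'done
      (spin (- n 1))))

(spin 100000000)

If memory stays flat here but grows in the real program, the allocation is coming from the loop body rather than from a missing tail-call optimization.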
