Time Code in PLT-Scheme - time

I want to see how long a function takes to run. What's the easiest way to do this in PLT-Scheme? Ideally I'd want to be able to do something like this:
> (define (loopy times)
    (if (zero? times)
        0
        (loopy (sub1 times))))
> (loopy 5000000)
0 ;(after about a second)
> (timed (loopy 5000000))
Took: 0.93 seconds
0
>
It doesn't matter if I'd have to use some other syntax like (timed loopy 5000000) or (timed '(loopy 5000000)), or if it returns the time taken in a cons or something.

The standard name for timing the execution of expressions in most Scheme implementations is "time". Here is an example from within DrRacket.
(define (loopy times)
  (if (zero? times)
      0
      (loopy (sub1 times))))
(time (loopy 5000000))
cpu time: 1526 real time: 1657 gc time: 0
0
If you use time to benchmark different implementations against each other, remember to use racket from the command line rather than benchmarking directly in DrRacket (DrRacket inserts debug code in order to give better error messages).

Found it...
From the online documentation:
(time-apply proc arg-list) invokes the procedure proc with the arguments in arg-list. Four values are returned: a list containing the result(s) of applying proc, the number of milliseconds of CPU time required to obtain this result, the number of "real" milliseconds required for the result, and the number of milliseconds of CPU time (included in the first result) spent on garbage collection.
Example usage:
> (time-apply loopy '(5000000))
(0)
621
887
0
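If you want the exact (timed (loopy 5000000)) syntax from the question, a small macro over time-apply gets close. This is just a minimal sketch in current Racket; the name timed and the "Took: ... seconds" output simply mirror what the question asked for:

; Hypothetical `timed` form: run the application, print the real time in
; seconds, then return the expression's result(s).
(define-syntax-rule (timed (proc arg ...))
  (let-values ([(results cpu real gc) (time-apply proc (list arg ...))])
    (printf "Took: ~a seconds\n" (/ real 1000.0))
    (apply values results)))

After that, (timed (loopy 5000000)) prints the elapsed real time and then returns 0, as in the question.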


Up-to-date Prolog implementation benchmarks?

Are there any up-to-date Prolog implementation benchmarks (with results)?
I found this on the mercury web site. Surprisingly, it shows a 20-fold gap between swi-prolog and Aquarius. I suspect that these results are pretty old. Does this gap still hold? Personally, I'd also like to see some comparisons with the occurs check turned on, since it has a major impact on performance, and some compilers might be better than others at optimizing it away.
Of more recent comparisons, I found this claim that gnu-prolog is 2x faster than SWI, and YAP is 4x faster than SWI on one specific code base.
Edit:
"a specific case where the occurs check is needed for a real-world problem"
Sure: type inference in Haskell, OCaml, or Swift, or theorem provers such as this one. I also think the burden is on the programmer to prove that their code doesn't need the occurs check. Tests can only prove that you do need it, not that you don't.
I have some benchmark results published at:
https://logtalk.org/performance.html
Be sure to read and understand the notes at the end of that page, however.
Regarding running benchmarks with GNU Prolog, note that you cannot use the top-level interpreter, as code loaded from it is interpreted, not compiled (see the GNU Prolog documentation on gplc). In general, it is not uncommon to see people running benchmarks from the top-level interpreter, forgetting what the word interpreter means, and publishing bogus stats where compilation/term-expansion/... steps mistakenly end up mixed in with what's supposed to be benchmarked.
There's also a classical set of Prolog benchmarks that can be used for comparing Prolog implementations. Some Prolog systems include them (e.g. SWI-Prolog). They are also included in the Logtalk distribution, which allows running them with the supported backends:
https://github.com/LogtalkDotOrg/logtalk3/tree/master/examples/bench
In the current Logtalk git version, you can start it with the backend you want to benchmark and use the queries:
?- {bench(loader)}.
...
?- run.
These will run each benchmark 1000 times and report the total time. Use run/1 for a different number of repetitions. For example, on my macOS system using SWI-Prolog 8.3.15 I get:
?- run.
boyer: 20.897818 seconds
chat_parser: 7.962188999999999 seconds
crypt: 0.14653999999999812 seconds
derive: 0.004462999999997663 seconds
divide10: 0.002300000000001745 seconds
log10: 0.0011489999999980682 seconds
meta_qsort: 0.2729539999999986 seconds
mu: 0.04534600000000211 seconds
nreverse: 0.016964000000001533 seconds
ops8: 0.0016230000000021505 seconds
poly_10: 1.9540520000000008 seconds
prover: 0.05286200000000463 seconds
qsort: 0.030829000000004214 seconds
queens_8: 2.2245050000000077 seconds
query: 0.11675499999999772 seconds
reducer: 0.00044199999999960937 seconds
sendmore: 3.048624999999994 seconds
serialise: 0.0003770000000073992 seconds
simple_analyzer: 0.8428750000000065 seconds
tak: 5.495768999999996 seconds
times10: 0.0019139999999993051 seconds
unify: 0.11229400000000567 seconds
zebra: 1.595203000000005 seconds
browse: 31.000829000000003 seconds
fast_mu: 0.04102400000000728 seconds
flatten: 0.028527999999994336 seconds
nand: 0.9632950000000022 seconds
perfect: 0.36678499999999303 seconds
true.
For SICStus Prolog 4.6.0 I get:
| ?- run.
boyer: 3.638 seconds
chat_parser: 0.7650000000000006 seconds
crypt: 0.029000000000000803 seconds
derive: 0.0009999999999994458 seconds
divide10: 0.001000000000000334 seconds
log10: 0.0009999999999994458 seconds
meta_qsort: 0.025000000000000355 seconds
mu: 0.004999999999999893 seconds
nreverse: 0.0019999999999997797 seconds
ops8: 0.001000000000000334 seconds
poly_10: 0.20500000000000007 seconds
prover: 0.005999999999999339 seconds
qsort: 0.0030000000000001137 seconds
queens_8: 0.2549999999999999 seconds
query: 0.024999999999999467 seconds
reducer: 0.001000000000000334 seconds
sendmore: 0.6079999999999997 seconds
serialise: 0.0019999999999997797 seconds
simple_analyzer: 0.09299999999999997 seconds
tak: 0.5869999999999997 seconds
times10: 0.001000000000000334 seconds
unify: 0.013000000000000789 seconds
zebra: 0.33999999999999986 seconds
browse: 4.137 seconds
fast_mu: 0.0070000000000014495 seconds
nand: 0.1280000000000001 seconds
perfect: 0.07199999999999918 seconds
yes
For GNU Prolog 1.4.5, I use the sample embedding script in logtalk3/scripts/embedding/gprolog to create an executable that includes the bench example fully compiled:
| ?- run.
boyer: 9.3459999999999983 seconds
chat_parser: 1.9610000000000003 seconds
crypt: 0.048000000000000043 seconds
derive: 0.0020000000000006679 seconds
divide10: 0.00099999999999944578 seconds
log10: 0.00099999999999944578 seconds
meta_qsort: 0.099000000000000199 seconds
mu: 0.012999999999999901 seconds
nreverse: 0.0060000000000002274 seconds
ops8: 0.00099999999999944578 seconds
poly_10: 0.72000000000000064 seconds
prover: 0.016000000000000014 seconds
qsort: 0.0080000000000008953 seconds
queens_8: 0.68599999999999994 seconds
query: 0.041999999999999815 seconds
reducer: 0.0 seconds
sendmore: 1.1070000000000011 seconds
serialise: 0.0060000000000002274 seconds
simple_analyzer: 0.25 seconds
tak: 1.3899999999999988 seconds
times10: 0.0010000000000012221 seconds
unify: 0.089999999999999858 seconds
zebra: 0.63499999999999979 seconds
browse: 10.923999999999999 seconds
fast_mu: 0.015000000000000568 seconds
(27352 ms) yes
I suggest you try these benchmarks, running them on your computer, with the Prolog systems that you want to compare. In doing that, remember that this is a limited set of benchmarks, not necessarily reflecting the actual relative performance in non-trivial applications.
Ratios (each system's runtime as a percentage of the SWI-Prolog runtime):
SICStus/SWI GNU/SWI
boyer 17.4% 44.7%
browse 13.3% 35.2%
chat_parser 9.6% 24.6%
crypt 19.8% 32.8%
derive 22.4% 44.8%
divide10 43.5% 43.5%
fast_mu 17.1% 36.6%
flatten - -
log10 87.0% 87.0%
meta_qsort 9.2% 36.3%
mu 11.0% 28.7%
nand 13.3% -
nreverse 11.8% 35.4%
ops8 61.6% 61.6%
perfect 19.6% -
poly_10 10.5% 36.8%
prover 11.4% 30.3%
qsort 9.7% 25.9%
queens_8 11.5% 30.8%
query 21.4% 36.0%
reducer 226.2% 0.0%
sendmore 19.9% 36.3%
serialise 530.5% 1591.5%
simple_analyzer 11.0% 29.7%
tak 10.7% 25.3%
times10 52.2% 52.2%
unify 11.6% 80.1%
zebra 21.3% 39.8%
P.S. Be sure to use Logtalk 3.43.0 or later as it includes portability fixes for the bench example, including for GNU Prolog, and a set of basic unit tests.
I stumbled upon this comparison from 2008 in the Internet archive:
https://web.archive.org/web/20100227050426/http://www.probp.com/performance.htm

Improve Fortran formatted I/O with a large number of small files

Let's assume I have the following requirements for writing monitor files from a simulation:
A large number of individual files has to be written, typically in the order of 10000
The files must be human-readable, i.e. formatted I/O
Periodically, a new line is added to each file. Typically every 50 seconds.
The new data has to be accessible almost instantly, so large manual write buffers are not an option
We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
It was not me who formulated the requirements so unfortunately there is no point in discussing them. I would just like to find the best possible solution with above prerequisites.
I came up with a little working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest

    use types
    use omp_lib
    implicit none

    INTEGER(I4B), PARAMETER :: steps = 1000
    INTEGER(I4B), PARAMETER :: monitors = 1000
    INTEGER(I4B), PARAMETER :: cachesize = 10
    INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
    REAL(DP) :: telapsed, telapsed_global
    REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
    INTEGER(I4B) :: n, t, unitnumber, c, i, thread
    CHARACTER(LEN=100) :: dummy_char, number
    REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real

    call system_clock(counti_global,count_rate)

    ! allocate cache
    allocate(writecache_real(5,cachesize,monitors))
    writecache_real = 0.0_dp

    ! fill values
    allocate(density(steps,monitors), pressure(steps,monitors), vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
    do n=1, monitors
        do t=1, steps
            call random_number(density(t,n))
            call random_number(pressure(t,n))
            call random_number(vel_x(t,n))
            call random_number(vel_y(t,n))
            call random_number(vel_z(t,n))
        end do
    end do

    ! create files
    do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
        close(20)
    end do

    call system_clock(counti)

    ! write data
    c = 0
    do t=1, steps
        c = c + 1
        do n=1, monitors
            writecache_real(1,c,n) = density(t,n)
            writecache_real(2,c,n) = pressure(t,n)
            writecache_real(3,c,n) = vel_x(t,n)
            writecache_real(4,c,n) = vel_y(t,n)
            writecache_real(5,c,n) = vel_z(t,n)
        end do
        if(c .EQ. cachesize .OR. t .EQ. steps) then
            !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber, thread)
            thread = OMP_get_thread_num()
            unitnumber = thread + 20
            !$OMP DO
            do n=1, monitors
                write(number,'(I0.8)') n
                dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
                open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', action='write', position='append', buffered='yes')
                write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
                close(unitnumber)
            end do
            !$OMP END DO
            !$OMP END PARALLEL
            c = 0
        end if
    end do

    call system_clock(countf)
    call system_clock(countf_global)

    telapsed=real(countf-counti,kind=dp)/real(count_rate,kind=dp)
    telapsed_global=real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)

    write(*,*)
    write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
    write(*,'(A,F15.6,A)') ' global elapsed wall time: ', telapsed_global, ' seconds'
    write(*,*)

END PROGRAM iotest
The main features are: OpenMP parallelization and a manual write buffer.
Here are some of the timings on the Lustre file system with 16 threads:
cachesize=5: elapsed wall time for I/O: 991.627404 seconds
cachesize=10: elapsed wall time for I/O: 415.456265 seconds
cachesize=20: elapsed wall time for I/O: 93.842964 seconds
cachesize=50: elapsed wall time for I/O: 79.859099 seconds
cachesize=100: elapsed wall time for I/O: 23.937832 seconds
cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference the results on a local workstation HDD with deactivated HDD write cache, 16 threads:
cachesize=1: elapsed wall time for I/O: 5.543722 seconds
cachesize=2: elapsed wall time for I/O: 2.791811 seconds
cachesize=3: elapsed wall time for I/O: 1.752962 seconds
cachesize=4: elapsed wall time for I/O: 1.630385 seconds
cachesize=5: elapsed wall time for I/O: 1.174099 seconds
cachesize=10: elapsed wall time for I/O: 0.700624 seconds
cachesize=20: elapsed wall time for I/O: 0.433936 seconds
cachesize=50: elapsed wall time for I/O: 0.425782 seconds
cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. This would mean that the output lags behind, which is against the requirements formulated earlier.
Another promising approach was leaving the units open between consecutive writes. Unfortunately, the number of units open simultaneously is limited to typically 1024-4096 without root privileges. So this is not an option because the number of files can exceed this limit.
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1
From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default value was 1M). However, this did not improve I/O performance with my test case. If anyone has additional hints on more suitable file system settings, please let me know.
For everyone suffering from poor small-file performance, the new Lustre release 2.11 allows storing small files directly on the MDT, which improves access times for them.
http://cdn.opensfs.org/wp-content/uploads/2018/04/Leers-Lustre-Data_on_MDT_An_Early_Look_DDN.pdf
lfs setstripe -E 1M -L mdt -E -1 fubar will store the first megabyte of all files in the directory fubar on the MDT.

Why does reserving capacity in a gvector in Racket make performance worse?

Using the following simple benchmark in Racket 6.6:
#lang racket
(require data/gvector)
(define (run)
  ;; this should have to periodically resize in order to incorporate new data
  ;; and thus should be slower
  (time (define v (make-gvector))
        (for ((i (range 1000000))) (gvector-add! v i)))
  (collect-garbage 'major)
  ;; this should never have to resize and thus should be faster
  ;; ... but consistently benchmarks slower?!
  (time (define v (make-gvector #:capacity 1000000))
        (for ((i (range 1000000))) (gvector-add! v i))))
(run)
The version that properly reserves capacity does worse consistently. Why? This is certainly not the result that I would expect, and is inconsistent with what you would see in C++ (std::vector) or Java (ArrayList). Am I somehow benchmarking incorrectly?
Example output:
cpu time: 232 real time: 230 gc time: 104
cpu time: 228 real time: 230 gc time: 120
One benchmarking comment: use in-range instead of range in your microbenchmarks; otherwise you're including the cost of constructing a million-element list in your measurements.
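For example, the first timed loop from the question would become something like this (a minimal sketch; only range changes to in-range, and it still assumes the surrounding (require data/gvector)):

(time (define v (make-gvector))
      ;; in-range produces the indices lazily instead of building a list first
      (for ([i (in-range 1000000)])
        (gvector-add! v i)))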
I added some extra loops to your microbenchmark to make it do more work (and I fixed the range issue). Here are some of the results:
Using #:capacity for large capacities is slower.
== 5 iterations of 1e7 sized gvector, measured 3 times each way
with #:capacity
cpu time: 9174 real time: 9169 gc time: 4769
cpu time: 9109 real time: 9108 gc time: 4683
cpu time: 9094 real time: 9091 gc time: 4670
without
cpu time: 7917 real time: 7912 gc time: 3243
cpu time: 7703 real time: 7697 gc time: 3107
cpu time: 7732 real time: 7727 gc time: 3115
Using #:capacity for small capacities is faster.
== 20 iterations of 1e6 sized gvector, measured three times each way
with #:capacity
cpu time: 2167 real time: 2168 gc time: 408
cpu time: 2152 real time: 2152 gc time: 385
cpu time: 2112 real time: 2111 gc time: 373
without
cpu time: 2310 real time: 2308 gc time: 473
cpu time: 2316 real time: 2315 gc time: 480
cpu time: 2335 real time: 2334 gc time: 488
My hypothesis: it's GC overhead. When the backing vector is mutated, Racket's generational GC remembers the vector so it can scan it in the next minor collection. When the backing vector is very big, scanning the whole vector on every minor GC outweighs the cost of reallocation and copying. The overhead wouldn't occur with a GC with a finer remembered-set granularity (but... tradeoffs).
BTW, looking over the gvector code I found a couple opportunities for improvement. They don't change the big picture, though.
Increasing the vector size by a factor of 10, I get the following in DrRacket (with all debugging turned off):
cpu time: 5245 real time: 5605 gc time: 3607
cpu time: 4851 real time: 5136 gc time: 3231
Note: If there is garbage left over from the first benchmark it can affect the next one. Therefore use collect-garbage (three times) before using time again.
Also... don't run benchmarks in DrRacket as I did here; use the command line.
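A minimal sketch of wrapping that advice up in a helper macro (time-after-gc is just a hypothetical name for it):

;; Hypothetical helper: force a few major collections before timing, so
;; garbage left over from earlier runs isn't charged to this measurement.
(define-syntax-rule (time-after-gc body ...)
  (begin
    (collect-garbage)
    (collect-garbage)
    (collect-garbage)
    (time body ...)))

It can then be used in place of time, e.g. (time-after-gc (run)).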

Why not sequence- functions for all in Racket

Is there any disadvantage to using sequence-length, sequence-ref, sequence-map, etc. rather than the type-specific functions for lists (length, list-ref, etc.), strings (string-length, string-ref, etc.), vectors, and so on in Racket?
Performance.
Consider this tiny benchmark:
#lang racket/base
(require racket/sequence)
(define len 10000)
(define vec (make-vector len))
(collect-garbage)
(collect-garbage)
(collect-garbage)
(time (void (for/list ([i (in-range len)])
              (vector-ref vec i))))
(collect-garbage)
(collect-garbage)
(collect-garbage)
(time (void (for/list ([i (in-range len)])
              (sequence-ref vec i))))
This is the output on my machine:
; vectors (vector-ref vs sequence-ref)
cpu time: 1 real time: 1 gc time: 0
cpu time: 2082 real time: 2081 gc time: 0
Yes, that’s a difference of 3 orders of magnitude.
Why? Well, racket/sequence is not a terribly “smart” API, and even though vectors are random access, sequence-ref is not. Combined with the ability of the Racket optimizer to heavily optimize primitive operations, the sequence API is a pretty poor interface.
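If that linear-access behavior is the culprit, a single sequence-ref should get slower as the index grows. A quick sketch to check (assuming the behavior described above; exact timings will vary):

#lang racket/base
(require racket/sequence)

(define vec (make-vector 10000000 0))

;; sequence-ref goes through the generic sequence interface, so reaching
;; index i presumably means stepping past i elements first.
(time (void (sequence-ref vec 10)))        ; effectively instant
(time (void (sequence-ref vec 9999999)))   ; noticeably slower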
Of course, this is a little unfair, because vectors are random access while things like lists are not. However, performing the exact same test as the one above but using lists instead of vectors still yields a pretty grim result:
; lists (list-ref vs sequence-ref)
cpu time: 113 real time: 113 gc time: 0
cpu time: 1733 real time: 1732 gc time: 0
The sequence API is slow, mostly because of a high level of indirection.
Now, performance alone is not a reason to reject an API outright, since there are concrete advantages to working at a higher level of abstraction. That said, I think the sequence API is not a good abstraction, because it:
…is needlessly stateful in its implementation, which puts an unnecessary burden on implementors of the interface.
…does not accommodate things that do not resemble lists, such as random-access vectors or hash tables.
If you want to work with a higher level API, one possible option is to use the collections package, which attempts to provide an API similar to racket/sequence, but accommodating more kinds of data structures and also having a more complete set of functions. Disclaimer: I am the author of the collections package.
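As a rough sketch of how the earlier vector benchmark might be adapted to it (this assumes the package is installed, that data/collection is its main module, and that ref, the operation named in the timings below, is its generic indexing function):

#lang racket/base
;; Assumptions: data/collection is the collections package's main module,
;; and ref is its generic indexing function (the one timed below).
(require data/collection)

(define len 10000)
(define vec (make-vector len))

(collect-garbage)
(time (void (for/list ([i (in-range len)])
              (ref vec i))))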
Given the above benchmark once more, the performance is still worse than using the underlying functions directly, but it’s at least a bit more manageable:
; vectors (vector-ref vs ref)
cpu time: 2 real time: 1 gc time: 0
cpu time: 97 real time: 98 gc time: 10
; lists (list-ref vs ref)
cpu time: 104 real time: 103 gc time: 0
cpu time: 481 real time: 482 gc time: 0
Whether or not you can afford the overhead depends on what exactly you’re doing, and it’s up to you to make the call for yourself. The specialized operations will always be at least somewhat faster than the ones that defer to them as long as some sort of dynamic dispatch is being performed. As always, remember the rule of performance optimization: don’t guess, measure.

Why `(map digitToInt) . show` is so fast?

Converting non-negative Integer to its list of digits is commonly done like this:
import Data.Char
digits :: Integer -> [Int]
digits = (map digitToInt) . show
I was trying to find a more direct way to perform the task, without involving a string conversion, but I'm unable to come up with something faster.
Things I've been trying so far:
The baseline:
digits :: Int -> [Int]
digits = (map digitToInt) . show
Got this one from another question on StackOverflow:
digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)
Trying to roll my own:
digits3 :: Int -> [Int]
digits3 = reverse . revDigits3
revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
  (0, digit)    -> [digit]
  (rest, digit) -> digit:(revDigits3 rest)
This one was inspired by showInt in Numeric:
digits4 n0 = go n0 [] where
    go n cs
      | n < 10    = n:cs
      | otherwise = go q (r:cs)
      where
        (q,r) = n `quotRem` 10
Now the benchmark. Note: I'm forcing the evaluation using filter.
λ>:set +s
λ>length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)
This is the reference. Now for digits2:
λ>length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)
That's 3.46 times longer.
λ>length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)
digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3, avoiding the reverse.
λ>length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)
Strangely, this is even slower, 5.24 times slower.
And the last one:
λ>length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)
This is 10.43 times slower.
I was under the impression that using only arithmetic and cons would outperform anything involving a string conversion. Apparently, there's something I can't grasp.
So what's the trick? Why is digits so fast?
I'm using GHC 6.12.3.
Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)
Analysis:
Version 1: show is pretty good for ints, especially ones as short as the ones we have here. Making strings actually tends to be decent in GHC; however, reading from strings and writing large strings to files (or stdout, although you wouldn't want to do that) are where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.
Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10)) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the 5 GB of heap memory allocated here as well) and slower computations. (Long story short: don't use reverse.)
Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).
Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem should use the Euclidean algorithm, which is logarithmic in its larger argument. (Maybe it's faster, but I don't believe there's anything that's more than a constant factor faster than Euclid.) Furthermore, you cons onto the list as discussed last time, so that you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to parse it. As you can see, the performance benefits from this.
This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.
Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also, compile with the -O3 flag: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.
Data:
I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.
Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.
Dumping to -sstderr gives us output that looks like this, in the case of digits1:
digits1 +RTS -sstderr
12000000
2,885,827,628 bytes allocated in the heap
446,080 bytes copied during GC
3,224 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5504 collections, 0 parallel, 0.06s, 0.03s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.61s ( 1.66s elapsed)
GC time 0.06s ( 0.03s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.67s ( 1.69s elapsed)
%GC time 3.7% (1.5% elapsed)
Alloc rate 1,795,998,050 bytes per MUT second
Productivity 96.3% of total user, 95.2% of total elapsed
There are three key statistics here:
Total memory in use: only 1MB means this version is very space-efficient.
Total time: 1.61s means nothing now, but we'll see how it looks against the other implementations.
Productivity: This is just 100% minus the time spent garbage collecting; since we're at 96.3%, this means that we're not creating a lot of objects that we leave lying around in memory.
Alright, let's move on to version 2.
digits2 +RTS -sstderr
12000000
5,512,869,824 bytes allocated in the heap
1,312,416 bytes copied during GC
3,336 bytes maximum residency (1 sample(s))
13,048 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 10515 collections, 0 parallel, 0.06s, 0.04s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.20s ( 3.25s elapsed)
GC time 0.06s ( 0.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.26s ( 3.29s elapsed)
%GC time 1.9% (1.2% elapsed)
Alloc rate 1,723,838,984 bytes per MUT second
Productivity 98.1% of total user, 97.1% of total elapsed
Alright, so we're seeing an interesting pattern.
Same amount of memory used. This means that this is a pretty good implementation, although it could mean that we need to test on higher sample inputs to see if we can find a difference.
It takes twice as long. We'll come back to some speculation as to why this is later.
It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't tell us anything significant.
Version 3:
digits3 +RTS -sstderr
12000000
3,231,154,752 bytes allocated in the heap
832,724 bytes copied during GC
3,292 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 6163 collections, 0 parallel, 0.02s, 0.02s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.09s ( 2.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.11s ( 2.10s elapsed)
%GC time 0.7% (1.0% elapsed)
Alloc rate 1,545,701,615 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
Alright, so we're seeing some strange patterns.
We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
We're not quite at digits1, but we've got digits2 beat pretty easily.
Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)
And finally, version 4:
digits4 +RTS -sstderr
12000000
1,347,856,636 bytes allocated in the heap
270,692 bytes copied during GC
3,180 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2570 collections, 0 parallel, 0.00s, 0.01s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.09s ( 1.08s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.09s ( 1.09s elapsed)
%GC time 0.0% (0.8% elapsed)
Alloc rate 1,234,293,036 bytes per MUT second
Productivity 100.0% of total user, 100.5% of total elapsed
Wowza! Let's break it down:
We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
We cut off about 32% of our original time, which is pretty impressive.
I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.
Answering the question "why rem instead of mod?" from the comments: when dealing with positive values, rem x y === mod x y, so the only consideration of note is performance:
> import Test.QuickCheck
> quickCheck (\x y -> x > 0 && y > 0 ==> x `rem` y == x `mod` y)
So what is the performance? Unless you have a good reason not to (and being lazy isn't a good reason, nor is not knowing Criterion), use a good benchmarking tool. I used Criterion:
$ cat useRem.hs
import Criterion
import Criterion.Main
list :: [Integer]
list = [1..10000]
main = defaultMain
    [ bench "mod" (nf (map (`mod` 7)) list)
    , bench "rem" (nf (map (`rem` 7)) list)
    ]
Running this shows rem is measurably better than mod (compiled with -O2):
$ ./useRem
...
benchmarking mod
...
mean: 590.4692 us, lb 589.2473 us, ub 592.1766 us, ci 0.950
benchmarking rem
...
mean: 394.1580 us, lb 393.2415 us, ub 395.4184 us, ci 0.950
