Why is `(map digitToInt) . show` so fast? - performance

Converting a non-negative Integer to its list of digits is commonly done like this:
import Data.Char
digits :: Integer -> [Int]
digits = (map digitToInt) . show
I was trying to find a more direct way to perform the task, without involving a string conversion, but I'm unable to come up with something faster.
Things I've been trying so far:
The baseline:
digits :: Int -> [Int]
digits = (map digitToInt) . show
Got this one from another question on StackOverflow:
digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)
Trying to roll my own:
digits3 :: Int -> [Int]
digits3 = reverse . revDigits3
revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
  (0, digit)    -> [digit]
  (rest, digit) -> digit : revDigits3 rest
This one was inspired by showInt in Numeric:
digits4 n0 = go n0 [] where
  go n cs
    | n < 10    = n : cs
    | otherwise = go q (r : cs)
    where
      (q, r) = n `quotRem` 10
Now the benchmark. Note: I'm forcing the evaluation using filter.
λ>:set +s
λ>length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)
This is the reference. Now for digits2:
λ>length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)
That's 3.46 times longer.
λ>length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)
digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3, avoiding the reverse.
λ>length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)
Strangely, this is even slower, 5.24 times slower.
And the last one:
λ>length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)
This is 10.43 times slower.
I was under the impression that only using arithmetic and cons would outperform anything involving a string conversion. Apparently, there's something I can't grasp.
So what's the trick? Why is digits so fast?
I'm using GHC 6.12.3.

Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)
Analysis:
Version 1: show is pretty good for Ints, especially ones as short as these. Making strings actually tends to be decent in GHC; however, reading from strings and writing large strings to files (or stdout, although you wouldn't want to do that) are where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.
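To make that concrete, here is a simplified sketch of the kind of accumulator loop a show-style conversion uses for an Int. This is an illustration of the idea only, not GHC's actual showInt code, and the name showDigits is mine:

import Data.Char (intToDigit)

-- Peel digits off with quotRem and cons each Char onto the front of the
-- already-built tail, so the string comes out in order with no reverse
-- and no intermediate lists. Non-negative Ints only, like the question.
showDigits :: Int -> String
showDigits n0 = go n0 ""
  where
    go n cs
      | n < 10    = intToDigit n : cs
      | otherwise = go q (intToDigit r : cs)
      where (q, r) = n `quotRem` 10

Structurally this is the same loop as Version 4 (digits4), which is part of why the show-based version is so hard to beat.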
Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10) ) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the 5GB of heap memory allocated here as well) and slower computations. (Long story short: don't use reverse.)
Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).
Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem computes the quotient and the remainder in a single operation, so you only pay for one division per digit. Furthermore, you cons onto the front of the list as discussed above, so you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to consume it. As you can see, the performance benefits from this.
This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.
Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also compile with the -O3 flag: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.
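Putting those lessons together, a hypothetical consolidated version (the name is mine, not from the question) is just digits4 generalized to any Integral, so it also covers the original Integer case; compile it with optimizations on:

-- Accumulator-style digit extraction for any Integral (Int, Integer, ...).
-- Conses onto the front of the already-built list, so no reverse is needed.
-- Non-negative inputs only, like the versions in the question.
digitsGeneric :: Integral a => a -> [a]
digitsGeneric n0 = go n0 []
  where
    go n acc
      | n < 10    = n : acc
      | otherwise = go q (r : acc)
      where (q, r) = n `quotRem` 10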
Data:
I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.
Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.
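The exact driver isn't shown; a minimal main along these lines (an assumption on my part) reproduces the setup, with digits swapped for whichever version is being measured:

-- Hypothetical benchmark driver: counts the digits greater than 5 up to 5e6,
-- which forces every digit list to be fully evaluated.
main :: IO ()
main = print . length . filter (> 5) . concatMap digits $ [1 .. 5000000]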
Dumping to -sstderr gives us output that looks like this, in the case of digits1:
digits1 +RTS -sstderr
12000000
2,885,827,628 bytes allocated in the heap
446,080 bytes copied during GC
3,224 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5504 collections, 0 parallel, 0.06s, 0.03s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.61s ( 1.66s elapsed)
GC time 0.06s ( 0.03s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.67s ( 1.69s elapsed)
%GC time 3.7% (1.5% elapsed)
Alloc rate 1,795,998,050 bytes per MUT second
Productivity 96.3% of total user, 95.2% of total elapsed
There are three key statistics here:
Total memory in use: only 1MB means this version is very space-efficient.
Total time: 1.61s means nothing now, but we'll see how it looks against the other implementations.
Productivity: This is just 100% minus the time spent garbage collecting; since we're at 96.3%, this means we're not creating a lot of objects that we leave lying around in memory.
Alright, let's move on to version 2.
digits2 +RTS -sstderr
12000000
5,512,869,824 bytes allocated in the heap
1,312,416 bytes copied during GC
3,336 bytes maximum residency (1 sample(s))
13,048 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 10515 collections, 0 parallel, 0.06s, 0.04s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.20s ( 3.25s elapsed)
GC time 0.06s ( 0.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.26s ( 3.29s elapsed)
%GC time 1.9% (1.2% elapsed)
Alloc rate 1,723,838,984 bytes per MUT second
Productivity 98.1% of total user, 97.1% of total elapsed
Alright, so we're seeing an interesting pattern.
Same amount of memory used. This means that this is a pretty good implementation, although it could mean that we need to test on higher sample inputs to see if we can find a difference.
It takes twice as long. We'll come back to some speculation as to why this is later.
It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't tell us anything significant.
Version 3:
digits3 +RTS -sstderr
12000000
3,231,154,752 bytes allocated in the heap
832,724 bytes copied during GC
3,292 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 6163 collections, 0 parallel, 0.02s, 0.02s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.09s ( 2.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.11s ( 2.10s elapsed)
%GC time 0.7% (1.0% elapsed)
Alloc rate 1,545,701,615 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
Alright, so we're seeing some strange patterns.
We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
We're not quite at digits1, but we've got digits2 beat pretty easily.
Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)
And finally, version 4:
digits4 +RTS -sstderr
12000000
1,347,856,636 bytes allocated in the heap
270,692 bytes copied during GC
3,180 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2570 collections, 0 parallel, 0.00s, 0.01s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.09s ( 1.08s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.09s ( 1.09s elapsed)
%GC time 0.0% (0.8% elapsed)
Alloc rate 1,234,293,036 bytes per MUT second
Productivity 100.0% of total user, 100.5% of total elapsed
Wowza! Let's break it down:
We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
We cut off about 32% of our original time, which is pretty impressive.
I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.

Answering the question "why rem instead of mod?" from the comments: when dealing with positive values, rem x y == mod x y, so the only consideration of note is performance:
> import Test.QuickCheck
> quickCheck (\x y -> x > 0 && y > 0 ==> x `rem` y == x `mod` y)
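The two only coincide for positive operands: mod takes the sign of the divisor while rem takes the sign of the dividend, as this small example (Prelude only, no imports) shows:

-- mod follows the divisor's sign, rem follows the dividend's sign:
signExamples :: [(Int, Int)]
signExamples =
  [ ((-7) `mod` 3, (-7) `rem` 3)   -- (2, -1)
  , (7 `mod` (-3), 7 `rem` (-3))   -- (-2, 1)
  ]

That extra sign adjustment is the work mod does and rem skips.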
So what is the performance? Unless you have a good reason not to (and being lazy isn't a good reason, nor is not knowing Criterion), use a proper benchmarking tool. I used Criterion:
$ cat useRem.hs
import Criterion
import Criterion.Main
list :: [Integer]
list = [1..10000]
main = defaultMain
  [ bench "mod" (nf (map (`mod` 7)) list)
  , bench "rem" (nf (map (`rem` 7)) list)
  ]
Running this shows rem is measurably better than mod (compiled with -O2):
$ ./useRem
...
benchmarking mod
...
mean: 590.4692 us, lb 589.2473 us, ub 592.1766 us, ci 0.950
benchmarking rem
...
mean: 394.1580 us, lb 393.2415 us, ub 395.4184 us, ci 0.950

Related

GCC optimization flag -O2 makes code much slower than -O0 [duplicate]

I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC, and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end - start));
    return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
Testing your code on Godbolt's Compiler Explorer provides this explanation:
at -O0 or without optimisations, the generated code calls the C library function strlen;
at -O1 the generated code uses a simple inline expansion using a rep scasb instruction;
at -O2 and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:
the -O1 code is much slower than the C library implementation: 32240 vs 3090
the -O2 code is faster than the -O1 but still substantially slower than the C library code: 8570 vs 3090.
This behavior is specific to gcc and the GNU libc. The same test on OS/X with clang and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1 is always slower, by as much as a factor of 10 for moderately long strings, while -O2 is only slightly faster than the libc strlen for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen is quite efficient for most string lengths, at least on my specific hardware. Also keep in mind that caching has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
    char *s = malloc(maxlen + 1);
    memset(s, 'A', minlen);
    long long bytes = 0, calls = 0;
    clock_t clk = clock();
    for (int n = 0; n < repeat; n++) {
        for (int i = minlen; i < maxlen; ++i) {
            bytes += i + 1;
            calls += 1;
            s[i] = '\0';
            s[strlen(s)] = 'A';
        }
    }
    clk = clock() - clk;
    free(s);
    double avglen = (minlen + maxlen - 1) / 2.0;
    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
           avglen, ns / bytes, ns / calls);
}
int main() {
    benchmark(10000000, 0, 1);
    benchmark(1000000, 0, 10);
    benchmark(1000000, 5, 15);
    benchmark(100000, 0, 100);
    benchmark(100000, 50, 150);
    benchmark(10000, 0, 1000);
    benchmark(10000, 500, 1500);
    benchmark(1000, 0, 10000);
    benchmark(1000, 5000, 15000);
    benchmark(100, 1000000 - 50, 1000000 + 50);
    return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
GCC's inline strlen patterns are much slower than what it could do with SSE2 pcmpeqb / pmovmskb, and bsf, given the 16-byte alignment from calloc. This "optimization" is actually a pessimization.
My simple hand-written loop that takes advantage of 16-byte alignment is 5x faster than what gcc -O3 inlines for large buffers, and ~2x faster for short strings. (And faster than calling strlen for short strings). I've added a comment to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 to propose this for what gcc should inline at -O2 / -O3 when it's able. (With a suggestion for ramping up to 16-byte if we only know 4-byte alignment to start with.)
When gcc knows it has 4-byte alignment for the buffer (guaranteed by calloc), it chooses to inline strlen as a 4-byte-at-a-time scalar bithack using GP integer registers (-O2 and higher).
(Reading 4 bytes at a time is only safe if we know we can't cross into a page that doesn't contain any string bytes, and thus might be unmapped. Is it safe to read past the end of a buffer within the same page on x86 and x64? TL:DR yes, in asm it is, so compilers can emit code that does that even if doing so in the C source is UB. libc strlen implementations also take advantage of that. See my answer there for links to glibc strlen and a summary of how it runs so fast for large strings.)
At -O1, gcc always (even without known alignment) chooses to inline strlen as repnz scasb, which is very slow (about 1 byte per clock cycle on modern Intel CPUs). "Fast strings" only applies to rep stos and rep movs, not the repz/repnz instructions, unfortunately. Their microcode is just simple 1 byte at a time, but they still have some startup overhead. (https://agner.org/optimize/)
(We can test this by "hiding" the pointer from the compiler by storing / reloading s to a volatile void *tmp, for example. gcc has to make zero assumptions about the pointer value that's read back from a volatile, destroying any alignment info.)
GCC does have some x86 tuning options like -mstringop-strategy=libcall vs. unrolled_loop vs. rep_byte for inlining string operations in general (not just strlen; memcmp would be another major one that can be done with rep or a loop). I haven't checked what effect these have here.
The docs for another option also describe the current behaviour. We could get this inlining (with extra code for alignment-handling) even in cases where we wanted it on unaligned pointers. (This used to be an actual perf win, especially for small strings, on targets where the inline loop wasn't garbage compared to what the machine can do.)
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to least a 4-byte boundary. This enables more inlining and increases code size, but may improve performance of code that depends on fast memcpy, strlen, and memset for short lengths.
GCC also has per-function attributes you can apparently use to control this, like __attribute__((no-inline-all-stringops)) void foo() { ... }, but I haven't played around with it. (That's the opposite of inline-all. It doesn't mean inline none, it just goes back to only inlining when 4-byte alignment is known.)
Both of gcc's inline strlen strategies fail to take advantage of 16-byte alignment, and are pretty bad for x86-64
Unless the small-string case is very common, doing one 4-byte chunk, then aligned 8-byte chunks would go about twice as fast as 4-byte.
And the 4-byte strategy has much slower cleanup than necessary for finding the byte within the dword containing the zero byte. It detects this by looking for a byte with its high bit set, so it should just mask off the other bits and use bsf (bit-scan forward). That has 3 cycle latency on modern CPUs (Intel and Ryzen). Or compilers can use rep bsf so it runs as tzcnt on CPUs that support BMI1, which is more efficient on AMD. bsf and tzcnt give the same result for non-zero inputs.
GCC's 4-byte loop looks like it's compiled from pure C, or some target-independent logic, not taking advantage of bitscan. gcc does use andn to optimize it when compiling for x86 with BMI1, but it's still less than 4 bytes per cycle.
SSE2 pcmpeqb + bsf is much, much better for both short and long inputs. x86-64 guarantees that SSE2 is available, and the x86-64 System V ABI has alignof(max_align_t) = 16, so calloc will always return pointers that are at least 16-byte aligned.
I wrote a replacement for the strlen block to test performance
As expected it's about 4x faster on Skylake going 16 bytes at a time instead of 4.
(I compiled the original source to asm with -O3, then edited the asm to see what performance should have been with this strategy for inline expansion of strlen. I also ported it to inline asm inside the C source; see that version on Godbolt.)
# at this point gcc has `s` in RDX, `i` in ECX
pxor %xmm0, %xmm0 # zeroed vector to compare against
.p2align 4
.Lstrlen16: # do {
#ifdef __AVX__
vpcmpeqb (%rdx), %xmm0, %xmm1
#else
movdqa (%rdx), %xmm1
pcmpeqb %xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory
#endif
add $16, %rdx # ptr++
pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask
test %eax, %eax
jz .Lstrlen16 # }while(mask==0);
# RDX points at the 16-byte chunk *after* the one containing the terminator
# EAX = bit-mask of the 0 bytes, and is known to be non-zero
bsf %eax, %eax # EAX = bit-index of the lowest set bit
movb $'A', -16(%rdx, %rax)
Note that I optimized part of the strlen cleanup into the store addressing mode: I correct for the overshoot with the -16 displacement, and this is just finding the end of the string, not actually calculating the length and then indexing the way GCC was already doing after inlining its 4-byte-at-a-time loop.
To get actual string length (instead of pointer to the end), you'd subtract rdx-start and then add rax-16 (maybe with an LEA to add 2 registers + a constant, but 3-component LEA has more latency.)
With AVX to allow load+compare in one instruction without destroying the zeroed register, the whole loop is only 4 uops, down from 5. (test/jz macro-fuses into one uop on both Intel and AMD. vpcmpeqb with a non-indexed memory-source can keep it micro-fused through the whole pipeline, so it's only 1 fused-domain uop for the front-end.)
(Note that mixing 128-bit AVX with SSE does not cause stalls even on Haswell, as long as you're in clean-upper state to start with. So I didn't bother about changing the other instructions to AVX, only the one that mattered. There seemed to be some minor effect where pxor was actually slightly better than vpxor on my desktop, though, for an AVX loop body. It seemed somewhat repeatable, but it's weird because there's no code-size difference and thus no alignment difference.)
pmovmskb is a single-uop instruction. It has 3-cycle latency on Intel and Ryzen (worse on Bulldozer-family). For short strings, the trip through the SIMD unit and back to integer is an important part of the critical path dependency chain for latency from input memory bytes to store-address being ready. But only SIMD has packed-integer compares, so scalar would have to do more work.
For the very-small string case (like 0 to 3 bytes), it might be possible to achieve slightly lower latency for that case by using pure scalar (especially on Bulldozer-family), but having all strings from 0 to 15 bytes take the same branch path (loop branch never taken) is very nice for most short-strings use-cases.
Being very good for all strings up to 15 bytes seems like a good choice, when we know we have 16-byte alignment. More predictable branching is very good. (And note that when looping, pmovmskb latency only affects how quickly we can detect branch mispredicts to break out of the loop; branch prediction + speculative execution hides the latency of the independent pmovmskb in each iteration.)
If we expected longer strings to be common, we could unroll a bit, but at that point you should just call the libc function so it can dispatch to AVX2 if available at runtime. Unrolling to more than 1 vector complicates the cleanup, hurting the simple cases.
On my machine i7-6700k Skylake at 4.2GHz max turbo (and energy_performance_preference = performance), with gcc8.2 on Arch Linux, I get somewhat consistent benchmark timing because my CPU clock speed ramps up during the memset. But maybe not always to max turbo; Skylake's hw power management downclocks when memory-bound. perf stat showed I typically got right around 4.0GHz when running this to average the stdout output and see perf summary on stderr.
perf stat -r 100 ./a.out | awk '{sum+= $1} END{print sum/100;}'
I ended up copying my asm into a GNU C inline-asm statement, so I could put the code on the Godbolt compiler explorer.
For large strings, same length as in the question: times on ~4GHz Skylake
~62100 clock_t time units: -O1 rep scas: (clock() is a bit obsolete, but I didn't bother changing it.)
~15900 clock_t time units: -O3 gcc 4-byte loop strategy, avg of 100 runs. (Or maybe ~15800 with -march=native for andn)
~1880 clock_t time units: -O3 with glibc strlen function calls, using AVX2
~3190 clock_t time units: (AVX1 128-bit vectors, 4 uop loop) hand-written inline asm that gcc could/should inline.
~3230 clock_t time units: (SSE2 5 uop loop) hand-written inline asm that gcc could/should inline.
My hand-written asm should be very good for short strings, too, because it doesn't need to branch specially. Known alignment is very good for strlen, and libc can't take advantage of it.
If we expect large strings to be rare, being 1.7x slower than libc for that case is fine. The length of 1M bytes means it won't be staying hot in L2 (256k) or L1d cache (32k) on my CPU, so even bottlenecked on L3 cache the libc version was faster. (Probably an unrolled loop and 256-bit vectors doesn't clog up the ROB with as many uops per byte, so OoO exec can see farther ahead and get more memory parallelism, especially at page boundaries.)
But L3 cache bandwidth is probably a bottleneck stopping the 4-uop version from running at 1 iteration per clock, so we're seeing less benefit from AVX saving us a uop in the loop. With data hot in L1d cache, we should get 1.25 cycles per iteration vs. 1.
But a good AVX2 implementation can read up to 64 bytes per cycle (2x 32 byte loads) using vpminub to combine pairs before checking for zeros and going back to find where they were. The gap between this and libc opens wider for sizes of ~2k to ~30 kiB or so that stay hot in L1d.
Some read-only testing with length=1000 indicates that glibc strlen really is about 4x faster than my loop for medium size strings hot in L1d cache. That's large enough for AVX2 to ramp up to the big unrolled loop, but still easily fits in L1d cache. (Read-only avoids store-forwarding stalls, so we can do many iterations.)
If your strings are that big, you should be using explicit-length strings instead of needing to strlen at all, so inlining a simple loop still seems like a reasonable strategy, as long as it's actually good for short strings and not total garbage for medium (like 300 bytes) and very long (> cache size) strings.
Benchmarking small strings with this:
I ran into some oddities in trying to get the results I expected:
I tried s[31] = 0 to truncate the string before every iteration (allowing short constant length). But then my SSE2 version was almost the same speed as GCC's version. Store-forwarding stalls were the bottleneck! A byte store followed by a wider load makes store-forwarding take the slow path that merges bytes from the store buffer with bytes from L1d cache. This extra latency is part of a loop-carried dep chain through the last 4-byte or 16-byte chunk of the string, to calculate the store index for the next iteration.
GCC's slower 4-byte-at-a-time code could keep up by processing the earlier 4-byte chunks in the shadow of that latency. (Out-of-order execution is pretty fantastic: slow code can sometimes not affect the overall speed of your program).
I eventually solved it by making a read-only version, and using inline asm to stop the compiler from hoisting strlen out of the loop.
But store-forwarding is a potential issue with using 16-byte loads. If other C variables are stored past the end of the array, we might hit a SF stall due to loading off the end of the array farther than with narrower stores. For recently-copied data, we're fine if it was copied with 16-byte or wider aligned stores, but glibc memcpy for small copies does 2x overlapping loads that cover the whole object, from the start and end of the object. Then it stores both, again overlapping, handling the memmove src overlaps dst case for free. So the 2nd 16-byte or 8-byte chunk of a short string that was just memcpyed might give us a SF stall for reading the last chunk. (The one that has the data dependency for the output.)
Just running slower so you don't get to the end before it's ready isn't good in general, so there's no great solution here. I think most of the time you're not going to strlen a buffer you just wrote, usually you're going to strlen an input that you're only reading so store-forwarding stalls aren't a problem. If something else just wrote it, then efficient code hopefully wouldn't have thrown away the length and called a function that required recalculating it.
Other weirdness I haven't totally figured out:
Code alignment is making a factor of 2 difference for read-only, size=1000 (s[1000] = 0;). But the inner-most asm loop itself is aligned with .p2align 4 or .p2align 5. Increasing the loop alignment can slow it down by a factor of 2!
# slow version, with *no* extra HIDE_ALIGNMENT function call before the loop.
# using my hand-written asm, AVX version.
i<1280000 read-only at strlen(s)=1000 so strlen time dominates the total runtime (not startup overhead)
.p2align 5 in the asm inner loop. (32-byte code alignment with NOP padding)
gcc -DUSE_ASM -DREAD_ONLY -DHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
40.92 msec task-clock # 0.996 CPUs utilized ( +- 0.20% )
2 context-switches # 0.052 K/sec ( +- 3.31% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.008 M/sec ( +- 0.05% )
168,103,223 cycles # 4.108 GHz ( +- 0.20% )
82,293,840 branches # 2011.269 M/sec ( +- 0.00% )
1,845,647 branch-misses # 2.24% of all branches ( +- 0.74% )
412,769,788 instructions # 2.46 insn per cycle ( +- 0.00% )
466,515,986 uops_issued.any # 11401.694 M/sec ( +- 0.22% )
487,011,558 uops_executed.thread # 11902.607 M/sec ( +- 0.13% )
0.0410624 +- 0.0000837 seconds time elapsed ( +- 0.20% )
40326.5 (clock_t)
real 0m4.301s
user 0m4.050s
sys 0m0.224s
Note branch misses definitely non-zero, vs. almost exactly zero for the fast version. And uops issued is much higher than the fast version: it may be speculating down the wrong path for a long time on each of those branch misses.
Probably the inner and outer loop-branches are aliasing each other, or not.
Instruction count is nearly identical, just different by some NOPs in the outer loop ahead of the inner loop. But IPC is vastly different: without problems, the fast version runs an average of 4.82 instructions per clock for the whole program. (Most of that is in the inner-most loop running 5 instructions per cycle, thanks to a test/jz that macro-fuses 2 instructions into 1 uop.) And note that uops_executed is much higher than uops_issued: that means micro-fusion is working well to get more uops through the front-end bottleneck.
fast version, same read-only strlen(s)=1000 repeated 1280000 times
gcc -DUSE_ASM -DREAD_ONLY -UHIDE_ALIGNMENT -march=native -O3 -g strlen-microbench.c &&
time taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 100 ./a.out |
awk '{sum+= $1} END{print sum/100;}'
Performance counter stats for './a.out' (100 runs):
21.06 msec task-clock # 0.994 CPUs utilized ( +- 0.10% )
1 context-switches # 0.056 K/sec ( +- 5.30% )
0 cpu-migrations # 0.000 K/sec
313 page-faults # 0.015 M/sec ( +- 0.04% )
86,239,943 cycles # 4.094 GHz ( +- 0.02% )
82,285,261 branches # 3906.682 M/sec ( +- 0.00% )
17,645 branch-misses # 0.02% of all branches ( +- 0.15% )
415,286,425 instructions # 4.82 insn per cycle ( +- 0.00% )
335,057,379 uops_issued.any # 15907.619 M/sec ( +- 0.00% )
409,255,762 uops_executed.thread # 19430.358 M/sec ( +- 0.00% )
0.0211944 +- 0.0000221 seconds time elapsed ( +- 0.10% )
20504 (clock_t)
real 0m2.309s
user 0m2.085s
sys 0m0.203s
I think it's just the branch prediction, not other front-end stuff that's a problem. The test/branch instructions aren't getting split across a boundary that would prevent macro-fusion.
Changing .p2align 5 to .p2align 4 reverses them: -UHIDE_ALIGNMENT becomes slow.
This Godbolt binary link reproduces the same padding I'm seeing with gcc8.2.1 on Arch Linux for both cases: 2x 11-byte nopw + a 3-byte nop inside the outer loop for the fast case. It also has the exact source I was using locally.
short strlen read-only micro-benchmarks:
Tested with stuff chosen so it doesn't suffer from branch mispredicts or store-forwarding, and can test the same short length repeatedly for enough iterations to get meaningful data.
strlen=33, so the terminator is near the start of the 3rd 16-byte vector. (Makes my version look as bad as possible vs. the 4-byte version.) -DREAD_ONLY, and i<1280000 as an outer-loop repeat loop.
1933 clock_t: my asm: nice and consistent best-case time (not noisy / bouncing around when re-running the average.) Equal perf with/without -DHIDE_ALIGNMENT, unlike for the longer strlen. The loop branch is much more easily predictable with that much shorter pattern. (strlen=33, not 1000).
3220 clock_t: gcc -O3 call glibc strlen. (-DHIDE_ALIGNMENT)
6100 clock_t: gcc -O3 4-byte loop
37200 clock_t: gcc -O1 repz scasb
So for short strings, my simple inline loop beats a library function call to strlen that has to go through the PLT (call + jmp [mem]), then run strlen's startup overhead that can't depend on alignment.
There were negligible branch-mispredicts, like 0.05% for all the versions with strlen(s)=33. The repz scasb version had 0.46%, but that's out of fewer total branches. No inner loop to rack up many correctly predicted branches.
With branch predictors and code-cache hot, repz scasb is more than 10x worse than calling glibc strlen for a 33-byte string. It would be less bad in real use cases where strlen could branch miss or even miss in code-cache and stall, but straight-line repz scasb wouldn't. But 10x is huge, and that's for a fairly short string.

Reading Go gctrace output

I have gctrace output that looks like this:
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?
In your case, they look like assist and termination times;
// CPU time
0.18 : **STW** Sweep termination.
7720ms : Mark/Scan - Assist Time (GC performed in line with allocation).
21356ms : Mark/Scan - Background GC time.
3615ms : Mark/Scan - Idle GC time.
0.65ms : **STW** Mark termination.
I think it changes (or may change) across Go versions, and you can find more detailed info in the runtime package docs.
Currently, it is:
gc # @#s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc # the GC number, incremented at each GC
@#s time in seconds since program start
#% percentage of time spent in GC since program start
#+...+# wall-clock/CPU times for the phases of the GC
#->#-># MB heap size at GC start, at GC end, and live heap
# MB goal goal heap size
# P number of processors used
Example here
See also Interpreting GC trace output
gc 6 @48.155s 15%: 0.093+12360+0.32 ms clock,
0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
gc 6
@48.155s since program start
15%: of time spent in GC since program start
0.093+12360+0.32 ms clock stop-the-world (STW) sweep termination + concurrent
mark and scan + STW mark termination
0.18+7720/21356/3615+0.65 ms cpu STW sweep termination + mark/scan (assist time,
i.e. GC performed in line with allocation / background GC time / idle GC time) + STW mark termination
11039->13278->6876 MB heap size at GC start, at GC end, and live heap
8 P number of processors used

Why does reserving capacity in a gvector in Racket make performance worse?

Using the following simple benchmark in Racket 6.6:
#lang racket
(require data/gvector)
(define (run)
  ;; this should have to periodically resize in order to incorporate new data
  ;; and thus should be slower
  (time (define v (make-gvector)) (for ((i (range 1000000))) (gvector-add! v i)))
  (collect-garbage 'major)
  ;; this should never have to resize and thus should be faster
  ;; ... but consistently benchmarks slower?!
  (time (define v (make-gvector #:capacity 1000000)) (for ((i (range 1000000))) (gvector-add! v i))))
(run)
The version that properly reserves capacity does worse consistently. Why? This is certainly not the result that I would expect, and is inconsistent with what you would see in C++ (std::vector) or Java (ArrayList). Am I somehow benchmarking incorrectly?
Example output:
cpu time: 232 real time: 230 gc time: 104
cpu time: 228 real time: 230 gc time: 120
One benchmarking comment: use in-range instead of range in your microbenchmarks; otherwise you're including the cost of constructing a million-element list in your measurements.
I added some extra loops to your microbenchmark to make it do more work (and I fixed the range issue). Here are some of the results:
Using #:capacity for large capacities is slower.
== 5 iterations of 1e7 sized gvector, measured 3 times each way
with #:capacity
cpu time: 9174 real time: 9169 gc time: 4769
cpu time: 9109 real time: 9108 gc time: 4683
cpu time: 9094 real time: 9091 gc time: 4670
without
cpu time: 7917 real time: 7912 gc time: 3243
cpu time: 7703 real time: 7697 gc time: 3107
cpu time: 7732 real time: 7727 gc time: 3115
Using #:capacity for small capacities is faster.
== 20 iterations of 1e6 sized gvector, measured three times each way
with #:capacity
cpu time: 2167 real time: 2168 gc time: 408
cpu time: 2152 real time: 2152 gc time: 385
cpu time: 2112 real time: 2111 gc time: 373
without
cpu time: 2310 real time: 2308 gc time: 473
cpu time: 2316 real time: 2315 gc time: 480
cpu time: 2335 real time: 2334 gc time: 488
My hypothesis: it's GC overhead. When the backing vector is mutated, Racket's generational GC remembers the vector so it can scan it in the next minor collection. When the backing vector is very big, scanning the whole vector on every minor GC outweighs the cost of reallocation and copying. The overhead wouldn't occur with a GC with a finer remembered-set granularity (but... tradeoffs).
BTW, looking over the gvector code I found a couple opportunities for improvement. They don't change the big picture, though.
Increasing the vector size by a factor of 10, I get the following in DrRacket (with all debugging turned off):
cpu time: 5245 real time: 5605 gc time: 3607
cpu time: 4851 real time: 5136 gc time: 3231
Note: If there is garbage left over from the first benchmark it can affect the next one. Therefore use collect-garbage (three times) before using time again.
Also... don't make benchmarks in DrRacket as I did - use the command line.

Benchmarking in Ruby

I have been doing some benchmarking in Ruby and have the following results:
user system total real
part1 0.156000 0.000000 0.156000 ( 0.158009)
user system total real
part2 0.015000 0.000000 0.015000 ( 0.162010)
Commonly, as in part1, the total and real times are nearly the same. However this is not true in part 2.
What is the meaning of the total and real divergence in part2?
Does the divergence raise any concerns?
What run is faster?
user/system are CPU times, measured by the kernel that scheduled your process.
real time is the elapsed wall-clock time of the calculation.
So real time being bigger than user+system means:
there is IO or sleep in the code being tested, or
another process/daemon is consuming CPU.
The results are organized in columns, in this order: user CPU time, system CPU time, the sum of the user and system CPU times, and the elapsed real time. All units are seconds. So in real time, part1 was faster than part2.

Forcing sequential processing in Haskell's Data.Binary.Get

After trying to import the basic Java runtime library rt.jar with language-java-classfile, I've discovered that it uses huge amounts of memory.
I've reduced the program demonstrating the problem to 100 lines and uploaded it to hpaste. Without forcing the evaluation of stream in line #94, I have no chance of ever running it because it eats up all my memory. Forcing stream before passing it to getClass finishes, but still uses up huge amounts of memory:
34,302,587,664 bytes allocated in the heap
32,583,990,728 bytes copied during GC
139,810,024 bytes maximum residency (398 sample(s))
29,142,240 bytes maximum slop
281 MB total memory in use (4 MB lost due to fragmentation)
Generation 0: 64992 collections, 0 parallel, 38.07s, 37.94s elapsed
Generation 1: 398 collections, 0 parallel, 25.87s, 27.78s elapsed
INIT time 0.01s ( 0.00s elapsed)
MUT time 37.22s ( 36.85s elapsed)
GC time 63.94s ( 65.72s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 13.00s ( 13.18s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 114.17s (115.76s elapsed)
%GC time 56.0% (56.8% elapsed)
Alloc rate 921,369,531 bytes per MUT second
Productivity 32.6% of total user, 32.2% of total elapsed
I thought the problem was the ConstTables staying around, so I tried forcing cls in line #94 as well. But this only makes the memory consumption and the runtime worse:
34,300,700,520 bytes allocated in the heap
23,579,794,624 bytes copied during GC
487,798,904 bytes maximum residency (423 sample(s))
36,312,104 bytes maximum slop
554 MB total memory in use (10 MB lost due to fragmentation)
Generation 0: 64983 collections, 0 parallel, 71.19s, 71.48s elapsed
Generation 1: 423 collections, 0 parallel, 344.74s, 353.01s elapsed
INIT time 0.01s ( 0.00s elapsed)
MUT time 40.60s ( 42.38s elapsed)
GC time 415.93s (424.49s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 56.53s ( 57.71s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 513.07s (524.58s elapsed)
%GC time 81.1% (80.9% elapsed)
Alloc rate 844,636,801 bytes per MUT second
Productivity 7.9% of total user, 7.7% of total elapsed
So my question is basically, how do I force sequential processing of the files involved, so that after each one is processed, only the string result (cls) remains in memory?
Edit 2: I just realized your code does this:
stream <- BL.pack <$> fileContents [] classfile
Don't do that. The pack functions are notoriously slow. You'll need to find a solution that doesn't involve using pack to create a ByteString.
I'm leaving the rest of my answer because I still think it applies, but this is almost certainly the biggest problem.
Unfortunately I can't test this because I don't recognize all your imports.
If you only want the result cls to remain in memory, why don't you force it instead of forcing stream? Change line 94 to
cls `seq` return cls
It may be necessary to use deepseq instead of just seq, although I have a suspicion that plain seq will be sufficient here.
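For reference, here is a small sketch of the difference between the two forcing levels (it assumes the type of cls has an NFData instance, which the deepseq package requires):

import Control.DeepSeq (NFData, deepseq)

-- seq only evaluates its argument to weak head normal form (the outermost
-- constructor), so thunks inside the structure can still hold on to the input.
forceShallow :: a -> IO a
forceShallow x = x `seq` return x

-- deepseq walks the entire structure, so nothing is left unevaluated.
forceDeep :: NFData a => a -> IO a
forceDeep x = x `deepseq` return x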
However I think there's a better solution, and that's to use mapM_ instead of mapM. I think it's usually better style (and nearly always better performance) to create a function that does what it's supposed to with each result rather than returning a list. Here, you can change your main function to:
main = do
  withArchive [CheckConsFlag] jarPath $ do
    classfiles <- filter isClassfile <$> fileNames []
    forM_ classfiles $ \classfile -> do
      stream <- BL.pack <$> fileContents [] classfile
      let cls = runGet getClass stream
      lift $ print cls
Now the print is lifted into the function passed to forM_ for each classfile. The value cls is used internally and never returned, so it's both fully evaluated and quickly GC'd on each iteration of forM_.
Making use of this style in a larger application may require some refactoring or even redesign, but the results may be worth it.
Edit: If you're going to the trouble to redesign your code, you could use iteratees and avoid this problem entirely.
Your idea to force evaluation of cls in line 94 was right. But I guess your approach to doing so wasn't successful. See this paste for my version, which runs in ca. 40MB instead of 220MB.
The key is to force reduction to normal form of cls, which is done by rnf cls. And this has to happen before the call to return. Therefore:
rnf cls `seq` return cls
Alternatively, you could use Control.Exception.evaluate:
evaluate $ rnf cls
return cls
Thanks for the suggestions.
I think for my concrete problem, the solution will be to process .jar files in small chunks -- fortunately, inner classes are always in the same dir in the .jar file as their outer class, so there is no need to process all 50 megs in one run.
The only thing I couldn't quite understand is whether it is possible to use libzip via enumerators, or whether that would need a new libzip implementation.
