After trying to import the basic Java runtime library rt.jar with language-java-classfile, I've discovered that it uses huge amounts of memory.
I've reduced the program demonstrating the problem to 100 lines and uploaded it to hpaste. Without forcing the evaluation of stream in line #94, I have no chance of ever running it because it eats up all my memory. Forcing stream before passing it to getClass finishes, but still uses up huge amounts of memory:
34,302,587,664 bytes allocated in the heap
32,583,990,728 bytes copied during GC
139,810,024 bytes maximum residency (398 sample(s))
29,142,240 bytes maximum slop
281 MB total memory in use (4 MB lost due to fragmentation)
Generation 0: 64992 collections, 0 parallel, 38.07s, 37.94s elapsed
Generation 1: 398 collections, 0 parallel, 25.87s, 27.78s elapsed
INIT time 0.01s ( 0.00s elapsed)
MUT time 37.22s ( 36.85s elapsed)
GC time 63.94s ( 65.72s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 13.00s ( 13.18s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 114.17s (115.76s elapsed)
%GC time 56.0% (56.8% elapsed)
Alloc rate 921,369,531 bytes per MUT second
Productivity 32.6% of total user, 32.2% of total elapsed
I thought the problem was the ConstTables staying around, so I tried forcing cls in line #94 as well. But this only makes the memory consumption and the runtime worse:
34,300,700,520 bytes allocated in the heap
23,579,794,624 bytes copied during GC
487,798,904 bytes maximum residency (423 sample(s))
36,312,104 bytes maximum slop
554 MB total memory in use (10 MB lost due to fragmentation)
Generation 0: 64983 collections, 0 parallel, 71.19s, 71.48s elapsed
Generation 1: 423 collections, 0 parallel, 344.74s, 353.01s elapsed
INIT time 0.01s ( 0.00s elapsed)
MUT time 40.60s ( 42.38s elapsed)
GC time 415.93s (424.49s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 56.53s ( 57.71s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 513.07s (524.58s elapsed)
%GC time 81.1% (80.9% elapsed)
Alloc rate 844,636,801 bytes per MUT second
Productivity 7.9% of total user, 7.7% of total elapsed
So my question is basically, how do I force sequential processing of the files involved, so that after each one is processed, only the string result (cls) remains in memory?
Edit 2: I just realized your code does this:
stream <- BL.pack <$> fileContents [] classfile
Don't do that. The pack functions are notoriously slow. You'll need to find a solution that doesn't involve using pack to create a ByteString.
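One hedged possibility (I can't verify LibZip's API from here): if the library can hand you the contents as strict ByteString chunks rather than a list of bytes, BL.fromChunks builds the lazy ByteString with no per-byte packing:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- Sketch only: the chunk list would come from whatever chunk-wise
-- interface the archive library actually offers.
toLazy :: [BS.ByteString] -> BL.ByteString
toLazy = BL.fromChunks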
I'm leaving the rest of my answer because I still think it applies, but this is almost certainly the biggest problem.
Unfortunately I can't test this because I don't recognize all your imports.
If you only want the result cls to remain in memory, why don't you force it instead of forcing stream? Change line 94 to
cls `seq` return cls
It may be necessary to use deepseq instead of just seq, although I have a suspicion that plain seq will be sufficient here.
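For illustration, a minimal sketch of the deepseq variant (assumptions: your result type has, or can be given, an NFData instance; forceReturn is a name I made up):
import Control.DeepSeq (NFData, deepseq)

-- Force a value to normal form before returning it, so no thunks escape.
forceReturn :: (NFData a, Monad m) => a -> m a
forceReturn x = x `deepseq` return x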
However, I think there's a better solution: use mapM_ instead of mapM. It's usually better style (and nearly always better performance) to pass each result to a function that does what's needed with it, rather than collecting the results into a list. Here, you can change your main function to:
main = do
  withArchive [CheckConsFlag] jarPath $ do
    classfiles <- filter isClassfile <$> fileNames []
    forM_ classfiles $ \classfile -> do
      stream <- BL.pack <$> fileContents [] classfile
      let cls = runGet getClass stream
      lift $ print cls
Now the print is lifted into the function passed to forM_ for each classfile. The value cls is used internally and never returned, so it's both fully evaluated and quickly GC'd on each iteration of forM_.
Making use of this style in a larger application may require some refactoring or even redesign, but the results may be worth it.
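As a toy illustration of the pattern (nothing to do with archives; print stands in for your per-file work):
import Control.Monad (forM_)

-- Because forM_ discards each result after the body runs, every value is
-- consumed and GC'd before the next iteration, so residency stays flat.
main :: IO ()
main = forM_ [1 .. 1000000 :: Int] $ \i ->
  print (i * i)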
Edit: If you're going to the trouble to redesign your code, you could use iteratees and avoid this problem entirely.
Your idea to force evaluation of cls in line 94 was right. But I guess your approach to doing so wasn't successful. See this paste for my version, which runs in ca. 40MB instead of 220MB.
The key is to force reduction to normal form of cls, which is done by rnf cls. And this has to happen before the call to return. Therefore:
rnf cls `seq` return cls
Alternatively, you could use Control.Exception.evaluate:
evaluate $ rnf cls
return cls
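Newer versions of the deepseq package also offer force, which combines nicely with evaluate (a sketch, again assuming an NFData instance for the result type; strictly is a made-up name):
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Evaluate a value to normal form inside IO before handing it back.
strictly :: NFData a => a -> IO a
strictly = evaluate . force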
Thanks for the suggestions.
I think for my concrete problem, the solution will be to process .jar files in small chunks -- fortunately, inner classes are always in the same dir in the .jar file as their outer class, so there is no need to process all 50 megs in one run.
The only thing I couldn't quite understand is whether it is possible to use libzip via enumerators, or whether that would need a new libzip implementation.
I create a new cell in my Jupyter notebook.
I type %%time in the first line of my new cell.
I type some code in the second line.
I run this cell and get output like the following:
CPU times: user 2min 8s, sys: 14.5 s, total: 2min 22s
Wall time: 1min 29s
My question is: what do these parameters mean?
CPU times: user, sys, total (I think it means user + sys), and Wall time
If we run the code below in a cell:
%%time
from time import sleep
for i in range(3):
    print(i, end=' ')
    sleep(0.1)
The output is:
0 1 2
CPU times: user 5.69 ms, sys: 118 µs, total: 5.81 ms
Wall time: 304 ms
The wall time means that a clock hanging on a wall outside of the computer would measure 304 ms from the time the code was submitted to the CPU to the time when the process completed.
User time and sys time both refer to time taken by the CPU to actually work on the code. The CPU time dedicated to our code is only a fraction of the wall time as the CPU swaps its attention from our code to other processes that are running on the system.
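You can check this against the output above:
total = user + sys = 5.69 ms + 0.118 ms ≈ 5.81 ms of CPU work
wall  ≈ 3 × sleep(0.1 s) = 300 ms, plus a few milliseconds for the prints
The CPU total is tiny because a sleeping process consumes no CPU; it simply waits.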
User time is the amount of CPU time taken outside of the kernel. Sys time is the amount of CPU time taken inside of the kernel. The total CPU time is user time + sys time. The difference between user and sys time is well explained in this post:
What do 'real', 'user' and 'sys' mean in the output of time(1)?
I have gctrace output that looks like this:
gc 6 #48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P
I am not sure how to read the CPU times in particular. I understand that it is broken down into three phases (STW sweep termination, concurrent mark/scan, and STW mark termination), but I'm not sure what the + signs mean (i.e. 0.18+7720 and 3615+0.65). What do these + signs signify?
In your case, they look like assist and termination times:
// CPU time
0.18ms  : STW sweep termination.
7720ms  : Mark/Scan - assist time (GC performed in line with allocation).
21356ms : Mark/Scan - background GC time.
3615ms  : Mark/Scan - idle GC time.
0.65ms  : STW mark termination.
I think it changes (or it may) over various Go versions and you can find more detailed info at runtime package docs.
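For reference, a trace line like this is printed at each collection when the program is run with the gctrace option of the GODEBUG environment variable enabled:
GODEBUG=gctrace=1 ./yourprogram
(./yourprogram is just a placeholder for your binary.)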
Currently, it is:
gc # ##s #%: #+#+# ms clock, #+#/#/#+# ms cpu, #->#-># MB, # MB goal, # P
where the fields are as follows:
gc #        the GC number, incremented at each GC
##s         time in seconds since program start
#%          percentage of time spent in GC since program start
#+...+#     wall-clock/CPU times for the phases of the GC
#->#-># MB  heap size at GC start, at GC end, and live heap
# MB goal   goal heap size
# P         number of processors used
Example here
See also Interpreting GC trace output
gc 6 #48.155s 15%: 0.093+12360+0.32 ms clock, 0.18+7720/21356/3615+0.65 ms cpu, 11039->13278->6876 MB, 14183 MB goal, 8 P

gc 6                              the GC number
#48.155s                          time since program start
15%:                              percentage of time spent in GC since program start
0.093+12360+0.32 ms clock         stop-the-world (STW) sweep termination + concurrent
                                  mark and scan + STW mark termination
0.18+7720/21356/3615+0.65 ms cpu  STW sweep termination + mark/scan (assist time, i.e.
                                  GC performed in line with allocation / background GC
                                  time / idle GC time) + STW mark termination
11039->13278->6876 MB             heap size at GC start, at GC end, and live heap
14183 MB goal                     goal heap size
8 P                               number of processors used
Let's assume I have the following requirements for writing monitor files from a simulation:
A large number of individual files have to be written, typically on the order of 10,000
The files must be human-readable, i.e. formatted I/O
Periodically, a new line is added to each file. Typically every 50 seconds.
The new data has to be accessible almost instantly, so large manual write buffers are not an option
We are on a Lustre file system that appears to be optimized for just about the opposite: sequential writes to a small number of large files.
It was not me who formulated the requirements so unfortunately there is no point in discussing them. I would just like to find the best possible solution with above prerequisites.
I came up with a little working example to test a few implementations. Here is the best I could do so far:
!===============================================================!
! program to test some I/O implementations for many small files !
!===============================================================!
PROGRAM iotest

  use types
  use omp_lib

  implicit none

  INTEGER(I4B), PARAMETER :: steps = 1000
  INTEGER(I4B), PARAMETER :: monitors = 1000
  INTEGER(I4B), PARAMETER :: cachesize = 10
  INTEGER(I8B) :: counti, countf, count_rate, counti_global, countf_global
  REAL(DP) :: telapsed, telapsed_global
  REAL(DP), DIMENSION(:,:), ALLOCATABLE :: density, pressure, vel_x, vel_y, vel_z
  INTEGER(I4B) :: n, t, unitnumber, c, i, thread
  CHARACTER(LEN=100) :: dummy_char, number
  REAL(DP), DIMENSION(:,:,:), ALLOCATABLE :: writecache_real

  call system_clock(counti_global,count_rate)

  ! allocate cache
  allocate(writecache_real(5,cachesize,monitors))
  writecache_real = 0.0_dp

  ! fill values
  allocate(density(steps,monitors), pressure(steps,monitors), vel_x(steps,monitors), vel_y(steps,monitors), vel_z(steps,monitors))
  do n=1, monitors
    do t=1, steps
      call random_number(density(t,n))
      call random_number(pressure(t,n))
      call random_number(vel_x(t,n))
      call random_number(vel_y(t,n))
      call random_number(vel_z(t,n))
    end do
  end do

  ! create files
  do n=1, monitors
    write(number,'(I0.8)') n
    dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
    open(unit=20, file=trim(adjustl(dummy_char)), status='replace', action='write')
    close(20)
  end do

  call system_clock(counti)

  ! write data
  c = 0
  do t=1, steps
    c = c + 1
    do n=1, monitors
      writecache_real(1,c,n) = density(t,n)
      writecache_real(2,c,n) = pressure(t,n)
      writecache_real(3,c,n) = vel_x(t,n)
      writecache_real(4,c,n) = vel_y(t,n)
      writecache_real(5,c,n) = vel_z(t,n)
    end do
    if(c .EQ. cachesize .OR. t .EQ. steps) then
      !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(n,number,dummy_char,unitnumber,thread)
      thread = OMP_get_thread_num()
      unitnumber = thread + 20
      !$OMP DO
      do n=1, monitors
        write(number,'(I0.8)') n
        dummy_char = 'monitor_' // trim(adjustl(number)) // '.dat'
        open(unit=unitnumber, file=trim(adjustl(dummy_char)), status='old', action='write', position='append', buffered='yes')
        write(unitnumber,'(5ES25.15)') writecache_real(:,1:c,n)
        close(unitnumber)
      end do
      !$OMP END DO
      !$OMP END PARALLEL
      c = 0
    end if
  end do

  call system_clock(countf)
  call system_clock(countf_global)
  telapsed=real(countf-counti,kind=dp)/real(count_rate,kind=dp)
  telapsed_global=real(countf_global-counti_global,kind=dp)/real(count_rate,kind=dp)
  write(*,*)
  write(*,'(A,F15.6,A)') ' elapsed wall time for I/O: ', telapsed, ' seconds'
  write(*,'(A,F15.6,A)') ' global elapsed wall time: ', telapsed_global, ' seconds'
  write(*,*)

END PROGRAM iotest
The main features are: OpenMP parallelization and a manual write buffer.
Here are some of the timings on the Lustre file system with 16 threads:
cachesize=5: elapsed wall time for I/O: 991.627404 seconds
cachesize=10: elapsed wall time for I/O: 415.456265 seconds
cachesize=20: elapsed wall time for I/O: 93.842964 seconds
cachesize=50: elapsed wall time for I/O: 79.859099 seconds
cachesize=100: elapsed wall time for I/O: 23.937832 seconds
cachesize=1000: elapsed wall time for I/O: 10.472421 seconds
For reference, the results on a local workstation HDD with deactivated HDD write cache, 16 threads:
cachesize=1: elapsed wall time for I/O: 5.543722 seconds
cachesize=2: elapsed wall time for I/O: 2.791811 seconds
cachesize=3: elapsed wall time for I/O: 1.752962 seconds
cachesize=4: elapsed wall time for I/O: 1.630385 seconds
cachesize=5: elapsed wall time for I/O: 1.174099 seconds
cachesize=10: elapsed wall time for I/O: 0.700624 seconds
cachesize=20: elapsed wall time for I/O: 0.433936 seconds
cachesize=50: elapsed wall time for I/O: 0.425782 seconds
cachesize=100: elapsed wall time for I/O: 0.227552 seconds
As you can see, the implementation is still embarrassingly slow on the Lustre file system compared to an ordinary HDD, and I would need huge buffer sizes to reduce the I/O overhead to a tolerable extent. This would mean that the output lags behind, which violates the requirements formulated earlier.
Another promising approach was leaving the units open between consecutive writes. Unfortunately, the number of units open simultaneously is limited to typically 1024-4096 without root privileges. So this is not an option because the number of files can exceed this limit.
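For what it's worth, the soft descriptor limit can usually be raised as far as the hard limit without root, which may still fall short of 10,000 files:
$ ulimit -n          # show the current soft limit
$ ulimit -Hn         # show the hard limit
$ ulimit -n 4096     # raise the soft limit, up to the hard limit, no root needed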
How could the I/O overhead be further reduced while still fulfilling the requirements?
Edit 1
From the discussion with Gilles I learned that the Lustre file system can be tweaked even with normal user privileges. So I tried setting the stripe count to 1 as suggested (this was already the default setting) and decreased the stripe size to the minimum supported value of 64k (the default was 1M). However, this did not improve I/O performance with my test case. If anyone has additional hints on more suitable file system settings, please let me know.
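For anyone wanting to reproduce this, the commands look roughly like the following (hedged: option spellings can vary between Lustre versions):
$ lfs getstripe .              # inspect the current layout of the directory
$ lfs setstripe -c 1 -S 64k .  # stripe count 1, stripe size 64k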
For everyone suffering from small-file performance: the new Lustre release 2.11 allows storing small files directly on the MDT, which improves access time for those.
http://cdn.opensfs.org/wp-content/uploads/2018/04/Leers-Lustre-Data_on_MDT_An_Early_Look_DDN.pdf
lfs setstripe -E 1M -L mdt -E -1 fubar
will store the first megabyte of all files in directory fubar on the MDT.
I have been doing some benchmarking in Ruby and have the following results:
              user     system      total        real
part1     0.156000   0.000000   0.156000 (  0.158009)

              user     system      total        real
part2     0.015000   0.000000   0.015000 (  0.162010)
Commonly, as in part1, the total and real times are nearly the same. However, this is not true in part2.
What is the meaning of the total and real divergence in part2?
Does the divergence raise any concerns?
What run is faster?
user/system are CPU times, measured by the kernel, which schedules your process. real time is the elapsed wall-clock time of the run.
So real time being bigger than user + system means one of:
I/O or sleep in the code being tested
another process/daemon consuming CPU
In your part2, total (0.015 s) is far below real (0.162 s), so the code spent most of its time waiting rather than computing.
The results are organized in columns, in this order: user CPU time, system CPU time, the sum of the user and system CPU times, and the elapsed real time. The units of them all are seconds. So in real time, part1 was faster than part2.
Converting non-negative Integer to its list of digits is commonly done like this:
import Data.Char
digits :: Integer -> [Int]
digits = (map digitToInt) . show
I was trying to find a more direct way to perform the task, without involving a string conversion, but I'm unable to come up with something faster.
Things I've been trying so far:
The baseline:
digits :: Int -> [Int]
digits = (map digitToInt) . show
Got this one from another question on StackOverflow:
digits2 :: Int -> [Int]
digits2 = map (`mod` 10) . reverse . takeWhile (> 0) . iterate (`div` 10)
Trying to roll my own:
digits3 :: Int -> [Int]
digits3 = reverse . revDigits3
revDigits3 :: Int -> [Int]
revDigits3 n = case divMod n 10 of
  (0, digit)    -> [digit]
  (rest, digit) -> digit : revDigits3 rest
This one was inspired by showInt in Numeric:
digits4 n0 = go n0 [] where
  go n cs
    | n < 10    = n : cs
    | otherwise = go q (r:cs)
    where
      (q,r) = n `quotRem` 10
Now the benchmark. Note: I'm forcing the evaluation using filter.
λ>:set +s
λ>length $ filter (>5) $ concat $ map (digits) [1..1000000]
2400000
(1.58 secs, 771212628 bytes)
This is the reference. Now for digits2:
λ>length $ filter (>5) $ concat $ map (digits2) [1..1000000]
2400000
(5.47 secs, 1256170448 bytes)
That's 3.46 times longer.
λ>length $ filter (>5) $ concat $ map (digits3) [1..1000000]
2400000
(7.74 secs, 1365486528 bytes)
digits3 is 4.89 times slower. Just for fun, I tried using only revDigits3, avoiding the reverse.
λ>length $ filter (>5) $ concat $ map (revDigits3) [1..1000000]
2400000
(8.28 secs, 1277538760 bytes)
Strangely, this is even slower, 5.24 times slower.
And the last one:
λ>length $ filter (>5) $ concat $ map (digits4) [1..1000000]
2400000
(16.48 secs, 1779445968 bytes)
This is 10.43 times slower.
I was under the impression that only using arithmetic and cons would outperform anything involving a string conversion. Apparently, there's something I can't grasp.
So what's the trick? Why is digits so fast?
I'm using GHC 6.12.3.
Seeing as I can't add comments yet, I'll do a little bit more work and just analyze all of them. I'm putting the analysis at the top; however, the relevant data is below. (Note: all of this is done in 6.12.3 as well - no GHC 7 magic yet.)
Analysis:
Version 1: show is pretty good for ints, especially those as short as we have. Making strings actually tends to be decent in GHC; however, reading strings in and writing large strings out to files (or stdout, although you wouldn't want to do that) is where your code can absolutely crawl. I would suspect that a lot of the details behind why this is so fast are due to clever optimizations within show for Ints.
Version 2: This one was the slowest of the bunch when compiled. Some problems: reverse is strict in its argument. What this means is that you don't benefit from being able to perform computations on the first part of the list while you're computing the next elements; you have to compute them all, flip them, and then do your computations (namely (`mod` 10)) on the elements of the list. While this may seem small, it can lead to greater memory usage (note the 5GB of heap memory allocated here as well) and slower computations. (Long story short: don't use reverse.)
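A tiny demonstration of that strictness: reverse cannot produce even its first element until the entire input has been traversed, so the snippet below must walk all ten million cells before printing anything, whereas head [1 .. 10000000 :: Int] returns instantly.
main :: IO ()
main = print (head (reverse [1 .. 10000000 :: Int]))
-- prints 10000000, but only after building and flipping the whole list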
Version 3: Remember how I just said don't use reverse? Turns out, if you take it out, this one drops to 1.79s total execution time - barely slower than the baseline. The only problem here is that as you go deeper into the number, you're building up the spine of the list in the wrong direction (essentially, you're consing "into" the list with recursion, as opposed to consing "onto" the list).
Version 4: This is a very clever implementation. You benefit from several nice things: for one, quotRem should use the Euclidean algorithm, which is logarithmic in its larger argument. (Maybe it's faster, but I don't believe there's anything that's more than a constant factor faster than Euclid.) Furthermore, you cons onto the list as discussed last time, so that you don't have to resolve any list thunks as you go - the list is already entirely constructed when you come back around to parse it. As you can see, the performance benefits from this.
This code was probably the slowest in GHCi because a lot of the optimizations performed with the -O3 flag in GHC deal with making lists faster, whereas GHCi wouldn't do any of that.
Lessons: cons the right way onto a list, watch for intermediate strictness that can slow down computations, and do some legwork in looking at the fine-grained statistics of your code's performance. Also, compile with the -O3 flag: whenever you don't, all those people who put a lot of hours into making GHC super-fast get big ol' puppy eyes at you.
Data:
I just took all four functions, stuck them into one .hs file, and then changed as necessary to reflect the function in use. Also, I bumped your limit up to 5e6, because in some cases compiled code would run in less than half a second on 1e6, and this can start to cause granularity problems with the measurements we're making.
Compiler options: use ghc --make -O3 [filename].hs to have GHC do some optimization. We'll dump statistics to standard error using digits +RTS -sstderr.
Dumping to -sstderr gives us output that looks like this, in the case of digits1:
digits1 +RTS -sstderr
12000000
2,885,827,628 bytes allocated in the heap
446,080 bytes copied during GC
3,224 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 5504 collections, 0 parallel, 0.06s, 0.03s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.61s ( 1.66s elapsed)
GC time 0.06s ( 0.03s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.67s ( 1.69s elapsed)
%GC time 3.7% (1.5% elapsed)
Alloc rate 1,795,998,050 bytes per MUT second
Productivity 96.3% of total user, 95.2% of total elapsed
There are three key statistics here:
Total memory in use: only 1MB means this version is very space-efficient.
Total time: the 1.61s of mutator (MUT) time means nothing on its own, but we'll see how it looks against the other implementations.
Productivity: This is just 100% minus garbage collecting; since we're at 96.3%, this means that we're not creating a lot of objects that we leave lying around in memory.
Alright, let's move on to version 2.
digits2 +RTS -sstderr
12000000
5,512,869,824 bytes allocated in the heap
1,312,416 bytes copied during GC
3,336 bytes maximum residency (1 sample(s))
13,048 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 10515 collections, 0 parallel, 0.06s, 0.04s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 3.20s ( 3.25s elapsed)
GC time 0.06s ( 0.04s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.26s ( 3.29s elapsed)
%GC time 1.9% (1.2% elapsed)
Alloc rate 1,723,838,984 bytes per MUT second
Productivity 98.1% of total user, 97.1% of total elapsed
Alright, so we're seeing an interesting pattern.
Same amount of memory used. This means that this is a pretty good implementation, although it could mean that we need to test on higher sample inputs to see if we can find a difference.
It takes twice as long. We'll come back to some speculation as to why this is later.
It's actually slightly more productive, but given that GC is not a huge portion of either program, this doesn't tell us anything significant.
Version 3:
digits3 +RTS -sstderr
12000000
3,231,154,752 bytes allocated in the heap
832,724 bytes copied during GC
3,292 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 6163 collections, 0 parallel, 0.02s, 0.02s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.09s ( 2.08s elapsed)
GC time 0.02s ( 0.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.11s ( 2.10s elapsed)
%GC time 0.7% (1.0% elapsed)
Alloc rate 1,545,701,615 bytes per MUT second
Productivity 99.3% of total user, 99.3% of total elapsed
Alright, so we're seeing some strange patterns.
We're still at 1MB total memory in use. So we haven't hit anything memory-inefficient, which is good.
We're not quite at digits1, but we've got digits2 beat pretty easily.
Very little GC. (Keep in mind that anything over 95% productivity is very good, so we're not really dealing with anything too significant here.)
And finally, version 4:
digits4 +RTS -sstderr
12000000
1,347,856,636 bytes allocated in the heap
270,692 bytes copied during GC
3,180 bytes maximum residency (1 sample(s))
12,100 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2570 collections, 0 parallel, 0.00s, 0.01s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.09s ( 1.08s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.09s ( 1.09s elapsed)
%GC time 0.0% (0.8% elapsed)
Alloc rate 1,234,293,036 bytes per MUT second
Productivity 100.0% of total user, 100.5% of total elapsed
Wowza! Let's break it down:
We're still at 1MB total. This is almost certainly a feature of these implementations, as they remain at 1MB on inputs of 5e5 and 5e7. A testament to laziness, if you will.
We cut off about 32% of our original time, which is pretty impressive.
I suspect that the percentages here reflect the granularity in -sstderr's monitoring rather than any computation on superluminal particles.
Answering the question "why rem instead of mod?" in the comments. When dealing with positive values rem x y === mod x y so the only consideration of note is performance:
> import Test.QuickCheck
> quickCheck (\x y -> x > 0 && y > 0 ==> x `rem` y == x `mod` y)
So what is the performance? Unless you have a good reason not to (and being lazy isn't a good reason, neither is not knowing Criterion), use a good benchmarking tool. I used Criterion:
$ cat useRem.hs
import Criterion
import Criterion.Main

list :: [Integer]
list = [1..10000]

main = defaultMain
  [ bench "mod" (nf (map (`mod` 7)) list)
  , bench "rem" (nf (map (`rem` 7)) list)
  ]
Running this shows rem is measurably better than mod (compiled with -O2):
$ ./useRem
...
benchmarking mod
...
mean: 590.4692 us, lb 589.2473 us, ub 592.1766 us, ci 0.950
benchmarking rem
...
mean: 394.1580 us, lb 393.2415 us, ub 395.4184 us, ci 0.950
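As for why rem wins (my understanding, not something the benchmark itself shows): rem truncates toward zero, matching what the machine division instruction produces, while mod rounds toward negative infinity and needs an extra sign check and adjustment. The two only disagree when the operands' signs differ, as a GHCi session shows:
λ> (-7) `quotRem` 2
(-3,-1)
λ> (-7) `divMod` 2
(-4,1)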