ByteString concatMap performance - performance

I have a 37MB bin file I am trying to convert to a ppm sequence. It works fine, and I'm trying to use this as an exercise to learn some profiling and more about lazy bytestrings in Haskell. My program seems to bomb at the concatMap, which is used to replicate each byte three times so I have R, G, and B. The code is fairly straight forward - every 2048 bytes I write a new header:
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import System.Environment
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as B
main :: IO ()
main = do [from, to] <- getArgs
withFile from ReadMode $ \inH ->
withFile to WriteMode $ \outH ->
loop (B.hGet inH 2048) (process outH) B.null
loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)
process :: Handle -> B.ByteString -> IO ()
process h bs | B.null bs = return ()
| otherwise = B.hPut h header >> B.hPut h bs'
where header = "P6\n32 64\n255\n" :: B.ByteString
bs' = B.concatMap (B.replicate 3) bs
This pulls it off in a little over 5s. It's not terrible, and my only comparison is my very naive C implementation that does it a little under 4s - so that or ideally under has been my goal.
Here is the RTS from the above code:
33,435,345,688 bytes allocated in the heap
14,963,640 bytes copied during GC
54,640 bytes maximum residency (77 sample(s))
21,136 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 64604 colls, 0 par 0.20s 0.25s 0.0000s 0.0001s
Gen 1 77 colls, 0 par 0.00s 0.01s 0.0001s 0.0006s
INIT time 0.00s ( 0.00s elapsed)
MUT time 5.09s ( 5.27s elapsed)
GC time 0.21s ( 0.26s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.29s ( 5.52s elapsed)
%GC time 3.9% (4.6% elapsed)
Alloc rate 6,574,783,667 bytes per MUT second
Productivity 96.1% of total user, 92.1% of total elapsed
Pretty gnarly results. When I remove the concatMap and just copy everything over with the headers every 2048 bytes, it's practically instant:
70,983,992 bytes allocated in the heap
48,912 bytes copied during GC
54,640 bytes maximum residency (2 sample(s))
19,744 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 204 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0001s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.01s ( 0.07s elapsed)
GC time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.02s ( 0.07s elapsed)
%GC time 9.6% (2.9% elapsed)
Alloc rate 5,026,838,892 bytes per MUT second
Productivity 89.8% of total user, 22.3% of total elapsed
So I guess my question is two fold:
How can I improve the overall performance?
And had the bottleneck not been so obvious, what are some ways I could have tracked it down?
Thank you.
Edit
Here's the final code and RTS if anyone is interested! I was also able to find additional bottlenecks by making use of ghc's profiler with -prof -auto-all -caf-all after reading up on the Profiling and optimization chapter of Real World Haskell.
{-# LANGUAGE OverloadedStrings #-}
import System.IO
import System.Environment
import Control.Monad
import Data.Monoid
import qualified Data.ByteString.Builder as BU
import qualified Data.ByteString.Lazy.Char8 as BL
main :: IO ()
main = do [from, to] <- getArgs
withFile from ReadMode $ \inH ->
withFile to WriteMode $ \outH ->
loop (BL.hGet inH 2048) (process outH) BL.null
loop :: (Monad m) => m a -> (a -> m ()) -> (a -> Bool) -> m ()
loop inp outp done = inp >>= \x -> unless (done x) (outp x >> loop inp outp done)
upConcatMap :: Monoid c => (Char -> c) -> BL.ByteString -> c
upConcatMap f bs = mconcat . map f $ BL.unpack bs
process :: Handle -> BL.ByteString -> IO ()
process h bs | BL.null bs = return ()
| otherwise = BU.hPutBuilder h frame
where header = "P6\n32 64\n255\n"
bs' = BU.toLazyByteString $ upConcatMap trip bs
frame = BU.lazyByteString $ mappend header bs'
trip c = let b = BU.char8 c in mconcat [b, b, b]
6,383,263,640 bytes allocated in the heap
18,596,984 bytes copied during GC
54,640 bytes maximum residency (2 sample(s))
31,056 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 11165 colls, 0 par 0.06s 0.06s 0.0000s 0.0001s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0001s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.69s ( 0.83s elapsed)
GC time 0.06s ( 0.06s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.75s ( 0.89s elapsed)
%GC time 7.4% (7.2% elapsed)
Alloc rate 9,194,103,284 bytes per MUT second
Productivity 92.6% of total user, 78.0% of total elapsed

What about Builder?
This version is ~5x faster for me:
process :: Handle -> B.ByteString -> IO ()
process h bs
| B.null bs = return ()
| otherwise = B.hPut h header >> B.hPutBuilder h bs'
where header = "P6\n32 64\n255\n" :: B.ByteString
bs' = mconcat $ map triple $ B.unpack bs
triple c = let b = B.char8 c in mconcat [b, b, b]
It allocates much less garbage.
ADD: for the reference, runtime statistics:
4,642,746,104 bytes allocated in the heap
390,110,640 bytes copied during GC
63,592 bytes maximum residency (2 sample(s))
21,648 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8992 colls, 0 par 0.54s 0.63s 0.0001s 0.0017s
Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.98s ( 1.13s elapsed)
GC time 0.54s ( 0.63s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.52s ( 1.76s elapsed)
%GC time 35.4% (36.0% elapsed)
Alloc rate 4,718,237,910 bytes per MUT second
Productivity 64.6% of total user, 55.9% of total elapsed

Use a Builder to concatenate your ByteString from smaller ones and it will go faster. It's in the ByteString documentation.
Looking at the source, concatMap goes via a list:
concatMap :: (Word8 -> ByteString) -> ByteString -> ByteString
concatMap f = concat . foldr ((:) . f) []
And concat has to do a decent amount of work. Looks like the Builder advice is good.

Related

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out of memory error. I do not understand why this happens (the same strategy in java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
public static double calc(double p) {
double[] u = new double[10001];
double[] s = new double[10001];
u[0] = 0.0;
u[1] = 1.0;
s[0] = 0.0;
s[1] = 0.0;
for (int n=2;n<10001;n++) {
double q = 1.0;
double denom = 1.0;
for (int k = 1; k <= n; k++ ) {
denom = denom * (1.0 - p);
}
denom = 1.0 - denom;
s[n] = (double) n;
u[n] = (double) n;
for (int k = 1; k <= n; k++ ) {
q = (1.0 - p) * q;
if (k<n) {
double qs = (1.0-q)/denom;
double bs = (1.0-qs)*s[n-k] + qs*(s[k]+ u[n-k]) + 1.0;
if (bs < s[n]) {
s[n] = bs;
}
}
double bu = (1.0-q)*s[k] + 1.0 + u[n-k];
if (bu < u[n]) {
u[n] = bu;
}
}
}
return u[10000];
}
public static void main(String[] args) {
double s = 0.0;
int i = 2;
//for (int i = 1; i<51; i++) {
s = s + calc(i*0.01);
//}
System.out.println("result = " + s);
}
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment
main = do
(arg:_) <- getArgs
let n = read arg
print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time the proportion of the time spent in garbage collection and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)
main = do
(arg:_) <- getArgs
let n = read arg
forM_ [1..n] $ \i -> mua i `seq` return ()
print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs) but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order so you would know exactly which computations need to be performed. It would interesting to see your Java code to see which computations it is performing.

Why does this Haskell program perform so poorly?

I'm a Haskell newbie and I'm lost as to how this program performs so poorly. I tried forcing strict variables in various places but it doesn't seem to make a difference.
Here is my code (the purpose of this program is produce the frequencies of the input bytes found from standard input):
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent (forkIO, killThread)
import Control.Concurrent.MVar
import qualified Data.IntMap as IntMap
import Data.IntMap.Strict (IntMap)
import Control.Monad.Fix
import Control.Monad (when)
import qualified Data.Char as Char
import qualified System.IO as IO
import System.IO (hSetBinaryMode, hFlush)
import Data.List as List
import Text.PrettyPrint.Boxes as Boxes
import Text.Printf (printf)
import Data.Function
data BFreq = BFreq Integer (IntMap Integer)
main :: IO ()
main = do
putStrLn "analyze data from stdin"
hSetBinaryMode IO.stdin True
mv <- newEmptyMVar
tid <- forkIO $ statusUpdater mv
bf <- run mv
killThread tid
displayResults bf
resultTable :: [[String]] -> Box
resultTable rows =
Boxes.hsep 4 Boxes.left boxed_cols
where
cols = transpose rows
boxed_cols = map (Boxes.vcat Boxes.left . map text) cols
displayResults :: BFreq -> IO ()
displayResults (BFreq n counts) = do
putStrLn $ "read " ++ (show n) ++ " bytes"
when (n > 0) (displayFreqs n counts)
displayFreqs :: Integer -> IntMap Integer -> IO ()
displayFreqs n counts =
do
putStrLn "frequencies:"
Boxes.printBox $ resultTable rows
where
cmp x y = compare (snd y) (snd x)
sorted_counts = List.sortBy cmp $ IntMap.assocs counts
intdiv :: Integer -> Integer -> Float
intdiv a b = (fromIntegral a) / (fromIntegral b)
percent y = printf "%.2f" (100*intdiv y n)
show_byte x = (show $ Char.chr x) ++ " (" ++ (show x) ++ "):"
show_count y = (percent y) ++ "% (" ++ (show y) ++ ")"
rows = map (\(x,y) -> [show_byte x, show_count y]) sorted_counts
run :: MVar Integer -> IO BFreq
run mv =
fn mv 0 IntMap.empty
where
fn mv !n !mp =
do
tryPutMVar mv n
eof <- IO.isEOF
if eof
then return $ BFreq n mp
else do
b <- getChar
fn mv (1+n) (new_map b)
where
k x = Char.ord x
old_val x = IntMap.findWithDefault 0 (k x) mp
new_map x = IntMap.insert (k x) ((old_val x)+1) mp
statusUpdater :: MVar Integer -> IO ()
statusUpdater mv =
do
takeMVar mv >>= print_progress
statusUpdater mv
where
print_progress n =
do
putStr $ "\rbytes: "
when (gbs > 0) $ putStr $ (show gbs) ++ " GBs "
when (mbs > 0) $ putStr $ (show mbs) ++ " MBs "
when (kbs > 0) $ putStr $ (show kbs) ++ " KBs "
when (gbs < 1 && mbs < 1 && kbs < 1) $ putStr $ (show bs) ++ " Bs "
hFlush IO.stdout
where
(gbs, gbr) = quotRem n 0x40000000
(mbs, mbr) = quotRem gbr 0x100000
(kbs, bs) = quotRem mbr 0x400
here is what happens when I run it (note: I am compiling with -O2):
$> cabal build -v
creating dist/build
creating dist/build/autogen
Building bfreq-0.1.0.0...
Preprocessing executable 'bfreq' for bfreq-0.1.0.0...
Building executable bfreq...
creating dist/build/bfreq
creating dist/build/bfreq/bfreq-tmp
/usr/bin/ghc --make -o dist/build/bfreq/bfreq -hide-all-packages -fbuilding-cabal-package -package-conf dist/package.conf.inplace -i -idist/build/bfreq/bfreq-tmp -i. -idist/build/autogen -Idist/build/autogen -Idist/build/bfreq/bfreq-tmp -optP-include -optPdist/build/autogen/cabal_macros.h -odir dist/build/bfreq/bfreq-tmp -hidir dist/build/bfreq/bfreq-tmp -stubdir dist/build/bfreq/bfreq-tmp -package-id base-4.5.0.0-40b99d05fae6a4eea95ea69e6e0c9702 -package-id boxes-0.1.3-e03668bca38fe3e879f9d695618ddef3 -package-id containers-0.5.3.1-80819105034e34d03d22b1c20d6fd868 -O -O2 -rtsopts -XHaskell98 ./bfreq.hs
[1 of 1] Compiling Main ( bfreq.hs, dist/build/bfreq/bfreq-tmp/Main.o )
Linking dist/build/bfreq/bfreq ...
$> cat /dev/urandom | head -c 9999999 > test_data
$> cat ./test_data | ./dist/build/bfreq/bfreq +RTS -sstderr
analyze data from stdin
bytes: 9 MBs 521 KBs read 9999999 bytes
frequencies:
'\137' (137): 0.40% (39642)
'H' (72): 0.40% (39608)
<...>
'L' (76): 0.39% (38617)
'\246' (246): 0.39% (38609)
'I' (73): 0.38% (38462)
'q' (113): 0.38% (38437)
9,857,106,520 bytes allocated in the heap
14,492,245,840 bytes copied during GC
3,406,696,360 bytes maximum residency (13 sample(s))
14,691,672 bytes maximum slop
6629 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 18348 colls, 0 par 10.90s 10.90s 0.0006s 0.0180s
Gen 1 13 colls, 0 par 15.20s 19.65s 1.5119s 12.6403s
INIT time 0.00s ( 0.00s elapsed)
MUT time 14.45s ( 14.79s elapsed)
GC time 26.10s ( 30.56s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 40.55s ( 45.35s elapsed)
%GC time 64.4% (67.4% elapsed)
Alloc rate 682,148,818 bytes per MUT second
Productivity 35.6% of total user, 31.9% of total elapsed
So unless I'm misinterpreting the above debug output, my program is using 6 GB?
The test data is less than 10 MB, so what's going on?
Any general advice on how to approach a problem like this in Haskell would be nice too. In otherwords, should I avoid Haskell for this kind of I/O-centric stuff? Should I be using the pipes library for this kind of thing?
EDIT:
Thanks for the help, correctly importing the strict version of IntMap fixes the memory issues.
I haven't been able to get the profiling (-fprof-auto) to work because it seems none of my packages are compiled for profiling. I solved the lack of profiling base libraries by installing the ghc profiling package for my OS (ubuntu: ghc-prof), but according to this I'll need to manually re-install all my haskell libraries for profiling. I don't have the time to do this at the moment, so I'm just putting this link here for the benefit of anyone who has a similar problem.
If you compile with -fprof-auto as per the GHC guide chapter on profiling, you'll see large allocation happening in run.fn.new_map and run.fn.
The code in question:
new_map x = IntMap.insert (k x) ((old_val x)+1) mp
Suspicion: ((old_val x)+1) is creating a chain of unevaluated thunks. Proposed change:
new_map x = let ov = old_val x + 1 in
ov `seq` IntMap.insert (k x) ov mp
Voila! Allocations, GC, and memory use are all way down.
EDIT: You probably intended to import qualified Data.IntMap.Strict as IntMap, making this change unnecessary.

Why Haskell function execution time measurment differs from ghc timings?

I want to measure time which Haskell spent to execute some function and use TimeIt package(also i tried these recommendations). But showed time differs from actual time application spent (I've ran application with +RTS -sstderr option):
CPU time: 4.85s
...
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.98s ( 61.69s elapsed)
GC time 0.22s ( 0.19s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 1.20s ( 61.89s elapsed)
Application source:
import qualified Data.ByteString.Lazy.Char8 as LBS
import System.Environment
import Data.Char
import Data.Int
import System.TimeIt
readChunks :: Int64 -> LBS.ByteString -> Int64
readChunks size str
| LBS.null str = 0
| otherwise = let (chunk, rest) = LBS.splitAt size str
in do
let len = LBS.length chunk
len `seq` len + readChunks size rest
processFile :: String -> IO()
processFile name = do
putStrLn name
content <- LBS.readFile name
let
(recNumStr, rest) = LBS.span (not.isControl) content
recNum = LBS.readInt recNumStr
case recNum of
Nothing -> putStrLn "can't parse"
Just (value, rest) -> print (value)
let chunkSize = 100*1024*1024
timeIt $ print (readChunks chunkSize rest)
UPDATE: I've found that Chronograph package shows right execution time (information taken from this question).
Well you are doing significant work that isn't being timed - it seems reasonable that this work makes up the difference, namely:
putStrLn name
content <- LBS.readFile name
let
(recNumStr, rest) = LBS.span (not.isControl) content
recNum = LBS.readInt recNumStr
case recNum of
Nothing -> putStrLn "can't parse"
Just (value, rest) -> print (value)
If you time that as well then you'll probably find most the difference. Also note there are other operations before you even enter main (which is true even for C programs).

Go memory consumption management

I am new to Go and trying to figure out how it manages memory consumption.
I have trouble with memory in one of my test projects. I don't understand why Go uses more and more memory (never freeing it) when my program runs for a long time.
I am running the test case provided below. After the first allocation, program uses nearly 350 MB of memory (according to ActivityMonitor). Then I try to free it and ActivityMonitor shows that memory consumption doubles. Why?
I am running this code on OS X using Go 1.0.3.
What is wrong with this code? And what is the right way to manage large variables in Go programs?
I had another memory-management-related problem when implementing an algorithm that uses a lot of time and memory; after running it for some time it throws an "out of memory" exception.
package main
import ("fmt"
"time"
)
func main() {
fmt.Println("getting memory")
tmp := make([]uint32, 100000000)
for kk, _ := range tmp {
tmp[kk] = 0
}
time.Sleep(5 * time.Second)
fmt.Println("returning memory")
tmp = make([]uint32, 1)
tmp = nil
time.Sleep(5 * time.Second)
fmt.Println("getting memory")
tmp = make([]uint32, 100000000)
for kk, _ := range tmp {
tmp[kk] = 0
}
time.Sleep(5 * time.Second)
fmt.Println("returning memory")
tmp = make([]uint32, 1)
tmp = nil
time.Sleep(5 * time.Second)
return
}
Currently, go uses a mark-and-sweep garbage collector, which in general does not define when the object is thrown away.
However, if you look closely, there is a go routine called sysmon which essentially runs as long as your program does and calls the GC periodically:
// forcegcperiod is the maximum time in nanoseconds between garbage
// collections. If we go this long without a garbage collection, one
// is forced to run.
//
// This is a variable for testing purposes. It normally doesn't change.
var forcegcperiod int64 = 2 * 60 * 1e9
(...)
// If a heap span goes unused for 5 minutes after a garbage collection,
// we hand it back to the operating system.
scavengelimit := int64(5 * 60 * 1e9)
forcegcperiod determines the period after which the GC is called by force. scavengelimit determines when spans are returned to the operating system. Spans are a number of memory pages which can hold several objects. They're kept for scavengelimit time and are freed if no object is on them and scavengelimit is exceeded.
Further down in the code you can see that there is a trace option. You can use this to see, whenever the
scavenger thinks he needs to clean up:
$ GOGCTRACE=1 go run gc.go
gc1(1): 0+0+0 ms 0 -> 0 MB 423 -> 350 (424-74) objects 0 handoff
gc2(1): 0+0+0 ms 1 -> 0 MB 2664 -> 1437 (2880-1443) objects 0 handoff
gc3(1): 0+0+0 ms 1 -> 0 MB 4117 -> 2213 (5712-3499) objects 0 handoff
gc4(1): 0+0+0 ms 2 -> 1 MB 3128 -> 2257 (6761-4504) objects 0 handoff
gc5(1): 0+0+0 ms 2 -> 0 MB 8892 -> 2531 (13734-11203) objects 0 handoff
gc6(1): 0+0+0 ms 1 -> 1 MB 8715 -> 2689 (20173-17484) objects 0 handoff
gc7(1): 0+0+0 ms 2 -> 1 MB 5231 -> 2406 (22878-20472) objects 0 handoff
gc1(1): 0+0+0 ms 0 -> 0 MB 172 -> 137 (173-36) objects 0 handoff
getting memory
gc2(1): 0+0+0 ms 381 -> 381 MB 203 -> 202 (248-46) objects 0 handoff
returning memory
getting memory
returning memory
As you can see, no gc invoke is done between getting and returning. However, if you change
the delay from 5 seconds to 3 minutes (more than the 2 minutes from forcegcperiod),
the objects are removed by the gc:
returning memory
scvg0: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
scvg0: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
scvg1: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
scvg1: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
gc9(1): 1+0+0 ms 1 -> 1 MB 4485 -> 2562 (26531-23969) objects 0 handoff
gc10(1): 1+0+0 ms 1 -> 1 MB 2563 -> 2561 (26532-23971) objects 0 handoff
scvg2: GC forced // forcegc (2 minutes) exceeded
scvg2: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
gc3(1): 0+0+0 ms 381 -> 381 MB 206 -> 206 (252-46) objects 0 handoff
scvg2: GC forced
scvg2: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
getting memory
The memory is still not freed, but the GC marked the memory region as unused. Freeing will begin when
the used span is unused and older than limit. From scavenger code:
if(s->unusedsince != 0 && (now - s->unusedsince) > limit) {
// ...
runtime·SysUnused((void*)(s->start << PageShift), s->npages << PageShift);
}
This behavior may of course change over time, but I hope you now get a bit of a feel when objects
are thrown away by force and when not.
As pointed out by zupa, releasing objects may not return the memory to the operating system, so on
certain systems you may not see a change in memory usage. This seems to be the case for Plan 9
and Windows according to this thread on golang-nuts.
To eventually (force) collect unused memory you must call runtime.GC().
variable = nil may make things unreachable and thus eligible for collection, but it per se doesn't free anything.

How do I optimize a loop which can be fully strict

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.
(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).
In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind ought to be optimized to a simple loop. As you can see I've attempted to make various bits of the program strict.
So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?
(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)
-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4
{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
| n < 10 = i
| otherwise = foldDigits f i d
where (d, m) = n `quotRem` 10
!i = f acc m
reverseNumber :: Int -> Int
reverseNumber !n
= foldDigits accumulate 0 n
where accumulate !v !d = v * 10 + d
allDigitsOdd :: Int -> Bool
allDigitsOdd n
= foldDigits andOdd True n
where andOdd !a d = a && isOdd d
isOdd !x = x `rem` 2 /= 0
isReversible :: Int -> Bool
isReversible n
= notDivisibleByTen n && allDigitsOdd (n + rn)
where rn = reverseNumber n
notDivisibleByTen !x = x `rem` 10 /= 0
countRange acc start end
| start > end = acc
| otherwise = countRange (acc + v) (start + 1) end
where v = if isReversible start then 1 else 0
main
= print $ sum $ parMap rseq cr ranges
where max = 1000000000
qmax = max `div` 4
ranges = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
cr (s, e) = countRange 0 s e
As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is
Rec {
$wfoldDigits_r2cK
:: forall a_aha.
(a_aha -> GHC.Types.Int -> a_aha)
-> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
\ (# a_aha)
(w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
(w1_s285 :: a_aha)
(ww_s288 :: GHC.Prim.Int#) ->
case w1_s285 of acc_Xhi { __DEFAULT ->
let {
ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
[LclId, Str=DmdType]
ds_sNo =
case GHC.Prim.quotRemInt# ww_s288 10
of _ { (# ipv_aJA, ipv1_aJB #) ->
(GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
} } in
case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
of i_ahg { __DEFAULT ->
case GHC.Prim.<# ww_s288 10 of _ {
GHC.Types.False ->
case ds_sNo of _ { (d_Xsi, m_Xs5) ->
case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
$wfoldDigits_r2cK # a_aha w_s284 i_ahg ww1_X28L
}
};
GHC.Types.True -> i_ahg
}
}
}
end Rec }
which, as you can see, re-boxes the result of the quotRem call. The problem is that no property of f is available here, and as a recursive function, foldDigits cannot be inlined.
With a manual worker-wrapper transform making the function argument static,
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
where
go !acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (f acc r) q
foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.
Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
\ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
case ww1_s28J of ds_XJh {
__DEFAULT ->
case GHC.Prim.quotRemInt# ds_XJh 10
of _ { (# ipv_aJK, ipv1_aJL #) ->
$wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
};
0 -> ww_s28F
}
end Rec }
and the effect on computation time is tangible, for the original, I got
$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
196,407,088 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1827331 colls, 1827331 par 23.77s 11.86s 0.0000s 0.0041s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 54.94% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 620.52s (313.51s elapsed)
GC time 23.77s ( 11.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 644.29s (325.37s elapsed)
Alloc rate 2,923,834,808 bytes per MUT second
(I used -N2 since my i5 only has two physical cores), vs.
$ ./eul145 +RTS -s -N2
608720
16,000,063,624 bytes allocated in the heap
403,384 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15852 colls, 15852 par 0.34s 0.17s 0.0000s 0.0037s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 43.86% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 314.85s (160.08s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 315.20s (160.25s elapsed)
Alloc rate 50,817,657 bytes per MUT second
Productivity 99.9% of total user, 196.5% of total elapsed
with the modification. The running time roughly halved, and the allocations reduced 100-fold.

Resources