Is there any difference in how hugepage and non-hugepage RAM map to the CPU cache?

My server CPU info:
Model name: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Stepping: 2
CPU MHz: 3200.000
CPU max MHz: 3200.0000
CPU min MHz: 1200.0000
BogoMIPS: 6391.94
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
OS kernel version:
3.10.0-1160.el7.x86_64
I'm doing some performance testing on my server and ran into some confusing results with hugepage-backed RAM.
I'm using the code from https://github.com/torvalds/test-tlb
and running the following test shell script:
for i in 128k 512k 1M 2M 4M 8M 16M 32M 64M 128M; do echo "$i:"; ./test-tlb -H 4G $i ; ./test-tlb 4G $i ; ./test-tlb -Hr 4G $i ; ./test-tlb -r 4G $i ; done
I got the following results (the four lines per size correspond to the four invocations in the script, in order: -H, plain, -Hr, -r):
128k:
87.24ns (~279.2 cycles)
98.24ns (~314.4 cycles)
90.07ns (~288.2 cycles)
108.19ns (~346.2 cycles)
512k:
87.17ns (~279.0 cycles)
92.55ns (~296.1 cycles)
90.51ns (~289.6 cycles)
95.03ns (~304.1 cycles)
1M:
88.28ns (~282.5 cycles)
34.73ns (~111.1 cycles)
90.76ns (~290.4 cycles)
36.47ns (~116.7 cycles)
2M:
91.01ns (~291.2 cycles)
34.91ns (~111.7 cycles)
91.56ns (~293.0 cycles)
35.51ns (~113.6 cycles)
4M:
90.92ns (~290.9 cycles)
34.96ns (~111.9 cycles)
91.62ns (~293.2 cycles)
35.46ns (~113.5 cycles)
8M:
90.99ns (~291.2 cycles)
35.04ns (~112.1 cycles)
91.55ns (~293.0 cycles)
35.31ns (~113.0 cycles)
16M:
77.65ns (~248.5 cycles)
35.38ns (~113.2 cycles)
78.23ns (~250.4 cycles)
35.49ns (~113.6 cycles)
32M:
20.51ns (~65.6 cycles)
35.47ns (~113.5 cycles)
20.61ns (~65.9 cycles)
35.96ns (~115.1 cycles)
64M:
21.06ns (~67.4 cycles)
25.61ns (~81.9 cycles)
20.66ns (~66.1 cycles)
23.35ns (~74.7 cycles)
128M:
21.12ns (~67.6 cycles)
14.64ns (~46.8 cycles)
20.69ns (~66.2 cycles)
14.92ns (~47.7 cycles)
From a stride of 1MB onward there is a significant difference: latency is higher when hugepage-backed RAM is used (THP and reserved hugepages produce the same result, as I tested).
So is there anything at the hardware or OS kernel level that could cause this confusing result?
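To put a number on why this is confusing, here is a back-of-the-envelope footprint calculation of my own (not from the original post), assuming the benchmark walks one pointer per stride and each access touches one 64-byte cache line. At a 1MB stride over the 4GB mapping:

\[
\frac{4\,\mathrm{GiB}}{1\,\mathrm{MiB}} = 4096\ \text{accesses}, \qquad 4096 \times 64\,\mathrm{B} = 256\,\mathrm{KiB}
\]

Under that assumption the working set fits easily inside the 20MB L3, so the roughly 90ns hugepage figures are not simply a matter of the data being too large for the cache.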

Related

Packed large bit vector with efficient xor and bit count in Haskell

I am looking for an efficient (in both space and time) data type which can hold a 384 bit vector and supports efficient XOR and "bit count" (number of bits set to 1) operations.
Below, please find my demo program. The operations I need are all in the SOQuestionOps type class and I have implemented it for Natural and Data.Vector.Unboxed.Bit. Especially the latter seems perfect as it has a zipWords operation which should allow me to do operations like "bit count" and XOR word-by-word instead of bit-by-bit. Also it claims to store the bits packed (8 bits per byte).
{-# LANGUAGE FlexibleInstances #-}
import Data.Bits
import Data.List (foldl')
import Numeric.Natural
import qualified Data.Vector as V
import qualified Data.Vector.Unboxed.Bit as BV

class SOQuestionOps a where
    soqoXOR :: a -> a -> a
    soqoBitCount :: a -> Int
    soqoFromList :: [Bool] -> a

alternating :: Int -> [Bool]
alternating n =
    let c = n `mod` 2 == 0
    in if n == 0
         then []
         else c : alternating (n-1)

instance SOQuestionOps Natural where
    soqoXOR = xor
    soqoBitCount = popCount
    soqoFromList v =
        let oneIdxs = map snd $ filter fst (zip v [0..])
        in foldl' (\acc n -> acc `setBit` n) 0 oneIdxs

instance SOQuestionOps (BV.Vector BV.Bit) where
    soqoXOR = BV.zipWords xor
    soqoBitCount = BV.countBits
    soqoFromList v = BV.fromList (map BV.fromBool v)

main =
    let initialVec :: BV.Vector BV.Bit
        initialVec = soqoFromList $ alternating 384
        lotsOfVecs = V.replicate 10000000 (soqoFromList $ take 384 $ repeat True)
        xorFolded = V.foldl' soqoXOR initialVec lotsOfVecs
        sumBitCounts = V.foldl' (\n v -> n + soqoBitCount v) 0 lotsOfVecs
    in putStrLn $ "folded bit count: " ++ show (soqoBitCount xorFolded) ++ ", sum: " ++ show sumBitCounts
So let's calculate numbers for the best case: lotsOfVecs shouldn't need to allocate much because it's just 10,000,000 times the same vector initialVec. The fold obviously creates one of these vectors per fold operation, so it should create 10,000,000 bit vectors. The bit counting shouldn't create anything other than 10,000,000 Ints. So in the best case, my program should use very little (and constant) memory, and the total allocations should roughly be 10,000,000 * sizeof(bit vector) + 10,000,000 * sizeof(Int) = 520,000,000 bytes.
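Spelling that estimate out (my arithmetic, assuming 384 bits = 48 bytes of payload per vector and 4-byte Ints, and ignoring any per-object overhead):

\[
10^{7} \times 48\,\mathrm{B} + 10^{7} \times 4\,\mathrm{B} = 480{,}000{,}000\,\mathrm{B} + 40{,}000{,}000\,\mathrm{B} = 520{,}000{,}000\,\mathrm{B}
\]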
Ok, let's run the program for Natural:
let's make initialVec :: Natural, compile with
ghc --make -rtsopts -O3 MemStuff.hs
result (this is with GHC 7.10.1):
$ ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 3840000000
1,280,306,112 bytes allocated in the heap
201,720 bytes copied during GC
80,106,856 bytes maximum residency (2 sample(s))
662,168 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 2321 colls, 0 par 0.056s 0.059s 0.0000s 0.0530s
Gen 1 2 colls, 0 par 0.065s 0.069s 0.0346s 0.0674s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.579s ( 0.608s elapsed)
GC time 0.122s ( 0.128s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 0.702s ( 0.738s elapsed)
%GC time 17.3% (17.3% elapsed)
Alloc rate 2,209,576,763 bytes per MUT second
Productivity 82.7% of total user, 78.7% of total elapsed
real 0m0.754s
user 0m0.704s
sys 0m0.037s
which shows 1,280,306,112 bytes allocated in the heap - that's in the ballpark (2x) of the expected figure. By the way, on GHC 7.8 this allocates 353,480,272,096 bytes and runs for absolute ages, as popCount isn't very efficient on GHC 7.8's Naturals.
EDIT: I changed the code a bit. In the original version, every other vector was 0 in the fold, which gave much better allocation figures for the Natural version. I changed it so the vector alternates between two different representations (both with many bits set), and now we see 2x the expected allocations. That's another downside of Natural (and Integer): the allocation rate depends on the values.
But maybe we can do better, let's try the densely packed Data.Vector.Unboxed.Bit:
That's initialVec :: BV.Vector BV.Bit and re-compile and re-run with the same options.
$ time ./MemStuff +RTS -sstderr
folded bit count: 192, sum: 1920000000
75,120,306,536 bytes allocated in the heap
54,914,640 bytes copied during GC
80,107,368 bytes maximum residency (2 sample(s))
664,128 bytes maximum slop
78 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 145985 colls, 0 par 0.543s 0.627s 0.0000s 0.0577s
Gen 1 2 colls, 0 par 0.065s 0.070s 0.0351s 0.0686s
INIT time 0.000s ( 0.000s elapsed)
MUT time 27.679s ( 28.228s elapsed)
GC time 0.608s ( 0.698s elapsed)
EXIT time 0.000s ( 0.002s elapsed)
Total time 28.288s ( 28.928s elapsed)
%GC time 2.1% (2.4% elapsed)
Alloc rate 2,714,015,097 bytes per MUT second
Productivity 97.8% of total user, 95.7% of total elapsed
real 0m28.944s
user 0m28.290s
sys 0m0.456s
That's very slow and roughly 100 times the allocations :(.
OK, then let's recompile and profile both runs (ghc --make -rtsopts -O3 -prof -auto-all -caf-all -fforce-recomp MemStuff.hs):
The Natural version:
COST CENTRE MODULE %time %alloc
main.xorFolded Main 51.7 76.0
main.sumBitCounts.\ Main 25.4 16.0
main.sumBitCounts Main 12.1 0.0
main.lotsOfVecs Main 10.4 8.0
The Data.Vector.Unboxed.Bit version:
COST CENTRE MODULE %time %alloc
soqoXOR Main 96.7 99.3
main.sumBitCounts.\ Main 1.9 0.2
Is Natural really the best option for a fixed-size bit vector? And what about GHC 7.8? And is there anything better which can implement my SOQuestionOps type class?
Have a look at the Data.LargeWord module in the Crypto package:
http://hackage.haskell.org/package/Crypto-4.2.5.1/docs/Data-LargeWord.html
It provides Bits instances for large words of various sizes, e.g. 96 through 256 bits.
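For comparison, here is a minimal, dependency-free sketch of my own (not from the answer) of a fixed-width 384-bit vector stored as six strict Word64 fields; BV384, packWord and chunksOf are names I introduce here, and the question's SOQuestionOps class is repeated only to keep the snippet self-contained. xor and popCount then work word-by-word, which is the behaviour the question hoped to get from Data.Vector.Unboxed.Bit.

import Data.Bits (xor, popCount, setBit)
import Data.List (foldl')
import Data.Word (Word64)

-- Six unpacked machine words give exactly 384 bits.
data BV384 = BV384 !Word64 !Word64 !Word64 !Word64 !Word64 !Word64

-- Same class as in the question, repeated so the sketch stands alone.
class SOQuestionOps a where
    soqoXOR      :: a -> a -> a
    soqoBitCount :: a -> Int
    soqoFromList :: [Bool] -> a

instance SOQuestionOps BV384 where
    soqoXOR (BV384 a0 a1 a2 a3 a4 a5) (BV384 b0 b1 b2 b3 b4 b5) =
        BV384 (xor a0 b0) (xor a1 b1) (xor a2 b2)
              (xor a3 b3) (xor a4 b4) (xor a5 b5)
    soqoBitCount (BV384 a0 a1 a2 a3 a4 a5) =
        popCount a0 + popCount a1 + popCount a2 +
        popCount a3 + popCount a4 + popCount a5
    soqoFromList bs =
        case map packWord (chunksOf 64 (take 384 (bs ++ repeat False))) of
            [w0, w1, w2, w3, w4, w5] -> BV384 w0 w1 w2 w3 w4 w5
            _                        -> error "soqoFromList: impossible"
      where
        -- Pack up to 64 Bools into one word, least significant bit first.
        packWord = foldl' (\w (i, b) -> if b then setBit w i else w) 0 . zip [0 .. 63]
        chunksOf n xs = case splitAt n xs of
            (c, [])   -> [c]
            (c, rest) -> c : chunksOf n rest

main :: IO ()
main = do
    let v1 = soqoFromList (replicate 384 True)  :: BV384
        v2 = soqoFromList (cycle [True, False]) :: BV384
    print (soqoBitCount (soqoXOR v1 v2))        -- prints 192

The Data.LargeWord types mentioned in the answer could play a similar role for the sizes they provide.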

Optimizing a program and calculating % of total execution time improved

So I was told to ask this here instead of on StackExchange:
If I have a program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with a sequence of 3 multiply instructions, the optimized program will be P'. The CPI of multiplication is 2 and the CPI of power is 12. If there are 10^9 such operations optimized, what is the percent of total execution time improved?
Here is what I've deduced so far.
For P, we have:
time (30s)
CPI: 12
Frequency (2GHz)
For P', we have:
CPI (6) [2*3]
Frequency (2GHz)
So I need to figure out how to calculate the time of P' in order to compare the times. But I have no idea how to achieve this. Could someone please help me out?
Program P, which runs on a 2GHz machine M in 30 seconds, is optimized by replacing all instances of 'raise to the power 4' with a sequence of 3 multiply instructions. This optimized program will be P'. The CPI of multiplication is 2 and the CPI of power is 12. If there are 10^9 such operations optimized,
From this information we can compute the time needed to execute all POWER4 ("raise to the power 4") instructions: we have the total count of such instructions (all POWER4 were replaced, so the count is 10^9, or 1 G). Every POWER4 instruction needs 12 clock cycles (CPI = clocks per instruction), so all POWER4 instructions were executed in 1G * 12 = 12G cycles.
A 2GHz machine has 2G cycles per second, and there are 30 seconds of execution, so the total execution of program P is 2G * 30 = 60G cycles (60 * 10^9). We can conclude that program P has some other instructions too. We don't know which instructions they are, how many times they execute, or their mean CPI. But we do know that the time needed to execute those other instructions is 60G - 12G = 48G cycles (total program running time minus POWER4 running time - true for simple processors). So there are some X executed instructions with mean CPI Y, and X*Y = 48G.
So, total cycles executed for the program P is
Freq * seconds = POWER4_count * POWER4_CPI + OTHER_count * OTHER_mean_CPI
2G * 30 = 1G * 12 + X*Y
Or total running time for P:
30s = (1G * 12 + X*Y) / 2GHz
what is the percent of total execution time improved?
After replacing 1G POWER4 operations with 3 times as many MUL (multiply) instructions, we have 3G MUL operations, and the cycles needed for them are CPI * count, where the MUL CPI is 2: 2 * 3G = 6G cycles. The X*Y part of P' is unchanged, so we can now solve the problem.
P' time in seconds = ( MUL_count * MUL_CPI + OTHER_count * OTHER_mean_CPI ) / Frequency
P' time = (3G*2 + X*Y) / 2GHz
The improvement is not as big as might be expected, because the POWER4 instructions in P take only part of the running time: 12G out of 60G cycles. The optimization shrinks that 12G to 6G without touching the remaining 48G cycles, and by halving only part of the time we do not get half of the total time.
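A quick Haskell sketch of my own that just plugs in the figures derived above (it is not part of the original answer) gives P' = 27 s, i.e. a 10% improvement:

-- Quick check of the numbers derived in the answer above.
main :: IO ()
main = do
    let freq        = 2.0e9                      -- 2 GHz: cycles per second
        totalCycles = freq * 30                  -- P runs 30 s -> 60 G cycles
        power4      = 1.0e9 * 12                 -- 1 G POWER4 ops at CPI 12 -> 12 G cycles
        other       = totalCycles - power4       -- X*Y = 48 G cycles, unchanged in P'
        mulCycles   = 3.0e9 * 2                  -- 3 G MUL ops at CPI 2 -> 6 G cycles
        pPrimeTime  = (mulCycles + other) / freq -- running time of P'
        improvement = (30 - pPrimeTime) / 30 * 100
    putStrLn ("P' time: " ++ show pPrimeTime ++ " s")       -- 27.0 s
    putStrLn ("Improvement: " ++ show improvement ++ " %")  -- 10.0 %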

OpenCL matrix multiplication should be faster?

I'm trying to learn how to write GPU-optimized OpenCL kernels. I took an example of matrix multiplication using square tiles in local memory. However, in the best case I got only a ~10x speedup (~50 Gflops) compared to numpy.dot() (~5 Gflops; it uses BLAS).
I found studies where they got speedup >200x ( >1000 Gflops ).
ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf
I don't know what I'm doing wrong, or whether it is just because of my GPU (an NVIDIA GTX 275), or because of some pyOpenCL overhead. I also measured how long it takes just to copy the result from the GPU to RAM, and it is only ~10% of the matrix multiplication time.
#define BLOCK_SIZE 22
__kernel void matrixMul(
    __global float* Cij,
    __global float* Aik,
    __global float* Bkj,
    __const int ni,
    __const int nj,
    __const int nk
){
    // WARNING: interchanging the i and j dimensions lowers performance >2x on my nV GT275 GPU
    int gj = get_global_id(0);  int gi = get_global_id(1);
    int bj = get_group_id(0);   int bi = get_group_id(1);   // Block index
    int tj = get_local_id(0);   int ti = get_local_id(1);   // Thread index
    int oj = bi*BLOCK_SIZE;     int oi = bj*BLOCK_SIZE;
    float Csub = 0;
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];
    for (int ok = 0; ok < nk; ok += BLOCK_SIZE) {
        As[ti][tj] = Aik[ nk*(gi)    + tj + ok ];  // A[i][k]
        Bs[ti][tj] = Bkj[ nj*(ti+ok) + gj ];       // B[k][j]
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < BLOCK_SIZE; ++k) Csub += As[ti][k] * Bs[k][tj];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    Cij[ nj*(gi) + gj ] = Csub;
}
NOTE - the strange BLOCK_SIZE=22 is the maximum BLOCK_SIZE that fits into the max work_group_size, which is 512 on my GPU. In this code the condition BLOCK_SIZE^2 < max work_group_size must hold; 22 = int(sqrt(512)). I also tried BLOCK_SIZE=16 and 8, but both were slower than 22.
I also tried simple matrixMul (without using local memory) but it was even 10-times slower than numpy.dot().
I copied the code from here:
http://gpgpu-computing4.blogspot.cz/2009/10/matrix-multiplication-3-opencl.html
They say that even the simple version (without local memory) should run 200x faster than the CPU. I don't understand that.
The dependence of performance on N in my case is:
N = 220 numpy 3.680 [Gflops] GPU 16.428 [Gflops] speedUp 4.464
N = 330 numpy 4.752 [Gflops] GPU 29.487 [Gflops] speedUp 6.205
N = 440 numpy 4.914 [Gflops] GPU 37.096 [Gflops] speedUp 7.548
N = 550 numpy 3.849 [Gflops] GPU 47.019 [Gflops] speedUp 12.217
N = 660 numpy 5.251 [Gflops] GPU 49.999 [Gflops] speedUp 9.522
N = 770 numpy 4.565 [Gflops] GPU 48.567 [Gflops] speedUp 10.638
N = 880 numpy 5.452 [Gflops] GPU 44.444 [Gflops] speedUp 8.152
N = 990 numpy 4.976 [Gflops] GPU 42.187 [Gflops] speedUp 8.478
N = 1100 numpy 5.324 [Gflops] GPU 83.187 [Gflops] speedUp 15.625
N = 1210 numpy 5.401 [Gflops] GPU 57.147 [Gflops] speedUp 10.581
N = 1320 numpy 5.450 [Gflops] GPU 48.936 [Gflops] speedUp 8.979
NOTE - the "Gflops" number is obtained as N^3/time and it does include time required to copy results from GPU to main memory, but this time is just few percent of total time especially for N>1000
maybe more pictorial is time in secons:
N = 220 numpy 0.003 [s] GPU 0.001 [s] load 0.001 [s] speedUp 5.000
N = 330 numpy 0.008 [s] GPU 0.001 [s] load 0.001 [s] speedUp 7.683
N = 440 numpy 0.017 [s] GPU 0.002 [s] load 0.001 [s] speedUp 7.565
N = 550 numpy 0.043 [s] GPU 0.004 [s] load 0.001 [s] speedUp 11.957
N = 660 numpy 0.055 [s] GPU 0.006 [s] load 0.002 [s] speedUp 9.298
N = 770 numpy 0.100 [s] GPU 0.009 [s] load 0.003 [s] speedUp 10.638
N = 880 numpy 0.125 [s] GPU 0.010 [s] load 0.000 [s] speedUp 12.097
N = 990 numpy 0.195 [s] GPU 0.015 [s] load 0.000 [s] speedUp 12.581
N = 1100 numpy 0.250 [s] GPU 0.031 [s] load 0.000 [s] speedUp 8.065
N = 1210 numpy 0.328 [s] GPU 0.031 [s] load 0.000 [s] speedUp 10.581
N = 1320 numpy 0.422 [s] GPU 0.047 [s] load 0.000 [s] speedUp 8.979
I was thinking that maybe some speed improvement could be obtained by using async_work_group_copy or even read_imageui to copy blocks into local memory. But I don't understand why I see such a big difference when I'm using basically the same code as people who say they get a 200x speedup.
Without even looking at your code let me make some comments about your benchmarks. Let's ignore numpy and compare the maximum SP FLOPs/s and DP FLOPs/s of an Intel CPU versus Nvidia and AMD GPUs.
An Intel 2600K at 4 GHz can do 4 GHz * (8 AVX) * (2 ILP) * (4 cores) = 256 SP GFLOPs/s. For DP it's half: 128 DP GFLOPs/s. Haswell, which comes out in a few weeks, will double both of those. The Intel MKL library gets better than 80% efficiency in GEMM. My own GEMM code gets 70% on my i7-2700, so the 5 GFlops/s you quote for numpy is tiny and not a fair comparison.
I don't know what the GTX 275 is capable of but I would guess it's much more than 50 GFLOPs/s.
The article you reference compares an AMD 7970. They get 848 DP GFlops/s (90% efficiency) and 2646 SP GFlops/s (70% efficiency). That's closer to 10x the performance of the CPU, not 200x!
Edit:
Your calculation of FLOPs is wrong; it should be 2.0*n^3. That's still approximate, but it's asymptotically true. Let me explain.
Consider a 3D dot product: x1*x2 + y1*y2 + z1*z2. That's 3 multiplications and 2 additions. So an n-dimensional dot product is n multiplications and (n-1) additions. A matrix product is equivalent to n*n dot products, i.e. n*n*n multiplications and n*n*(n-1) additions. That's approximately 2.0*n^3 FLOPs. So you should double all your Gflops/s numbers.
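As a concrete check against the table above (my arithmetic, using the N = 1320 row): the reported 48.9 Gflops comes from N^3/time, and with the corrected 2*N^3 count it becomes

\[
2 \times 1320^{3} \approx 4.6 \times 10^{9}\ \text{FLOPs}, \qquad \frac{4.6 \times 10^{9}\ \text{FLOPs}}{0.047\ \text{s}} \approx 98\ \text{GFLOP/s}.
\]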
Edit:
You might want to consider the kernel time. It's been a while since I used OpenCL, but using the C++ bindings I did something like this:
queue = cl::CommandQueue(context, devices[device], CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
//other code...run kernel
time_end = clevent.getProfilingInfo<CL_PROFILING_COMMAND_END>();
time_start = clevent.getProfilingInfo<CL_PROFILING_COMMAND_START>();
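// Note (added, not in the original answer): both timestamps are cl_ulong values in
// nanoseconds, so the kernel execution time follows directly from their difference:
double kernel_ms = (time_end - time_start) * 1e-6;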
A good GPU matrix-multiply does not just use local memory; it stores blocks of A, B, and/or C in registers (which results in higher register usage and lower occupancy, but is much faster in the end). This is because GPUs have more registers than local memory (128-256KB vs 48KB for NVIDIA), and registers offer as much bandwidth as the ALUs can handle.

Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
    xs <- sequence (replicate n dist)
    return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that one function was being called repeatedly, and I found I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?
A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought the GC times down to almost (or actually) negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing in the grand scheme: AI.Util.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list and mapping (2^) over another list is inefficient; better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
  where
    go w assignment []     = return (assignment, w)
    go w assignment (v:vs) = if v `elem` vars
        then
          let w' = w * bnProb bn assignment (v, fixed %! v)
          in go w' assignment vs
        else do
          let p = bnProb bn assignment (v,True)
          x <- probabilityIO p
          go w (M.insert v x assignment) vs
    vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, so they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} to force updates to take effect immediately, and also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
    then
      let !w' = w * bnProb bn assignment (v, fixed %! v)
      in go w' assignment vs
    else do
      let !p = bnProb bn assignment (v,True)
      x <- probabilityIO p
      let !assignment' = M.insert v x assignment
      go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~13%), but the total memory usage and maximum residency haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m
On the last line: first, w is already evaluated in weightedSample now, so we don't need to seq it here, and the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the update function, so that builds thunks in the map's values. You could force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here. Since the key at which the map is updated is guaranteed to be present, we can use insertWith':
func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.
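On the randomIO/StdGen point above, here is a hedged sketch (not from the answer) of how an explicit generator from the mwc-random package could replace it; bernoulli is only a stand-in I made up for the repository's probabilityIO, not a tested drop-in.

import Control.Monad (replicateM)
import System.Random.MWC (GenIO, createSystemRandom, uniformR)

-- Stand-in for probabilityIO: True with probability p, using a fast MWC generator.
bernoulli :: GenIO -> Double -> IO Bool
bernoulli gen p = do
    u <- uniformR (0.0, 1.0) gen
    return (u < p)

main :: IO ()
main = do
    gen <- createSystemRandom          -- seeded once, then threaded through the sampler
    xs  <- replicateM 100000 (bernoulli gen 0.3)
    print (length (filter id xs))      -- roughly 30000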
I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much change in the running time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.
I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.
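To make that "same sort of thing" concrete for likelihoodWeighting, here is a hedged, self-contained sketch of the pattern; fakeSample and accumulate are stand-ins I invented (fakeSample plays the role of the repository's weightedSample), so this is an illustration of the fold-with-strict-accumulator idea, not a tested drop-in.

{-# LANGUAGE BangPatterns #-}
import Control.Monad (foldM)
import qualified Data.Map.Strict as M
import System.Random (randomRIO)

-- Stand-in for weightedSample bn fixed: returns a sampled value and its weight.
fakeSample :: IO (Bool, Double)
fakeSample = do
    x <- randomRIO (0, 1 :: Double)
    return (x < 0.5, x)

-- One pass over the sample count, accumulating into a strict map.
accumulate :: Int -> IO (M.Map Bool Double)
accumulate n = foldM step M.empty [1 .. n]
  where
    step !m _ = do
        (x, w) <- fakeSample
        -- Data.Map.Strict forces the summed weight, so no thunks pile up in the map.
        return (M.insertWith (+) x w m)

main :: IO ()
main = accumulate 100000 >>= print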

Calculating CPI Stall

In an example,
I don't really get how CPI stall is calculated here. I think CPI Stall = CPI Ideal + Memory Stall Cycles (at least this is what was given)?
From what I understand from the question: 2 = ideal CPI; 0.02 = L1 miss rate; 25 = miss penalty (but isn't this the miss penalty to the L2 cache?); 0.36 = the proportion of memory instructions (why isn't it 0.36 x 0.02 x 25 earlier?); 0.04 = ?? the 4% in brackets - what does that mean?; 0.005 = L2 miss rate.
I figured that the reason the 0.02 * 25 and 0.005 * 100 terms appear without the reads/writes factor is that the instruction cache is read on every instruction, so the factor is 1 * ..., and the 1 can be omitted.
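For reference, the general textbook formula (not a reconstruction of the missing example) accumulates memory stall cycles per access type:

\[
\text{CPI}_{\text{stall}} \;=\; \text{CPI}_{\text{ideal}} \;+\; \sum_{\text{access types}} \frac{\text{accesses}}{\text{instruction}} \times \text{miss rate} \times \text{miss penalty}
\]

Instruction fetches contribute one access per instruction, so their terms carry an implicit factor of 1, while data accesses carry the 0.36 loads/stores per instruction, which is consistent with the observation just above.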
