Calculating CPI Stall - caching

In an example,
I don't really get how CPI stall is calculated here. I think CPI Stall = CPI Ideal + Memory Stall Cycles (At least this was given)?
From what I understand from the question: 2 = CPI Ideal. 0.02 = L1 miss rate. 25 = miss penalty. (but isnt this miss penalty for L2 cache?). .36 is num of memory instructions (why is it not .36 x .02 x 25 earlier?). .04 = ?? the 4% in braces? what does that mean? .005 = L2 miss rate.

I figured that the reason why 0.02 * 25 and 0.005 * 100 is without the reads/writes per prog is because the Instruction cache is always read, thus its 1 * ... where 1 can be omited

Related

Miss rate calculation

I have this problem:
A program that calculates the sum of 128x128 matrix of 32-bit integers (by rows). I have one-way cache that has 8 sets with block size of 64 bytes, considering only the access to the matrix not the instruction.
I should calculate its miss rate.
And also the miss rate by reading the matrix by column. Sorry if there are grammar mistakes, I only translated it to English.
What I've done so far is that (correct me if I'm wrong):
Integer size = 4B
64/4 = 16 (integers inside a block)
128/16 = 8 (blocks per row)
15 hit and 1 miss (each block)
120 hit and 8 miss (each row)
960 hit and 64 miss (all the matrix)
miss rate = 64/1024 = 0.06 = 6%

Calculating Cache miss rate - set associative

Piece of MIPS code:
.data
first-node : .word 10,next-node
next-node: .......
.text
la $s0,first-node
move $s1,$zero
loop:
la $t0,0($s0)
add $s1,$s1,$t0
lw $s0,4($s0)
bne $s0,$zero,loop
we got list of 4096 nodes,
Cache one-way
16 set
Block size: 128 byte
Virtual Memory with page size of 4KB
1) Find out the miss rate of the cache
The miss rate first 16 operation?
**The answer:**4/4+2 (4 because there are 4 instrations in the loop and 2 before loop) - 4/6 = 66 %
the miss rate 17th Iteration
The answer:
1/6 (1 because it loads a new tree (I think) and 6 is total instrations) = 17%
The miss rate 18th - 32th iteration?
The answer:
0/6 (I didn't understand why 0 if we have to load the next tree) = 0%
And the total miss rate?The answer:
4*16+1*15/16*16*6 = 5% (This part I coudn't understand)
2) How many page fault were executed?
The answer:
9 page fault
3) What is the miss rate in TLB?
The answer:
9/6*4096 0,03%
We answered all those questions in the class but, I didn't understand many of them can somebody please explain me?
Also What will be the difference between the one-way and 2-way? If it's bigger should be less miss rate,yes? the problem is how to calculate?

Optimizing a program and calculating % of total execution time improved

So I was told to ask this on here instead of StackExchage:
If I have a program P, which runs on a 2GHz machine M in 30seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and CPI of power is 12. If there are 10^9 such operations optimized, what is the percent of total execution time improved?
Here is what I've deduced so far.
For P, we have:
time (30s)
CPI: 12
Frequency (2GHz)
For P', we have:
CPI (6) [2*3]
Frequency (2GHz)
So I need to figure our how to calculate the time of P' in order to compare the times. But I have no idea how to achieve this. Could someone please help me out?
Program P, which runs on a 2GHz machine M in 30 seconds and is optimized by replacing all instances of 'raise to the power 4' with 3 instructions of multiplying x by. This optimized program will be P'. The CPI of multiplication is 2 and CPI of power is 12. If there are 10^9 such operations optimized,
From this information we can compute time needed to execute all POWER4 ("raise to the power 4) instructions, we have total count of such instructions (all POWER4 was replaced, count is 10^9 or 1 G). Every POWER4 instruction needs 12 clock cycles (CPI = clock per instruction), so all POWER4 were executed in 1G * 12 = 12G cycles.
2GHz machine has 2G cycles per second, and there are 30 seconds of execution. Total P program execution is 2G*30 = 60 G cycles (60 * 10^9). We can conclude that P program has some other instructions. We don't know what instructions, how many executions they have and there is no information about their mean CPI. But we know that time needed to execute other instructions is 60 G - 12 G = 48 G (total program running time minus POWER4 running time - true for simple processors). There is some X executed instructions with Y mean CPI, so X*Y = 48 G.
So, total cycles executed for the program P is
Freq * seconds = POWER4_count * POWER4_CPI + OTHER_count * OTHER_mean_CPI
2G * 30 = 1G * 12 + X*Y
Or total running time for P:
30s = (1G * 12 + X*Y) / 2GHz
what is the percent of total execution time improved?
After replacing 1G POWER4 operations with 3 times more MUL instructions (multiply by) we have 3G MUL operations, and cycles needed for them is now CPI * count, where MUL CPI is 2: 2*3G = 6G cycles. X*Y part of P' was unchanged, and we can solve the problem.
P' time in seconds = ( MUL_count * MUL_CPI + OTHER_count * OTHER_mean_CPI ) / Frequency
P' time = (3G*2 + X*Y) / 2GHz
Improvement is not so big as can be excepted, because POWER4 instructions in P takes only some part of running time: 12G/60G; and optimization converted 12G to 6G, without changing remaining 48 G cycles part. By halving only some part of time we get not half of time.

Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
xs <- sequence (replicate n dist)
return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?
A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought down the GC times significantly to almost or actually negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing, in the grand scheme, AI.UTIl.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list, and mapping (2^) over another list is inefficient, better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
go 1.0 (M.fromList fixed) (bnVars bn)
where
go w assignment [] = return (assignment, w)
go w assignment (v:vs) = if v `elem` vars
then
let w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let p = bnProb bn assignment (v,True)
x <- probabilityIO p
go w (M.insert v x assignment) vs
vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, thus they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force updates to take effect immediately, also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
then
let !w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let !p = bnProb bn assignment (v,True)
x <- probabilityIO p
let !assignment' = M.insert v x assignment
go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~%13%), but the total memory usage and maximum residence haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return $! x `seq` w `seq` M.adjust (+w) x m
In the last line, first, w is already evaluated in weightedSample now, so we don't need to seq it here, the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the updated function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present, insertWith':
func !m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.
I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.
I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.
I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.

Explain how cpu time was computed

Question: When a CPU can perform a multiplication in 12 nanoseconds (ns), an addition in 1 ns,
and a subtraction in 1.5 ns, which of the following is the minimum CPU time, in
nanoseconds, for the calculation of “a×a – b×b ” ?
Answer: 14.5
I believe it optimizes the equation to (a-b)*(a+b) so it's subtraction + addition + multiplication = 12 + 1 + 1.5 = 14.5.
Though my math isn't the best around so if I'm wrong just comment and don't downvote so I can delete :D

Resources