What does +-# after percent of cache misses mean in perf stat? - caching

I used the command perf stat --repeat 100 -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./avx2ADD and the output follows. What does the +- 8.93% for cache-misses mean when the percentage of cache misses is 4.010 %?
32,425 cache-references ( +- 0.54% )
1,300 cache-misses # 4.010 % of all cache refs ( +- 8.93% )
538,839 cycles ( +- 0.28% )
520,056 instructions # 0.97 insns per cycle ( +- 0.22% )
98,720 branches ( +- 0.20% )
95 faults ( +- 0.12% )
0 migrations ( +- 70.35% )

The +- 8.93% part is described in the manual page:
-r, --repeat=
repeat command and print average + stddev (max: 100). 0 means forever.
If you are not sure what the abbreviation stddev means, it is Standard Deviation (yes, the manual page could be more verbose). In short, it tells you how much the results differ across the repeated measurements. A smaller value is better, but with such a small workload (~500k instructions) the deviation will be larger, because cache misses can be non-deterministic.
The 4.010 % figure is the average ratio of cache-misses to cache-references over the repeated runs.
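As a toy illustration (not perf's actual code), here is roughly how a relative standard deviation like "+- 8.93%" can be derived from a set of repeated counter readings; the numbers below are made up:

from statistics import mean, stdev

# Hypothetical cache-miss counts from repeated runs of the same command.
cache_misses = [1210, 1305, 1280, 1420, 1190, 1395]
avg = mean(cache_misses)
rel_stddev = 100.0 * stdev(cache_misses) / avg
print(f"{avg:,.0f} cache-misses ( +- {rel_stddev:.2f}% )")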


Data structure to achieve random delete and insert where elements are weighted in [a,b]

I would like to design a data structure and algorithm such that, given an array of elements where each element has a weight in the interval [a,b], I can achieve constant-time insertion and deletion. The deletion is performed randomly, where the probability of an element being deleted is proportional to its weight.
I do not believe there is a deterministic algorithm that can achieve both operations in constant time, but I think there are randomized algorithms that should be able to accomplish this.
I don't know if O(1) worst-case time is impossible; I don't see any particular reason it should be. But it's definitely possible to have a simple data structure which achieves O(1) expected time.
The idea is to store a dynamic array of pairs (or two parallel arrays), where each item is paired with its weight; insertion is done by appending in O(1) amortised time, and an element can be removed by index by swapping it with the last element so that it can be removed from the end of the array in O(1) time. To sample a random element from the weighted distribution, choose a uniformly random index and generate a random number in the half-open interval [0, b) (here [0, 2), taking the upper bound of the weight range to be 2); if it is less than that element's weight, select the element at that index, otherwise repeat this process until an element is selected. The idea is that each index is equally likely to be chosen, and the probability it gets kept rather than rejected is proportional to its weight.
This is a Las Vegas algorithm, meaning it is expected to complete in a finite time, but with very low probability it can take arbitrarily long to complete. The number of iterations required to sample an element will be highest when every weight is exactly 1, in which case it follows a geometric distribution with parameter p = 1/2, so its expected value is 2, a constant which is independent of the number of elements in the data structure.
In general, if all weights are in an interval [a, b] for real numbers 0 < a <= b, then the expected number of iterations is at most b/a. This is always a constant, but it is potentially a large constant (i.e. it takes many iterations to select a single sample) if the lower bound a is small relative to b.
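A minimal sketch of this structure in Python, assuming the upper bound b of the weight range is known up front (the class name WeightedBag and its method names are mine, purely for illustration):

import random

class WeightedBag:
    # Dynamic array of (item, weight) pairs: O(1) insert, O(1) delete by index,
    # O(1) expected-time weighted sampling by rejection.
    def __init__(self, max_weight):
        self.max_weight = max_weight   # upper bound b of the weight range
        self.entries = []              # list of (item, weight) pairs

    def insert(self, item, weight):
        self.entries.append((item, weight))   # O(1) amortised
        return len(self.entries) - 1          # index of the new entry

    def delete(self, index):
        # Swap with the last entry, then pop from the end: O(1).
        self.entries[index] = self.entries[-1]
        self.entries.pop()

    def sample_index(self):
        # Each index is proposed with equal probability and accepted with
        # probability weight / max_weight, so the accepted index is weighted.
        while True:
            i = random.randrange(len(self.entries))
            if random.uniform(0, self.max_weight) < self.entries[i][1]:
                return i

Deleting a random element with probability proportional to its weight is then bag.delete(bag.sample_index()); the expected number of rejection rounds is at most b/a, as described above.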
This is not an answer per se, but just a tiny example to illustrate the algorithm devised by @kaya3.
| value | weight |
| v1 | 1.0 |
| v2 | 1.5 |
| v3 | 1.5 |
| v4 | 2.0 |
| v5 | 1.0 |
| total | 7.0 |
The total weight is 7.0. It's easy to maintain in O(1) by storing it and increasing/decreasing it at each insertion/removal.
The probability of each element is simply its weight divided by the total weight.
| value | proba | decimal
| v1 | 1.0/7 | 0.1428...
| v2 | 1.5/7 | 0.2142...
| v3 | 1.5/7 | 0.2142...
| v4 | 2.0/7 | 0.2857...
| v5 | 1.0/7 | 0.1428...
Using the algorithm of @kaya3, if we draw a random index, the probability of each index is 1/size (1/5 here).
The chance of being rejected is 50% for v1, 25% for v2 (and v3) and 0% for v4. So in the first round, the probabilities of being selected are:
| value | proba | decimal
| v1 | 2/20 | 0.10
| v2 | 3/20 | 0.15
| v3 | 3/20 | 0.15
| v4 | 4/20 | 0.20
| v5 | 2/20 | 0.10
| total | 14/20 | (70%)
Then the probability of needing a 2nd round is 30% (6/20), and the probability of reaching the 2nd round with a given index drawn is (6/20)/5 = 3/50:
| value | proba after 2 rounds | decimal
| v1 | 2/20 + 6/200 | 0.130
| v2 | 3/20 + 9/200 | 0.195
| v3 | 3/20 + 9/200 | 0.195
| v4 | 4/20 + 12/200 | 0.260
| v5 | 2/20 + 6/200 | 0.130
| total | 14/20 + 42/200 | (91%)
The probability of needing a 3rd round is 9% (0.3 x 0.3), which gives 9/100/5 = 9/500 for each index:
| value | proba after 3 rounds | decimal
| v1 | 2/20 + 6/200 + 18/2000 | 0.1390
| v2 | 3/20 + 9/200 + 27/2000 | 0.2085
| v3 | 3/20 + 9/200 + 27/2000 | 0.2085
| v4 | 4/20 + 12/200 + 36/2000 | 0.2780
| v5 | 2/20 + 6/200 + 18/2000 | 0.1390
| total | 14/20 + 42/200 + 126/2000 | (97.3%)
So we see that the series converges to the correct probabilities. The numerators are multiples of the weights, so it's clear that the relative weight of each element is respected.
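For anyone who wants to check the numbers, a quick Monte Carlo simulation of this example in Python (the acceptance test against [0, 2) is the same as in the description above):

import random
from collections import Counter

weights = {"v1": 1.0, "v2": 1.5, "v3": 1.5, "v4": 2.0, "v5": 1.0}
values = list(weights)

def sample():
    # Propose a uniformly random value, accept with probability weight / 2.
    while True:
        v = random.choice(values)
        if random.uniform(0, 2.0) < weights[v]:
            return v

counts = Counter(sample() for _ in range(100_000))
total = sum(weights.values())   # 7.0
for v in values:
    print(v, counts[v] / 100_000, "expected:", round(weights[v] / total, 4))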
This is a sketch of an answer.
With all weights equal to 1, we can maintain a random permutation of the inputs.
Each time an element is inserted, put it at the end of the array, then pick a random position i in the array, and swap the last element with the element at position i.
(It may well be a no-op if the random position turns out to be the last one.)
When deleting, just delete the last element.
Assuming we can use a dynamic array with O(1) (worst case or amortized) insertion and deletion, this does both insertion and deletion in O(1).
With weights of 1 and 2 only, a similar structure may be used.
Perhaps each element of weight 2 should be put in twice instead of once.
Perhaps when an element of weight 2 is deleted, its other copy should also be deleted.
So we should in fact store indices instead of the elements, together with another array, locations, which stores and tracks the two indices of each element. The swaps should keep this locations array up to date.
Deleting an arbitrary element can be done in O(1) similarly to inserting: swap with the last one, delete the last one.
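A rough Python sketch of this idea for weights restricted to {1, 2}, filling in the "perhaps" steps above (the names slots and locations are mine, and this is only an illustration of the sketch, not a polished implementation):

import random

class OneTwoBag:
    # 'slots' is kept as a uniformly random permutation in which an element of
    # weight 2 occupies two slots; 'locations' maps each element to the
    # positions of its copies so that swaps can be tracked.
    def __init__(self):
        self.slots = []        # one slot per unit of weight
        self.locations = {}    # element -> list of positions in slots
        self.weights = {}      # element -> 1 or 2

    def _swap(self, i, j):
        a, b = self.slots[i], self.slots[j]
        self.slots[i], self.slots[j] = b, a
        self.locations[a][self.locations[a].index(i)] = j
        self.locations[b][self.locations[b].index(j)] = i

    def insert(self, elem, weight):
        assert weight in (1, 2)
        self.weights[elem] = weight
        self.locations[elem] = []
        for _ in range(weight):
            self.slots.append(elem)
            self.locations[elem].append(len(self.slots) - 1)
            # Swap the new slot with a random slot (possibly itself) to keep
            # the permutation uniformly random.
            self._swap(len(self.slots) - 1, random.randrange(len(self.slots)))

    def _remove_slot(self, i):
        last = len(self.slots) - 1
        self._swap(i, last)
        elem = self.slots.pop()
        self.locations[elem].remove(last)

    def delete_weighted_random(self):
        # The last slot is uniformly random over all slots, so the element
        # there is chosen with probability proportional to its weight.
        elem = self.slots[-1]
        for _ in range(self.weights[elem]):
            self._remove_slot(self.locations[elem][-1])
        del self.locations[elem], self.weights[elem]
        return elem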

How to apply partial sort on a Spark DataFrame?

The following code:
val myDF = Seq(83, 90, 40, 94, 12, 70, 56, 70, 28, 91).toDF("number")
myDF.orderBy("number").limit(3).show
outputs:
+------+
|number|
+------+
| 12|
| 28|
| 40|
+------+
Does Spark's laziness in combination with the limit call and the implementation of orderBy automatically result in a partially sorted DataFrame, or are the remaining 7 numbers also sorted, even though it's not needed? And if so, is there a way to avoid this needless computational work?
Using .explain() shows that two sort stages are performed: first one on each partition, and then (with the top 3 from each) a global one. But it does not state whether these sorts are full or partial.
myDF.orderBy("number").limit(3).explain(true)
== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3416 ASC NULLS FIRST], true
+- Project [value#3414 AS number#3416]
+- LocalRelation [value#3414]
== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3416 ASC NULLS FIRST], true
+- Project [value#3414 AS number#3416]
+- LocalRelation [value#3414]
== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3416 ASC NULLS FIRST], true
+- LocalRelation [number#3416]
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3416 ASC NULLS FIRST], output=[number#3416])
+- LocalTableScan [number#3416]
If you explain() your dataframe, you'll find that Spark first does a "local" sort within each partition, then picks only the top three elements from each partition for a final global sort, before taking the overall top three.
scala> myDF.orderBy("number").limit(3).explain(true)
== Parsed Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3 ASC NULLS FIRST], true
+- Project [value#1 AS number#3]
+- LocalRelation [value#1]
== Analyzed Logical Plan ==
number: int
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3 ASC NULLS FIRST], true
+- Project [value#1 AS number#3]
+- LocalRelation [value#1]
== Optimized Logical Plan ==
GlobalLimit 3
+- LocalLimit 3
+- Sort [number#3 ASC NULLS FIRST], true
+- LocalRelation [number#3]
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[number#3 ASC NULLS FIRST], output=[number#3])
+- LocalTableScan [number#3]
I think it's best seen in the Optimized Logical Plan section, but the physical plan says the same thing.
myDF.orderBy("number").limit(3).show
myDF.limit(3).orderBy("number").show
1 => will do a full sort and then pick the first 3 elements.
2 => will return a dataframe with the first 3 elements and then sort them.
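If you want to compare the two variants yourself, here is a PySpark sketch (assuming a SparkSession is available; the Scala explain() output above shows the same thing for the first variant):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(n,) for n in [83, 90, 40, 94, 12, 70, 56, 70, 28, 91]],
                           ["number"])

# Variant 1: sort-then-limit; compare with the TakeOrderedAndProject plan above.
df.orderBy("number").limit(3).explain()

# Variant 2: limit-then-sort; only the 3 retained rows are sorted.
df.limit(3).orderBy("number").explain()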

Using list generator for memory efficient code in Haskell

I would like to get a handle on writing memory-efficient Haskell code. One thing I ran across is that there is no dead-easy way to make Python-style list generators/iterators (that I could find).
Small example:
Finding the sum of the integers from 1 to 100000000 without using the closed form formula.
In Python that can be done quickly and with minimal memory as sum(xrange(100000000)). In Haskell the analogue would be sum [1..100000000]. However this uses up a lot of memory. I thought using foldl or foldr would be fine, but even that uses a lot of memory and is slower than Python. Any suggestions?
TL;DR - I think the culprit in this case is GHC defaulting to Integer.
Admittedly I do not know enough about Python, but my first guess would be that Python switches to "bigint" only if necessary, so its calculations are effectively done with 64-bit integers (the equivalent of Haskell's Int on my machine).
A first check with
$> ghci
GHCi, version 7.10.3: http://www.haskell.org/ghc/ :? for help
Prelude> maxBound :: Int
9223372036854775807
reveals that the result of the sum (5000000050000000) is less than that number so we have no fear of Int overflow.
I have guessed your example programs to look roughly like this
sum.py
print(sum(xrange(100000000)))
sum.hs
main :: IO ()
main = print $ sum [1..100000000]
To make things explicit I added the type annotation (100000000 :: Integer), compiling it with
$ > stack build --ghc-options="-O2 -with-rtsopts=-sstderr"
and ran your example,
$ > stack exec -- time sum
5000000050000000
3,200,051,872 bytes allocated in the heap
208,896 bytes copied during GC
44,312 bytes maximum residency (2 sample(s))
21,224 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 6102 colls, 0 par 0.013s 0.012s 0.0000s 0.0000s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 1.725s ( 1.724s elapsed)
GC time 0.013s ( 0.012s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.739s ( 1.736s elapsed)
%GC time 0.7% (0.7% elapsed)
Alloc rate 1,855,603,449 bytes per MUT second
Productivity 99.3% of total user, 99.4% of total elapsed
1.72user 0.00system 0:01.73elapsed 99%CPU (0avgtext+0avgdata 4112maxresident)k
and indeed the ~3 GB figure is reproduced - note this is the total allocation in the heap over the whole run, not peak residency (maximum residency stays around 44 KB).
Changing the annotation to (100000000 :: Int) - altered the behaviour drastically
$ > stack build
$ > stack exec -- time sum
5000000050000000
51,872 bytes allocated in the heap
3,408 bytes copied during GC
44,312 bytes maximum residency (1 sample(s))
17,128 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 0 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.034s ( 0.034s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.036s ( 0.035s elapsed)
%GC time 0.2% (0.2% elapsed)
Alloc rate 1,514,680 bytes per MUT second
Productivity 99.4% of total user, 102.3% of total elapsed
0.03user 0.00system 0:00.03elapsed 91%CPU (0avgtext+0avgdata 3496maxresident)k
0inputs+0outputs (0major+176minor)pagefaults 0swaps
For the interested
The behaviour of the Haskell version does not change much if you use libraries like conduit or vector (both boxed and unboxed).
sample programs
sumC.hs
import Data.Conduit
import Data.Conduit.List as CL
main :: IO ()
main = do
  res <- CL.enumFromTo 1 100000000 $$ CL.fold (+) (0 :: Int)
  print res
sumV.hs
import Data.Vector.Unboxed as V
{-import Data.Vector as V-}
main :: IO ()
main = print $ V.sum $ V.enumFromTo (1::Int) 100000000
Funnily enough, the version using
main = print $ V.sum $ V.enumFromN (1::Int) 100000000
does worse than the above - even though the documentation says otherwise.
enumFromN :: (Unbox a, Num a) => a -> Int -> Vector a
O(n) Yield a vector of the given length containing the values x, x+1
etc. This operation is usually more efficient than enumFromTo.
Update
@Carsten's comment made me curious, so I had a look into the sources for Integer - well, integer-simple to be precise, because for Integer there also exist the integer-gmp and integer-gmp2 versions, which use libgmp.
data Integer = Positive !Positive | Negative !Positive | Naught
-------------------------------------------------------------------
-- The hard work is done on positive numbers
-- Least significant bit is first
-- Positive's have the property that they contain at least one Bit,
-- and their last Bit is One.
type Positive = Digits
type Positives = List Positive
data Digits = Some !Digit !Digits
| None
type Digit = Word#
data List a = Nil | Cons a (List a)
So when using Integer there is quite a bit of memory overhead compared to Int, or rather to the unboxed Int# that I would guess the Int version gets optimised to (though I have not confirmed that).
So an Integer is (if I calculate correctly):
1 x Word for the sum-type tag (here Positive)
d x (Word + Word) for the Some constructors and their Digit parts
1 x Word for the final None
where d is the number of machine-word digits - Digit is a Word#, so d = floor(log_{2^64}(n)) + 1, which is just 1 for the result 5000000050000000. That gives a memory overhead of roughly 2 + 2*d words for each Integer in that calculation, plus a bit more for the accumulator.

Load Balancing Sort Algorithm

I'm working on a program that sets the affinity for a process. I have pre-determined data that allowed me to calculate the rough percentage of a CPU (or core) that a process uses during each of the three stages of the program's life. Every process has these same three stages, and I have pre-determined data for each process in each of these three stages. I am trying to determine the best algorithm that can sort the processes. The kicker is I can't sort each stage individually. For process X, all three stages have to be taken into account when it is compared against process Y in the algorithm. As an example with some made-up data:
CPU's currently at the following loads:
CPU | Stage 1 | Stage 2 | Stage 3
---------------------------------
1 | 25% | 25% | 25%
2 | 50% | 50% | 50%
3 | 75% | 25% | 75%
4 | 50% | 25% | 10%
Process X was pre-determined to take up
10% in stage 1, 20% in stage 2, and 30% in stage 3.
What I have come up with so far is to add the pre-determined percent that process X takes up to each CPU, which would result in this:
CPU | Stage 1 | Stage 2 | Stage 3
---------------------------------
1 | 35% | 45% | 55%
2 | 60% | 70% | 80%
3 | 85% | 45% | 105%
4 | 60% | 45% | 40%
and rank each CPU's stage against the other (giving ties the same value), which would result in this:
CPU | Stage 1 | Stage 2 | Stage 3
---------------------------------
1 | Rank 1 | Rank 1 | Rank 2
2 | Rank 2 | Rank 2 | Rank 3
3 | Rank 3 | Rank 1 | Rank 4
4 | Rank 2 | Rank 1 | Rank 1
and then weight the rankings by how much the process uses at each stage, adding rank * weight across the stages to get an integer that determines which CPU assignment is best. In this example I would give stage 3 a weight of 3 because it is the highest-value stage for this process, stage 2 a weight of 2, and stage 1 a weight of 1, for the same reason as stage 3. This would result in:
CPU | Stage 1 | Stage 2 | Stage 3 | Sum
-----------------------------------------
1 | 1 | 2 | 6 | 9
2 | 2 | 4 | 9 | 15
3 | 3 | 2 | 12 | 17
4 | 2 | 2 | 3 | 7
Since CPU 4 has the lowest sum, it is the best candidate to assign Process X to. There are still a few kinks in this I believe, and I think there could be a much better way of doing it (which is why I am asking you!). I just thought I would explain what I have so far, to give you an idea of what I am working with.
Edit: I should add that you can't simply sum the stages for each CPU and then apply a sorting algorithm. Each stage must stay under 100%, and if you sum the stages you could inadvertently assign a process to a CPU that does not have room for it. E.g., assigning process Y with 90%/20%/30% was calculated (under the assumption of summing the stages) to go to CPU 1 with 20%/30%/40%. The sum of the stages for this CPU could be less than for any other CPU, but adding stage 1 of process Y (90%) to stage 1 of CPU 1 (20%) is greater than 100% and would result in an overrun.
Summing the stages should be avoided entirely, because it hides problems like this.
What I believe this really boils down to is: how do you sort data sets? Each CPU is essentially a data set (stage 1, stage 2, stage 3) that I need to sort in order to determine the process assignment.
Edit 2: I just ended up going with my description here.
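For reference, a rough Python sketch of the heuristic described above: the stage weighting and the under-100% feasibility check are taken from the question, everything else (names, tie handling) is illustrative. Note that with this check CPU 3 is filtered out rather than ranked, since its stage 3 would exceed 100%.

def best_cpu(cpu_loads, process_stages):
    # cpu_loads: per-CPU [stage1, stage2, stage3] loads in percent.
    # process_stages: the process's [stage1, stage2, stage3] usage in percent.
    # Returns the index of the chosen CPU, or None if no CPU has room.

    # Weight each stage by its rank within the process (heaviest stage counts most).
    order = sorted(range(3), key=lambda s: process_stages[s])
    stage_weight = {stage: rank + 1 for rank, stage in enumerate(order)}

    candidates = []
    for i, load in enumerate(cpu_loads):
        projected = [load[s] + process_stages[s] for s in range(3)]
        if any(p > 100 for p in projected):
            continue   # this CPU would overrun in some stage; skip it
        candidates.append((i, projected))
    if not candidates:
        return None

    # Rank the remaining CPUs per stage (ties share a rank), then sum rank * weight.
    scores = {i: 0 for i, _ in candidates}
    for s in range(3):
        rank_of = {v: r + 1 for r, v in enumerate(sorted({p[s] for _, p in candidates}))}
        for i, projected in candidates:
            scores[i] += rank_of[projected[s]] * stage_weight[s]
    return min(scores, key=scores.get)

cpus = [[25, 25, 25], [50, 50, 50], [75, 25, 75], [50, 25, 10]]
print(best_cpu(cpus, [10, 20, 30]))   # prints 3, i.e. CPU 4 in the tables above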
So you are saying you want to sort PROCESSES so that you can schedule as many of them as possible to run under the current load balance of the CPUs?
This is just like a 0-1 knapsack problem, except there are three dimensions (stages) instead of two (size, weight). I suppose the solutions for knapsack (dynamic programming or greedy) will also work for you.

Math algorithm question

I'm not sure if this can be done without some determining factor... but I wanted to see if someone knew of a way to do this.
I want to create a shifting scale for numbers.
Let's say I have the number 26000. I want the outcome of this algorithm to be 6500; or 25% of the original number. But if I have the number 5000, I want the outcome to be 2500; or 50% of the original number.
The percentages don't have to be exact, this is just an example.
I just want to have like a sine wave sort of thing. As the input number gets higher, the output number is a lower percentage of the input.
Does that make sense?
Plot some points in Excel and use the "show formula" option on the line.
Something like f(x) = x / log x (base-10 log here)?
x | f(x)
=======================
26000 | 5889 (22.6 %)
5000 | 1351 (27.2 %)
100000 | 20000 (20 %)
1000000 | 166666 (16.6 %)
Just a simple example. You can tweak it by playing with the base of the logarithm, by adding multiplicative constants on the numerator (x) or denominator (log x), by using square roots, squaring (or taking the root of) log x or x etc.
Here's what f(x) = 2*log(x)^2*sqrt(x) gives:
x | f(x)
=======================
26000 | 6285 (24 %)
5000 | 1934 (38 %)
500 | 325 (65 %)
100 | 80 (80 %)
1000000 | 72000 (7.2 %)
100000 | 15811 (15 %)
A suitable logarithmic scale might help.
It may be possible to define the function you want exactly if you specify a third transformation in addition to the two you've already mentioned. If you have some specific aim in mind, it's quite likely to fit a well-known mathematical definition which at least one poster could identify for you. It does sound as though you're talking about a logarithmic function. You'll have to be more specific about your requirements to define a useful algorithm however.
I'd suggest you play with the power-law family of functions, c*x^a == c * pow(x,a), where a is the power. If you want the output to be an exact fraction of the input, you would choose a=1 and it would just be a constant fraction. But you want the percentage to slowly decrease, so you should choose a<1. For example, we might choose a = 0.9 and c = 0.2 and get
1 0.2
10 1.59
100 12.6
1000 100.2
So it ranges from 20% at 1 to 10% at 1000. You can pick smaller a to make the fraction decrease more rapidly. (And you can scale everything to fit your range.)
In particular, if c*5000^a = 2500 and c*26000^a = 6500, then by dividing we get (5.2)^a = 2.6, which we can solve as a = log(2.6)/log(5.2) = 0.5796.... Then we plug back in to get c*139.25 = 2500, so c = 17.95...
Now the progression goes like so
1000 984
3000 1859
5000 2500
10000 3736
15000 4726
26000 6500
50000 9495
90000 13349
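A small Python sketch of this fit, if you want to try different anchor points:

import math

def fit_power_law(x1, y1, x2, y2):
    # Fit y = c * x**a through the two points (x1, y1) and (x2, y2).
    a = math.log(y2 / y1) / math.log(x2 / x1)
    c = y1 / x1 ** a
    return a, c

a, c = fit_power_law(5000, 2500, 26000, 6500)
print(f"a = {a:.4f}, c = {c:.3f}")
for x in [1000, 3000, 5000, 10000, 15000, 26000, 50000, 90000]:
    y = c * x ** a
    print(f"{x:>6}  {y:>7.0f}  ({100 * y / x:.1f} % of x)")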
This is somewhat similar to simple compression schemes used for analogue audio. See Wikipedia entry for Companding.
