Can I do two sums in parallel with collection functions, in F#, or generally, optimize the whole thing - performance

I have the following code:
open System.Collections.Generic
// volume queue
let volumeQueue = Queue<float>()
let queueSize = 10 * 500 // 10 events per second, 500 seconds max
// add a signed volume to the queue
let addToVolumeQueue x =
    volumeQueue.Enqueue(x)
    while volumeQueue.Count > queueSize do volumeQueue.TryDequeue() |> ignore
// calculate the direction of the queue, normalized between +1 (buy) and -1 (sell)
let queueDirection length =
    let subQueue =
        volumeQueue
        |> Seq.skip (queueSize - length)
    let boughtVolume =
        subQueue
        |> Seq.filter (fun l -> l > 0.)
        |> Seq.sum
    let totalVolume =
        subQueue
        |> Seq.sumBy (fun l -> abs l)
    2. * boughtVolume / totalVolume - 1.
What this does is run a fixed length queue to which transaction volumes are added, some positive, some negative.
And then it calculates the cumulative ratio of positive over negative entries and normalizes it between +1 and -1, with 0 meaning the sums are half / half.
There is no optimization right now but this code's performance will matter. So I'd like to make it fast, without compromising readability (it's called roughly every 100ms).
The first thing that comes to mind is to do the two sums at once (the positive numbers and all the numbers) in a single loop. It can easily be done in a for loop, but can it be done with collection functions?
The next option I was thinking about is to get rid of the queue and use a circular buffer, but since the code is run on a part of the buffer (the last 'length' items), I'd have to handle the wrap around part; I guess I could extend the buffer to the size of a power of 2 and get automatic wrap around that way.
Any idea is welcome, but my original question is: can I do the two sums in a single pass with the collection functions? I can't index into the queue, so I can't use an indexed for loop (or I guess I'd have to instantiate an enumerator).

First of all, there is nothing inherently wrong with using mutable variables and loops in F#. Especially at a small scale (e.g. inside a function), this can often be quite readable - or at least, easy to understand if there is a suitable comment.
To do this using a single iteration, you could use fold. This basically calculates the two sums in a single iteration at the cost of some readability:
let queueDirectionFold length =
    let boughtVolume, totalVolume =
        volumeQueue
        |> Seq.skip (queueSize - length)
        |> Seq.fold (fun (bv, tv) v ->
            // add v to the bought volume only when it is positive
            (if v > 0.0 then bv + v else bv), tv + abs v) (0.0, 0.0)
    2. * boughtVolume / totalVolume - 1.
As I mentioned earlier, I would also consider using a loop. The loop itself is quite simple, but some complexity is added by the fact that you need to skip some elements. Still, I think it's quite clear:
let queueDirectionLoop length =
    let mutable i = 0
    let mutable boughtVolume = 0.
    let mutable totalVolume = 0.
    for v in volumeQueue do
        if i >= queueSize - length then
            totalVolume <- totalVolume + abs v
            if v > 0. then boughtVolume <- boughtVolume + v
        i <- i + 1
    2. * boughtVolume / totalVolume - 1.
I tested the performance using 4000 elements and here is what I got:
#time
let rnd = System.Random()
for i in 0 .. 4000 do volumeQueue.Enqueue(rnd.NextDouble())
for i in 0 .. 10000 do ignore(queueDirection 1000) // ~900 ms
for i in 0 .. 10000 do ignore(queueDirectionFold 1000) // ~460 ms
for i in 0 .. 10000 do ignore(queueDirectionLoop 1000) // ~370 ms
Iterating over the queue just once definitely helps with performance. Doing this in an imperative loop helps the performance even more - this may be worth it if you care about performance. The code may be a bit less readable than the original, but I think it's not much worse than fold.
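For comparison, here is a minimal sketch of the same single-pass fold in Python rather than F#, purely as an illustration. The names `queue_direction` and `step` are hypothetical. One assumption differs from the F# versions: the window is taken as the last `length` items actually present in the queue, instead of skipping `queueSize - length` items.

```python
from collections import deque
from functools import reduce
from itertools import islice


def queue_direction(volumes, length):
    """Direction of the last `length` volumes, between +1 (buy) and -1 (sell)."""
    # Assumption: skip relative to the current size, not a fixed queue size.
    start = max(len(volumes) - length, 0)

    def step(acc, v):
        bought, total = acc
        # One pass: accumulate the positive volume and the absolute volume together.
        return (bought + v if v > 0 else bought, total + abs(v))

    bought, total = reduce(step, islice(volumes, start, None), (0.0, 0.0))
    # Raises ZeroDivisionError when the window is empty or all zeros.
    return 2.0 * bought / total - 1.0


print(queue_direction(deque([5.0, -5.0, 3.0, -1.0]), 2))
```

The accumulator is a pair, exactly as in the fold above, so the two sums share one traversal.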

Related

Haskell: map length . group is way slower than explicit recursion?

Consider this trivial algorithm of prime-decomposition of an integer n: Let d' be the divisor of n last found. Initially, set d'=1. Find the smallest divisor d>d' of n, and find the maximal value e such that d^e divides n. Append d^e to the answer and repeat the procedure on n/d^e. Finally, stop when n becomes 1. For simplicity, let's ignore mathematical optimizations, like stopping at sqrt n etc.
I have implemented it in two ways. The first one generates a list of division "attempts", and then groups the successful ones by divisors. For example, for n=20, we first generate [(2,20),(2,10),(2,5),(3,5),(4,5),(5,5),(5,1)], which we then transform to the desired [(2,2),(5,1)] using group and other library functions.
The second implementation is an explicit recursion which keeps track of the exponent e along the way, appends d^e to the answer once the maximal e is reached, proceeds to finding the "next" d, and so on.
Question 1: Why does the first implementation run way slower than the second, despite the following:
Both the implementations execute div, the core step of the algorithm, roughly the same number of times.
Lazy evaluation (and fusion?) has the effect that the long list illustrated above never has to be materialized in the first place. As you can see in the code below, divTrials n, the list I am talking about, is transformed by a chain of higher order functions. In that, I think that the part map (\xs-> (head xs,length xs)) ... group should tell the compiler that the list is just intermediate:
{-# OPTIONS_GHC -O2 #-}
module GroupCheck where

import Data.List
import Data.Maybe
implement1 :: Integral t=> t -> [(t,Int)] -- IMPLEMENTATION 1
implement1 = map (\xs-> (head xs,length xs)).factorGroups where
  tryDiv (d,n)
    | n `mod` d == 0 = (d,n `div` d)
    | n == 1 = (1,1) -- hack
    | otherwise = (d+1,n)
  divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
  factorGroups = filter (not.null).map tail.group.map fst.divTrials
implement2 :: Show t => Integral t => t -> [(t,Int)] -- IMPLEMENTATION 2
implement2 num = keep2 $ tail $ go (1,0,1,num) where
  range d n = [d+1..n]
  nextd d n = fromMaybe n $ find ((0==).(n`mod`)) (range d n)
  update (d,e,de,n)
    | n `mod` d == 0 = update (d,e+1,de*d,n`div`d)
    | otherwise = (d,e,de,n)
  go (d,e,de,1) = [(d,e,de,1)]
  go (d,e,de,n) = (d,e,de,n) : go (update (nextd d n,0,1,n))
  keep2 = map (\(d,e,_,_)->(d,e))
main :: IO ()
main = do
  let n = 293872
  let ans1 = implement1 n
  let ans2 = implement2 n
  print ans1
  print ans2
Profiling tells us that tryDiv and divTrials together eat up >99% of the entire execution time:
> stack ghc -- -main-is GroupCheck.main -prof -fprof-auto -rtsopts GroupCheck
> ./GroupCheck +RTS -p >/dev/null && cat GroupCheck.prof
GroupCheck +RTS -p -RTS
total time = 18.34 secs (18338 ticks @ 1000 us, 1 processor)
total alloc = 17,561,404,568 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
implement1.divTrials GroupCheck GroupCheck.hs:12:3-69 52.6 69.2
implement1.tryDiv GroupCheck GroupCheck.hs:(8,3)-(11,25) 47.2 30.8
Question 1.5: So.. what's so bad about these functions? Also,
Question 2: In a more general case of having to aggregate contiguous blocks of identical elements from a nondecreasing sequence, should we go the bulky implement2 way if we want speed? (Again, ignoring domain-specific optimizations.)
Or did I totally miss something obvious? Thanks!
Just to establish a baseline, I ran your program on a slightly larger starting number (so that time didn't print out 0.00s). I chose n = 2938722345623 for no particular reason. Here's the timings before starting to tweak things:
ans1: indistinguishable from infinity (I finished writing this entire answer and it was still running, about 26 minutes in total)
ans2: 2.78s
The first thing to try is to tweak this line:
divTrials n = takeWhile (/=(1,1)) $ (2,n): map tryDiv (divTrials n)
This looks like a pretty natural definition, but it turns out that GHC never memoizes function calls. So if you want to make a list that's defined recursively in terms of itself, you must not make a function call in the recursion. Here's how:
divTrials n = xs where xs = takeWhile (/=(1,1)) $ (2,n): map tryDiv xs
Just that change brings the time down to 7.85s. Still off by a factor of about 3, but much much better.
The less obvious problem lies here:
factorGroups = filter (not.null).map tail.group.map fst.divTrials
Putting the group so early breaks fusion, causing that intermediate list to actually be materialized. This means allocating and deallocating a lot of cons cells and tuples. Here's an implementation that has the same spirit, but puts more work before the group:
tryDiv d n
  | n `mod` d == 0 = d : tryDiv d (n `div` d)
  | n == 1 = []
  | otherwise = tryDiv (d+1) n
factorGroups = group . tryDiv 2
With that, we are down to 2.65s -- slightly faster than ans2, though I only did one test of each so it's pretty likely to just be measurement noise.
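The shape of that last version (emit one divisor per successful division, then group the run of equal divisors) can be sketched in Python with a generator and `itertools.groupby`. This is only an illustration of the structure, not a performance claim about GHC. The names `prime_divisors` and `factor_groups` are hypothetical, and the sketch assumes n >= 1.

```python
from itertools import groupby


def prime_divisors(n):
    """Yield the prime divisors of n in nondecreasing order, with repetition."""
    d = 2
    while n != 1:
        if n % d == 0:
            yield d      # successful division: emit d and keep trying the same d
            n //= d
        else:
            d += 1       # failed division: move on to the next candidate


def factor_groups(n):
    """Collapse runs of equal divisors into (divisor, exponent) pairs."""
    return [(d, sum(1 for _ in run)) for d, run in groupby(prime_divisors(n))]


print(factor_groups(20))   # [(2, 2), (5, 1)]
```

Because the generator is consumed lazily, the full list of divisors is never stored before grouping.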

Implementing Memoization efficiently on nonintegral keys

I am new to Haskell and have been practicing by doing some simple programming challenges. The last 2 days, I've been trying to implement the unbounded knapsack problem here. The algorithm I'm using is described on the wikipedia page, though for this problem the word 'weight' is replaced with the word 'length'. Anyways, I started by writing the code without memoization:
maxValue :: [(Int,Int)] -> Int -> Int
maxValue [] len = 0
maxValue ((l, val): other) len =
    if l > len then
        skipValue
    else
        max skipValue takeValue
  where skipValue = maxValue other len
        takeValue = val + maxValue ([(l, val)] ++ other) (len - l)
I had hoped that haskell would be nice and have some nice syntax like #pragma memoize to help me, but looking around for examples, the solution was explained with this fibonacci problem code.
memoized_fib :: Int -> Integer
memoized_fib = (map fib [0 ..] !!)
  where fib 0 = 0
        fib 1 = 1
        fib n = memoized_fib (n-2) + memoized_fib (n-1)
After grasping the concept behind this example, I was very disappointed - the method used is super hacky and only works if 1) the input to the function is a single integer, and 2) the function needs to compute the values recursively in the order f(0), f(1), f(2), ... But what if my parameters are vectors or sets? And if I want to memoize a function like f(n) = f(n/2) + f(n/3), I need to compute the value of f(i) for all i less than n, when I don't need most of those values. (Others have pointed out this claim is false)
I tried implementing what I wanted by passing a memo table that we slowly fill up as an extra parameter:
maxValue :: (Map.Map (Int, Int) Int) -> [(Int,Int)] -> Int -> (Map.Map (Int, Int) Int, Int)
maxValue m [] len = (m, 0)
maxValue m ((l, val) : other) len =
    if l > len then
        (mapWithSkip, skipValue)
    else
        (mapUnion, max skipValue (takeValue+val))
  where (skipMap, skipValue) = maxValue m other len
        mapWithSkip = Map.insertWith' max (1 + length other, len) skipValue skipMap
        (takeMap, takeValue) = maxValue m ([(l, val)] ++ other) (len - l)
        mapWithTake = Map.insertWith' max (1 + length other, len) (takeValue+val) mapWithSkip
        mapUnion = Map.union mapWithSkip mapWithTake
But this is too slow, I believe because Map.union takes too long: it's O(n+m) rather than O(min(n,m)). Furthermore, this code seems quite messy for something as simple as memoization. For this specific problem, you might be able to get away with generalizing the hacky approach to 2 dimensions, and computing a bit extra, but I want to know how to do memoization in a more general sense. How can I implement memoization in this more general form while maintaining the same complexity as the code would have in imperative languages?
And if I want to memoize a function like f(n) = f(n/2) + f(n/3), I need to compute the value of f(i) for all i less than n, when I don't need most of those values.
No, laziness means that values that are not used never get computed. You allocate a thunk for them in case they are ever used, so it's a nonzero amount of CPU and RAM dedicated to this unused value, but e.g. evaluating f 6 never causes f 5 to be evaluated. So presuming that the expense of calculating an item is much higher than the expense of allocating a cons cell, and that you end up looking at a large percentage of the total possible values, the wasted work this method uses is small.
But what if my parameters are vectors or sets?
Use the same technique, but with a different data structure than a list. A map is the most general approach, provided that your keys are Ord and also that you can enumerate all the keys you will ever need to look up.
If you can't enumerate all the keys, or you plan to look up many fewer keys than the total number possible, then you can use State (or ST) to simulate the imperative process of sharing a writable memoization cache between invocations of your function.
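To make the sparse-key case concrete, here is a minimal Python sketch of the imperative process that a State or ST cache would simulate: a dictionary filled only with the keys that are actually reached. The function f(n) = f(n/2) + f(n/3) is the one from the question; the base case f(0) = 1 and the name `make_f` are assumptions added for illustration.

```python
def make_f():
    """Return a memoized f(n) = f(n // 2) + f(n // 3) and its cache."""
    cache = {0: 1}  # hypothetical base case, not given in the question

    def f(n):
        if n not in cache:
            # Only the keys reachable from n are ever computed and stored.
            cache[n] = f(n // 2) + f(n // 3)
        return cache[n]

    return f, cache


f, cache = make_f()
print(f(6), sorted(cache))   # 7 [0, 1, 2, 3, 6]
```

Evaluating `f(6)` touches 3, 2, 1 and 0 but never 4 or 5, which is the same point the answer makes about lazy evaluation.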
I would have liked to show you how this works, but I find your problem statement / links confusing. The exercise you link to does seem to be equivalent to the UKP in the Wikipedia article you link to, but I don't see anything in that article that looks like your implementation. The "Dynamic programming in-advance algorithm" Wikipedia gives is explicitly designed to have the exact same properties as the fib memoization example you gave. The key is a single Int, and the array is built from left to right: starting with len=0 as the base case, and basing all other computations on already-computed values. It also, for some reason I don't understand, seems to assume you will have at least 1 copy of each legal-sized object, rather than at least 0; but that is easily fixed if you have different constraints.
What you've implemented is totally different, starting from the total len, and choosing for each (length, value) step how many pieces of size length to cut up, then recursing with a smaller len and removing the front item from your list of weight-values. It's closer to the traditional "how many ways can you make change for an amount of currency given these denominations" problem. That, too, is amenable to the same left-to-right memoization approach as fib, but in two dimensions (one dimension for amount of currency to make change for, and another for number of denominations remaining to be used).
My go-to way to do memoization in Haskell is usually MemoTrie. It's pretty straightforward, it's pure, and it usually does what I'm looking for.
Without thinking too hard, you could produce:
import Data.MemoTrie (memo2)
maxValue :: [(Int,Int)] -> Int -> Int
maxValue = memo2 go
  where
    go [] len = 0
    go lst@((l, val):other) len =
      if l > len then skipValue else max skipValue takeValue
      where
        skipValue = maxValue other len
        takeValue = val + maxValue lst (len - l)
I don't have your inputs, so I don't know how fast this will go — it's a little strange to memoize the [(Int,Int)] input. I think you recognize this too because in your own attempt, you actually memoize over the length of the list, not the list itself. If you want to do that, it makes sense to convert your list to a constant-time-lookup array and then memoize. This is what I came up with:
import Data.MemoTrie (memo2)
import qualified GHC.Arr as Arr
maxValue :: [(Int,Int)] -> Int -> Int
maxValue lst = goMemo 0
  where
    values = Arr.listArray (0, length lst - 1) lst
    -- recursive calls go through the memoized wrapper so results are shared
    goMemo = memo2 go
    go i _ | i >= length lst = 0
    go i len = if l > len then skipValue else max skipValue takeValue
      where
        (l, val) = values Arr.! i
        skipValue = goMemo (i+1) len
        takeValue = val + goMemo i (len - l)
General, run-of-the-mill memoization in Haskell can be implemented the same way it is in other languages, by closing a memoized version of the function over a mutable map that caches the values. If you want the convenience of running the function as if it was pure, you'll need to maintain the state in IO and use unsafePerformIO.
The following memoizer will probably be sufficient for most code submission websites, as it depends only on System.IO.Unsafe, Data.IORef, and Data.Map.Strict, which should usually be available.
import qualified Data.Map.Strict as Map
import System.IO.Unsafe
import Data.IORef
memo :: (Ord k) => (k -> v) -> (k -> v)
memo f = unsafePerformIO $ do
  m <- newIORef Map.empty
  return $ \k -> unsafePerformIO $ do
    mv <- Map.lookup k <$> readIORef m
    case mv of
      Just v -> return v
      Nothing -> do
        let v = f k
        v `seq` modifyIORef' m $ Map.insert k v
        return v
From your question and comments, you seem to be the sort of person who's perpetually disappointed (!), so perhaps the use of unsafePerformIO will disappoint you, but if GHC actually provided a memoization pragma, this is probably what it would be doing under the hood.
For an example of straightforward use:
fib :: Int -> Int
fib = memo fib'
  where fib' 0 = 0
        fib' 1 = 1
        fib' n = fib (n-1) + fib (n-2)
main = do
  print $ fib 100000
or more to the point (SPOILERS?!), a version of your maxValue memoized in the length only:
maxValue :: [(Int,Int)] -> Int -> Int
maxValue values = go
  where go = memo (go' values)
        go' [] len = 0
        go' ((l, val): other) len =
          if l > len then
            skipValue
          else
            max skipValue takeValue
          where skipValue = go' other len
                takeValue = val + go (len - l)
This does a little more work than necessary, since the takeValue case re-evaluates the full set of marketable pieces, but it was fast enough to pass all the test cases on the linked web page. If it wasn't fast enough, then you'd need a memoizer that memoizes a function with results shared across calls with non-identical arguments (same length, but different marketable pieces, where you know the answer is going to be the same anyway because of special aspects of the problem and the order in which you check different marketable pieces and lengths). This would be a non-standard memoization, but it wouldn't be hard to modify the memo function to handle this case, I don't think, simply by splitting the argument up into a "key" argument and a "non-key" argument, or deriving the key from the argument via an arbitrary function supplied at memoization time.
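The "memoize on the length only" idea can also be sketched in Python with a closure over a dictionary. Note that this is a swapped-in formulation, not a translation of the Haskell above: instead of walking the piece list with skip/take, it tries every piece at each length. The name `make_max_value` is hypothetical, and the sketch assumes every piece length is a positive integer.

```python
def make_max_value(pieces):
    """pieces: list of (length, value) pairs; returns best(length) memoized on length."""
    cache = {}

    def best(length):
        if length not in cache:
            # Try each piece that still fits; an empty choice set is worth 0.
            cache[length] = max(
                (val + best(length - l) for l, val in pieces if l <= length),
                default=0,
            )
        return cache[length]

    return best


best = make_max_value([(1, 1), (2, 5), (3, 8), (4, 9)])
print(best(5))   # 13
```

The cache key is a single integer, so each length is solved once regardless of how it was reached. Very large lengths would run into Python's recursion limit, in which case a bottom-up loop over lengths is the usual alternative.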

Perform numpy.sum (or scipy.integrate.simps()) on large splitted array efficiently

Let's consider a very large numpy array a (M, N).
where M can typically be 1 or 100 and N 10-100,000,000
We have an array of indices that splits it into many (K = 1,000,000) sub-arrays along axis=1.
We want to efficiently perform an operation like integration along axis=1 (np.sum to take the simplest form) on each sub-array and return a (M, K) array.
An elegant and efficient solution was proposed by @Divakar in the question "how to split numpy array and perform certain actions on split arrays [Python]" (41920367), but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.
But in our case the sub-arrays don't have the same shape, which, so far has forced me to loop on the index... please take me out of my misery...
Example
a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))
The sizes of the sub-arrays are not homogeneous:
sizes = np.diff(ind)
print(sizes.min(), sizes.max())
2, 8732
So far, the best I found is:
output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
Possible feature request for numpy and scipy:
If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed-up the output.
Something like
output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)
would be very welcome !
(where x itself could be splitable, or not)
Side note:
While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (the sizes.min() is almost always zero).
This looks peculiar to me, as we are picking 10,000 integers between 10 and 9,000,000, the odds that the same number comes up twice (such that diff = 0) should be close to 0. It seems to be very close to 1.
Would that be due to the algorithm behind np.random.randint ?
What you want is np.add.reduceat
output = np.add.reduceat(a, ind, axis = 1)
output.shape
Out[]: (10, 1000000)
Universal Functions (ufunc) are a very powerful tool in numpy
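To see what `reduceat` computes, here is a pure-Python sketch of the documented 1-D rule for `np.add.reduceat`. It is a model for reading the semantics, not NumPy's implementation, and the name `add_reduceat` is hypothetical. Two details matter for this question: the result starts at the first index (so a leading 0 is needed to cover the first segment), and an index that is not smaller than the next one yields the single element at that index rather than an empty sum.

```python
def add_reduceat(values, indices):
    """Pure-Python model of np.add.reduceat on a 1-D sequence."""
    out = []
    for i, start in enumerate(indices):
        # The last segment runs to the end of the data.
        stop = indices[i + 1] if i + 1 < len(indices) else len(values)
        if start < stop:
            out.append(sum(values[start:stop]))
        else:
            # Repeated or decreasing index: just the element at `start`.
            out.append(values[start])
    return out


print(add_reduceat([1, 2, 3, 4, 5], [0, 2]))      # [3, 12]
print(add_reduceat([1, 2, 3, 4, 5], [1, 1, 3]))   # [2, 5, 9]
```

The second call shows why duplicate indices do not produce zeros the way an empty slice passed to `np.sum` would.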
As for the repeated indices, that's simply the Birthday Problem cropping up.
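The birthday-problem estimate is easy to check numerically. The sketch below computes the probability of at least one repeated value when drawing uniformly with replacement; the name `collision_probability` is hypothetical. It uses the 10,000 draws mentioned in the question text over the 8,999,990 integers in [10, 9000000) (the example code draws 1,000,000, which makes a repeat even more certain).

```python
import math


def collision_probability(draws, values):
    """P(at least one repeat) for `draws` uniform picks among `values` options."""
    if draws > values:
        return 1.0  # pigeonhole: a repeat is guaranteed
    # log P(no repeat) = sum over i of log(1 - i / values), for i = 0 .. draws - 1
    log_no_repeat = sum(math.log1p(-i / values) for i in range(draws))
    return -math.expm1(log_no_repeat)


print(collision_probability(23, 365))               # classic case, about 0.507
print(collision_probability(10_000, 8_999_990))     # above 0.99
```

Summing logarithms avoids underflow in the long product, and `expm1` keeps precision when the result is close to 0 or 1.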
Great !
Thanks ! on my VM Cent OS 6.9 I have the following results:
In [71]: a = np.random.random((10, 10000000))
In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))
In [73]: ind2 = np.append([0], ind)
In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
In [75]: out2 = np.add.reduceat(a, ind2, axis=1)
In [83]: np.allclose(out, out2)
Out[83]: True
In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's a good 93 % speed gain (or factor 15 faster) over the list concatenation :-)
Great !

How Random module get tested in OCaml?

OCaml has a Random module, and I am wondering how it tests itself for randomness. However, I don't have a clue what exactly the tests are doing. I understand that it runs a chi-square test plus two further dependency tests. Here is the code for the testing part:
chi-square test
(* Return the sum of the squares of v[i0,i1[ *)
let rec sumsq v i0 i1 =
if i0 >= i1 then 0.0
else if i1 = i0 + 1 then Pervasives.float v.(i0) *. Pervasives.float v.(i0)
else sumsq v i0 ((i0+i1)/2) +. sumsq v ((i0+i1)/2) i1
;;
let chisquare g n r =
if n <= 10 * r then invalid_arg "chisquare";
let f = Array.make r 0 in
for i = 1 to n do
let t = g r in
f.(t) <- f.(t) + 1
done;
let t = sumsq f 0 r
and r = Pervasives.float r
and n = Pervasives.float n in
let sr = 2.0 *. sqrt r in
(r -. sr, (r *. t /. n) -. n, r +. sr)
;;
Q1: Why do they write the sum of squares like that?
It seems to be just summing all the squares. Why not write it like this:
let rec sumsq v i0 i1 =
  if i0 >= i1 then 0.0
  else Pervasives.float v.(i0) *. Pervasives.float v.(i0) +. (sumsq v (i0+1) i1)
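Q2: Why do they seem to compute the chi-square statistic in a different way?
From the chi-squared test wiki, the formula is $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ are the observed counts and $E_i$ the expected counts.
But it seems they are using a different formula. What is going on behind the scenes?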
Other two dependencies tests
(* This is to test for linear dependencies between successive random numbers.
*)
let st = ref 0;;
let init_diff r = st := int r;;
let diff r =
let x1 = !st
and x2 = int r
in
st := x2;
if x1 >= x2 then
x1 - x2
else
r + x1 - x2
;;
let st1 = ref 0
and st2 = ref 0
;;
(* This is to test for quadratic dependencies between successive random
numbers.
*)
let init_diff2 r = st1 := int r; st2 := int r;;
let diff2 r =
let x1 = !st1
and x2 = !st2
and x3 = int r
in
st1 := x2;
st2 := x3;
(x3 - x2 - x2 + x1 + 2*r) mod r
;;
Q3: I don't really know these two tests. Can someone enlighten me?
Q1:
It's a question of memory usage. You will notice that for large arrays, your implementation of sumsq will fail with "Stack overflow during evaluation" (on my laptop, it fails for r = 200000). This is because before adding Pervasives.float v.(i0) *. Pervasives.float v.(i0) to (sumsq v (i0+1) i1), you have to compute the latter. So it's not until you have computed the result of the last call of sumsq that you can start "going up the stack" and adding everything. Clearly, sumsq is going to be called r times in your case, so you will have to keep track of r calls.
By contrast, with their approach they only have to keep track of log(r) calls, because once sumsq has been computed for half the array, you only need the result of the corresponding call (you can forget about all the other calls that you had to make to compute it).
However, there are other ways of achieving this result and I'm not sure why they chose this one (maybe somebody will be able to tell ?). If you want to know more on the problems linked to recursion and memory, you should probably check the wikipedia article on tail-recursion. If you want to know more on the technique that they used here, you should check the wikipedia article on divide and conquer algorithms -- be careful though, because here we are talking about memory and the Wikipedia article will probably talk a lot about temporal complexity (speed).
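The stack-depth argument can be tried out in Python, which has an explicit recursion limit. The sketch below transcribes both versions; the names `sumsq_halving` and `sumsq_linear` are hypothetical. The halving version needs only about log2(r) nested calls, while the linear version needs r of them and would exceed CPython's default recursion limit on a large array.

```python
def sumsq_halving(v, i0, i1):
    """Sum of squares of v[i0:i1], splitting the range in half at each step."""
    if i0 >= i1:
        return 0.0
    if i1 == i0 + 1:
        return float(v[i0]) * float(v[i0])
    mid = (i0 + i1) // 2
    # Each half is finished before the other starts, so depth stays logarithmic.
    return sumsq_halving(v, i0, mid) + sumsq_halving(v, mid, i1)


def sumsq_linear(v, i0, i1):
    """Same sum, one element per call: recursion depth grows with i1 - i0."""
    if i0 >= i1:
        return 0.0
    return float(v[i0]) * float(v[i0]) + sumsq_linear(v, i0 + 1, i1)


data = [1] * 200_000
print(sumsq_halving(data, 0, len(data)))   # 200000.0
```

Both functions agree on small inputs; only the halving version is safe to call on `data` here.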
Q2:
You should look more closely at both expressions. Here, all the E_i's are equal to n/r. If you replace this in the expression you gave, you will find the same expression that they use: (r *. t /. n) -. n. I didn't check the values of the bounds though, but since you have a Chi-squared distribution with parameter r-minus-one-or-two degrees of freedom, and r quite large, it's not surprising to see them use this kind of confidence interval. The Wikipedia article you mentioned should help you figure out what confidence interval they use exactly fairly easily.
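The substitution can be written out: with E_i = n/r, the sum of (O_i - n/r)^2 / (n/r) expands to (r/n) * sum(O_i^2) - 2 * sum(O_i) + n, and since the counts add up to n this is r*t/n - n, where t is the sum of squared counts. The Python sketch below checks the two forms numerically; the function names are hypothetical and the counts are made-up sample data.

```python
def chi_square_textbook(counts):
    """Sum of (O_i - E_i)^2 / E_i with all E_i equal to n / r."""
    n, r = sum(counts), len(counts)
    expected = n / r
    return sum((c - expected) ** 2 / expected for c in counts)


def chi_square_shortcut(counts):
    """The form used in the OCaml test: r * t / n - n, with t the sum of squares."""
    n, r = sum(counts), len(counts)
    t = sum(c * c for c in counts)
    return r * t / n - n


counts = [3, 5, 2]   # illustrative counts only: n = 10 draws into r = 3 cells
print(chi_square_textbook(counts), chi_square_shortcut(counts))
```

Both calls give 1.4 up to floating-point rounding for these counts.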
Good luck!
Edit: Oops, I forgot about Q3. I don't know these tests either, but I'm sure you should be able to find more about them by googling something like "linear dependency between consecutive numbers" or something. =)
Edit 2: In reply to Jackson Tale's June 29 question about the confidence interval:
They should indeed test it against the Chi-squared distribution -- or, rather, use the Chi-squared distribution to find a confidence interval. However, because of the central limit theorem, the Chi-squared distribution with k degrees of freedom converges to a normal law with mean k and variance 2k. A classical result is that the 95% confidence interval for the normal law is approximately [μ - 1.96 σ, μ + 1.96 σ], where μ is the mean and σ the standard deviation -- so that's roughly the mean ± twice the standard deviation. Here, the number of degrees of freedom is (I think) r - 1 ~ r (because r is large) so that's why I said I wasn't surprised by a confidence interval of the form [r - 2 sqrt(r), r + 2 sqrt(r)]. Nevertheless, now that I think about it I can't see why they don't use ± 2 sqrt(2 r)... But I might have missed something. And anyway, even if I was correct, since sqrt(2) > 1, they get a more stringent confidence interval, so I guess that's not really a problem. But they should document what they're doing a bit more... I mean, the tests that they're using are probably pretty standard so most likely most people reading their code will know what they're doing, but still...
Also, you should note that, as is often the case, this kind of test is not conclusive. Generally, you want to show that something has some kind of effect, so you formulate two hypotheses: the null hypothesis, "there is no effect", and the alternative hypothesis, "there is an effect". Then you show that data like yours would be very unlikely if the null hypothesis held, and you conclude that the alternative hypothesis is (most likely) true, i.e. that there is some kind of effect. This is conclusive. Here, what we would like to show is that the random number generator is good. So we don't want to show that the numbers it produces differ from some law, but that they conform to it. The only way to do that is to perform as many tests as possible showing that the numbers produced have the same properties as randomly generated ones. But the only conclusion we can draw is "we were not able to find a difference between the actual data and what we would have observed, had they really been randomly generated". This is not a lack of rigor on the part of the OCaml developers: people always do that (e.g. a lot of tests require, say, normality, so before performing them you look for a test that would show that your variable is not normally distributed, and when you can't find any, you say "Oh well, the normality of this variable is probably sufficient for my subsequent tests to hold"), simply because there is no other way to do it.
Anyway, I'm no statistician and the considerations above are simply my two cents, so you should be careful. For instance, I'm sure there is a better reason why they're using this particular confidence interval. I also think you should be able to figure it out if you write everything down carefully to make sure about what they're doing exactly.

F# Efficiently removing n items from the end of a Set

I know I can remove the last element from a set:
s.Remove(s.MaximumElement)
But if I want to remove the n maximum elements... do I just execute the above n times, or is there a faster way to do that?
To be clear, this is an obvious solution:
let rec removeLastN (s : Set<'a>, num : int) : Set<'a> =
    match num with
    | 0 -> s
    | _ -> removeLastN(s.Remove(s.MaximumElement), num-1)
But it involves creating a new set n times. Is there a way to do it and only create a new set once?
But it involves creating a new set n
times. Is there a way to do it and
only create a new set once?
To the best of my knowledge, no. I'd say what you have is a perfectly fine implementation: each removal runs in O(lg n), and it's concise too :) Most heap implementations give you O(lg n) for delete min anyway, so what you have is about as good as you can get it.
You might be able to get a little better speed by rolling your own balanced tree and implementing a function to drop a left or right branch for all values greater than a certain value. I don't think an AVL tree or RB tree is appropriate in this context, since you can't really maintain their invariants, but a randomized tree will give you the results you want.
A treap works awesome for this, because it uses randomization rather than tree invariants to keep itself relatively balanced. Unlike an AVL tree or a RB-tree, you can split a treap on a node without worrying about it being unbalanced. Here's a treap implementation I wrote a few months ago:
http://pastebin.com/j0aV3DJQ
I've added a split function, which allows you to take a tree and return two trees containing all values less than and all values greater than a given value. split runs in O(lg n) using a single pass through the tree, so you can prune entire branches of your tree in one shot -- provided that you know which value to split on.
But if I want to remove the n maximum
elements... do I just execute the
above n times, or is there a faster
way to do that?
Using my Treap class:
open Treap
let nthLargest n t = Seq.nth n (Treap.toSeqBack t)
let removeTopN n t =
    let largest = nthLargest n t
    let smallerValues, wasFound, largerValues = t.Split(largest)
    smallerValues
let e = Treap.empty(fun (x : int) (y : int) -> x.CompareTo(y))
let t = [1 .. 100] |> Seq.fold (fun (acc : Treap<_>) x -> acc.Insert(x)) e
let t' = removeTopN 10 t
removeTopN runs in O(n + lg m) time, where n is the index into the tree sequence and m is the number of items in the tree.
I make no guarantees about the accuracy of my code, use at your own peril ;)
In F#, you can use Set.partition or Set.filter to create sub sets:
let s = Set([1;4;6;9;100;77])
let a, b = Set.partition (fun x -> x <= 10) s
let smallThan10 = Set.filter (fun x -> x < 10) s
In your question, maybe you don't know the value of the ith number of your set, so here is a handy function for that:
let nth (n:int) (s:'a Set) =
    s |> Set.toSeq |> Seq.nth n
Now, we can write the remove-top-n function:
let removeTopN n (s:'a Set) =
    let size = s.Count
    let m = size - n
    let mvalue = nth m s
    Set.filter (fun x -> x < mvalue) s
and test it:
removeTopN 3 s
and we get:
val it : Set<int> = set [1; 4; 6]
Notice that the removeTopN does not work for a set containing multiple same values.
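The threshold-then-filter approach above can be sketched in Python for comparison. The name `remove_top_n` is hypothetical, and the handling of n <= 0 and n >= size is an assumption, since the F# version does not cover those cases.

```python
def remove_top_n(values, n):
    """Return a new set without the n largest elements of `values`."""
    ordered = sorted(values)
    if n <= 0:
        return set(ordered)          # nothing to remove
    if n >= len(ordered):
        return set()                 # everything is removed
    threshold = ordered[len(ordered) - n]   # smallest of the n largest elements
    # Keep only the elements strictly below the threshold.
    return {x for x in values if x < threshold}


print(remove_top_n({1, 4, 6, 9, 100, 77}, 3))   # the set {1, 4, 6}
```

Like the F# version, this builds the result set once, at the cost of one full pass to find the threshold and one to filter.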
That is already a pretty good solution. OCaml has a split function that can split a Set so you can find the right element then you can split the Set to remove a bunch of elements at a time. Alternatively, you can use Set.difference to extract another Set of elements.

Resources