Implementing Memoization efficiently on nonintegral keys - performance

I am new to Haskell and have been practicing by doing some simple programming challenges. For the last two days, I've been trying to implement the unbounded knapsack problem here. The algorithm I'm using is described on the Wikipedia page, though for this problem the word 'weight' is replaced with the word 'length'. Anyway, I started by writing the code without memoization:
maxValue :: [(Int,Int)] -> Int -> Int
maxValue [] len = 0
maxValue ((l, val) : other) len =
    if l > len then
        skipValue
    else
        max skipValue takeValue
  where skipValue = maxValue other len
        takeValue = val + maxValue ([(l, val)] ++ other) (len - l)
I had hoped that Haskell would be nice and have some nice syntax like #pragma memoize to help me, but looking around for examples, the solution was explained with this Fibonacci code:
memoized_fib :: Int -> Integer
memoized_fib = (map fib [0 ..] !!)
  where fib 0 = 0
        fib 1 = 1
        fib n = memoized_fib (n-2) + memoized_fib (n-1)
After grasping the concept behind this example, I was very disappointed - the method used is super hacky and only works if 1) the input to the function is a single integer, and 2) the function needs to compute the values recursively in the order f(0), f(1), f(2), ... But what if my parameters are vectors or sets? And if I want to memoize a function like f(n) = f(n/2) + f(n/3), I need to compute the value of f(i) for all i less than n, when I don't need most of those values. (Others have pointed out this claim is false)
I tried implementing what I wanted by passing a memo table that we slowly fill up as an extra parameter:
maxValue :: Map.Map (Int, Int) Int -> [(Int,Int)] -> Int -> (Map.Map (Int, Int) Int, Int)
maxValue m [] len = (m, 0)
maxValue m ((l, val) : other) len =
    if l > len then
        (mapWithSkip, skipValue)
    else
        (mapUnion, max skipValue (takeValue + val))
  where (skipMap, skipValue) = maxValue m other len
        mapWithSkip = Map.insertWith' max (1 + length other, len) skipValue skipMap
        (takeMap, takeValue) = maxValue m ([(l, val)] ++ other) (len - l)
        mapWithTake = Map.insertWith' max (1 + length other, len) (takeValue + val) mapWithSkip
        mapUnion = Map.union mapWithSkip mapWithTake
But this is too slow, I believe because Map.union takes too long; it's O(n+m) rather than O(min(n,m)). Furthermore, this code seems quite messy for something as simple as memoization. For this specific problem, you might be able to get away with generalizing the hacky approach to 2 dimensions, and computing a bit extra, but I want to know how to do memoization in a more general sense. How can I implement memoization in this more general form while maintaining the same complexity as the code would have in imperative languages?

And if I want to memoize a function like f(n) = f(n/2) + f(n/3), I need to compute the value of f(i) for all i less than n, when I don't need most of those values.
No, laziness means that values that are not used never get computed. You allocate a thunk for them in case they are ever used, so it's a nonzero amount of CPU and RAM dedicated to this unused value, but e.g. evaluating f 6 never causes f 5 to be evaluated. So presuming that the expense of calculating an item is much higher than the expense of allocating a cons cell, and that you end up looking at a large percentage of the total possible values, the wasted work this method uses is small.
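For instance, here is a sketch of the list method applied to exactly that shape of recursion (g is a made-up function for illustration):
g :: Int -> Integer
g = (map go [0 ..] !!)
  where go 0 = 1
        go n = g (n `div` 2) + g (n `div` 3)
Evaluating g 6 forces g 3, g 2, g 1, and g 0; the thunks for g 4 and g 5 are allocated but never evaluated.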
But what if my parameters are vectors or sets?
Use the same technique, but with a different data structure than a list. A map is the most general approach, provided that your keys are Ord and also that you can enumerate all the keys you will ever need to look up.
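As a sketch (memoMap is an illustrative name, not a library function), the same lazy trick over a Map looks like this, assuming you can list every key up front:
import qualified Data.Map as Map

memoMap :: Ord k => [k] -> (k -> v) -> (k -> v)
memoMap keys f = (table Map.!)
  where table = Map.fromList [ (k, f k) | k <- keys ]
Data.Map is lazy in its values by default, so each f k stays an unevaluated thunk until the first lookup demands it.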
If you can't enumerate all the keys, or you plan to look up many fewer keys than the total number possible, then you can use State (or ST) to simulate the imperative process of sharing a writable memoization cache between invocations of your function.
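Here is a minimal sketch of that (memoSt and fibM are made-up names): the cache is an ordinary Map threaded through State, so only the keys you actually visit are ever computed or stored.
import Control.Monad.State
import qualified Data.Map as Map

memoSt :: Ord k => (k -> State (Map.Map k v) v) -> k -> State (Map.Map k v) v
memoSt f k = do
    cached <- gets (Map.lookup k)
    case cached of
        Just v  -> return v
        Nothing -> do
            v <- f k
            modify (Map.insert k v)
            return v

fibM :: Int -> State (Map.Map Int Integer) Integer
fibM = memoSt go
  where go n | n < 2     = return (fromIntegral n)
             | otherwise = (+) <$> fibM (n - 1) <*> fibM (n - 2)
Run it with evalState (fibM 80) Map.empty; the same cache is shared across all the recursive calls.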
I would have liked to show you how this works, but I find your problem statement / links confusing. The exercise you link to does seem to be equivalent to the UKP in the Wikipedia article you link to, but I don't see anything in that article that looks like your implementation. The "Dynamic programming in-advance algorithm" Wikipedia gives is explicitly designed to have the exact same properties as the fib memoization example you gave. The key is a single Int, and the array is built from left to right: starting with len=0 as the base case, and basing all other computations on already-computed values. It also, for some reason I don't understand, seems to assume you will have at least 1 copy of each legal-sized object, rather than at least 0; but that is easily fixed if you have different constraints.
What you've implemented is totally different, starting from the total len, and choosing for each (length, value) step how many pieces of size length to cut up, then recursing with a smaller len and removing the front item from your list of weight-values. It's closer to the traditional "how many ways can you make change for an amount of currency given these denominations" problem. That, too, is amenable to the same left-to-right memoization approach as fib, but in two dimensions (one dimension for amount of currency to make change for, and another for number of denominations remaining to be used).
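To make that concrete, here is a sketch of the fib-style approach in two dimensions (my own illustrative code; it uses lazy lists for brevity, so lookups are linear, and you'd want an array in practice):
maxValueTable :: [(Int, Int)] -> Int -> Int
maxValueTable pieces maxLen = get 0 maxLen
  where
    n = length pieces
    -- row i, column len = best value using pieces from index i onward
    -- with len remaining; lazy lists mean each cell is computed at most once
    table = [ [ cell i len | len <- [0 .. maxLen] ] | i <- [0 .. n] ]
    get i len = table !! i !! len
    cell i _ | i == n = 0
    cell i len
        | l > len   = skip
        | otherwise = max skip (val + get i (len - l))
      where
        (l, val) = pieces !! i
        skip = get (i + 1) len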

My go-to way to do memoization in Haskell is usually MemoTrie. It's pretty straightforward, it's pure, and it usually does what I'm looking for.
Without thinking too hard, you could produce:
import Data.MemoTrie (memo2)

maxValue :: [(Int,Int)] -> Int -> Int
maxValue = memo2 go
  where
    go [] len = 0
    go lst@((l, val):other) len =
        if l > len then skipValue else max skipValue takeValue
      where
        skipValue = maxValue other len
        takeValue = val + maxValue lst (len - l)
I don't have your inputs, so I don't know how fast this will go — it's a little strange to memoize the [(Int,Int)] input. I think you recognize this too because in your own attempt, you actually memoize over the length of the list, not the list itself. If you want to do that, it makes sense to convert your list to a constant-time-lookup array and then memoize. This is what I came up with:
import qualified GHC.Arr as Arr

maxValue :: [(Int,Int)] -> Int -> Int
maxValue lst = go 0
  where
    values = Arr.listArray (0, length lst - 1) lst
    -- recursive calls must go through the memoized go, not go',
    -- so that intermediate results are actually shared
    go = memo2 go'
    go' i _ | i >= length lst = 0
    go' i len = if l > len then skipValue else max skipValue takeValue
      where
        (l, val) = values Arr.! i
        skipValue = go (i+1) len
        takeValue = val + go i (len - l)

General, run-of-the-mill memoization in Haskell can be implemented the same way it is in other languages, by closing a memoized version of the function over a mutable map that caches the values. If you want the convenience of running the function as if it was pure, you'll need to maintain the state in IO and use unsafePerformIO.
The following memoizer will probably be sufficient for most code submission websites, as it depends only on System.IO.Unsafe, Data.IORef, and Data.Map.Strict, which should usually be available.
import qualified Data.Map.Strict as Map
import System.IO.Unsafe
import Data.IORef

memo :: (Ord k) => (k -> v) -> (k -> v)
memo f = unsafePerformIO $ do
    m <- newIORef Map.empty
    return $ \k -> unsafePerformIO $ do
        mv <- Map.lookup k <$> readIORef m
        case mv of
            Just v -> return v
            Nothing -> do
                let v = f k
                v `seq` modifyIORef' m $ Map.insert k v
                return v
From your question and comments, you seem to be the sort of person who's perpetually disappointed (!), so perhaps the use of unsafePerformIO will disappoint you, but if GHC actually provided a memoization pragma, this is probably what it would be doing under the hood.
For an example of straightforward use:
fib :: Int -> Int
fib = memo fib'
  where fib' 0 = 0
        fib' 1 = 1
        fib' n = fib (n-1) + fib (n-2)

main = do
    print $ fib 100000
or more to the point (SPOILERS?!), a version of your maxValue memoized in the length only:
maxValue :: [(Int,Int)] -> Int -> Int
maxValue values = go
  where go = memo (go' values)
        go' [] len = 0
        go' ((l, val) : other) len =
            if l > len then
                skipValue
            else
                max skipValue takeValue
          where skipValue = go' other len
                takeValue = val + go (len - l)
This does a little more work than necessary, since the takeValue case re-evaluates the full set of marketable pieces, but it was fast enough to pass all the test cases on the linked web page. If it wasn't fast enough, then you'd need a memoizer that memoizes a function with results shared across calls with non-identical arguments (same length, but different marketable pieces, where you know the answer is going to be the same anyway because of special aspects of the problem and the order in which you check different marketable pieces and lengths). This would be a non-standard memoization, but it wouldn't be hard to modify the memo function to handle this case, I don't think, simply by splitting the argument up into a "key" argument and a "non-key" argument, or deriving the key from the argument via an arbitrary function supplied at memoization time.
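For instance, here is a sketch of that last variant, reusing the imports from memo above (memoOn is a made-up name; the caller promises that arguments with equal keys give equal results):
memoOn :: (Ord k) => (a -> k) -> (a -> v) -> (a -> v)
memoOn key f = unsafePerformIO $ do
    m <- newIORef Map.empty
    return $ \x -> unsafePerformIO $ do
        -- cache on the derived key, not on the full argument
        mv <- Map.lookup (key x) <$> readIORef m
        case mv of
            Just v -> return v
            Nothing -> do
                let v = f x
                v `seq` modifyIORef' m $ Map.insert (key x) v
                return v
For the maxValue above, you could memoize uncurry go' with the key \(vs, len) -> (length vs, len).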

Related

Can I do two sums in parallel with collection functions, in F#, or generally, optimize the whole thing

I have the following code:
open System.Collections.Generic

// volume queue
let volumeQueue = Queue<float>()
let queueSize = 10 * 500 // 10 events per second, 500 seconds max

// add a signed volume to the queue
let addToVolumeQueue x =
    volumeQueue.Enqueue(x)
    while volumeQueue.Count > queueSize do volumeQueue.TryDequeue() |> ignore

// calculate the direction of the queue, normalized between +1 (buy) and -1 (sell)
let queueDirection length =
    let subQueue =
        volumeQueue
        |> Seq.skip (queueSize - length)
    let boughtVolume =
        subQueue
        |> Seq.filter (fun l -> l > 0.)
        |> Seq.sum
    let totalVolume =
        subQueue
        |> Seq.sumBy (fun l -> abs l)
    2. * boughtVolume / totalVolume - 1.
What this does is run a fixed length queue to which transaction volumes are added, some positive, some negative.
And then it calculates the cumulative ratio of positive over negative entries and normalizes it between +1 and -1, with 0 meaning the sums are half / half.
There is no optimization right now but this code's performance will matter. So I'd like to make it fast, without compromising readability (it's called roughly every 100ms).
The first thing that comes to mind is to do the two sums at once (the positive numbers and all the numbers) in a single loop. It can easily be done in a for loop, but can it be done with collection functions?
The next option I was thinking about is to get rid of the queue and use a circular buffer, but since the code is run on a part of the buffer (the last 'length' items), I'd have to handle the wrap around part; I guess I could extend the buffer to the size of a power of 2 and get automatic wrap around that way.
Any idea is welcome, but my first original question is: can I do the two sums in a single pass with the collection functions? I can't iterate over the queue with an indexer, so I can't use a plain for loop (or I guess I'd have to instantiate an iterator).
First of all, there is nothing inherently wrong with using mutable variables and loops in F#. Especially at a small scale (e.g. inside a function), this can often be quite readable - or at least, easy to understand if there is a suitable comment.
To do this using a single iteration, you could use fold. This basically calculates the two sums in a single iteration at the cost of some readability:
let queueDirectionFold length =
    let boughtVolume, totalVolume =
        volumeQueue
        |> Seq.skip (queueSize - length)
        |> Seq.fold (fun (bv, tv) v ->
            (if v > 0.0 then bv + v else bv), tv + abs v) (0.0, 0.0)
    2. * boughtVolume / totalVolume - 1.
As I mentioned earlier, I would also consider using a loop. The loop itself is quite simple, but some complexity is added by the fact that you need to skip some elements. Still, I think it's quite clear:
let queueDirectionLoop length =
    let mutable i = 0
    let mutable boughtVolume = 0.
    let mutable totalVolume = 0.
    for v in volumeQueue do
        if i >= queueSize - length then
            totalVolume <- totalVolume + abs v
            if v > 0. then boughtVolume <- boughtVolume + v
        i <- i + 1
    2. * boughtVolume / totalVolume - 1.
I tested the performance using 4000 elements and here is what I got:
#time
let rnd = System.Random()
for i in 0 .. 4000 do volumeQueue.Enqueue(rnd.NextDouble())
for i in 0 .. 10000 do ignore(queueDirection 1000) // ~900 ms
for i in 0 .. 10000 do ignore(queueDirectionFold 1000) // ~460 ms
for i in 0 .. 10000 do ignore(queueDirectionLoop 1000) // ~370 ms
Iterating over the queue just once definitely helps with performance. Doing this in an imperative loop helps the performance even more - this may be worth it if you care about performance. The code may be a bit less readable than the original, but I think it's not much worse than fold.

Can I check whether a bounded list contains duplicates, in linear time?

Suppose I have an Int list where elements are known to be bounded and the list is known to be no longer than their range, so that it is entirely possible for it not to contain duplicates. How can I test most quickly whether it is the case?
I know of nubOrd. It is quite fast. We can pass our list through and see if it becomes shorter. But the efficiency of nubOrd is still not linear.
My idea is that we can trade space for time efficiency. Imperatively, we would allocate a bit field as wide as our range, and then traverse the list, marking the entries corresponding to the list elements' values. As soon as we try to flip a bit that is already 1, we return False. It only takes (read + compare + write) * length of the list. No binary search trees, no nothing.
Is it reasonable to attempt a similar construction in Haskell?
The discrimination package has a linear time nub you can use. Or a linear time group that doesn't require the equivalent elements to be adjacent in order to group them, so you could see if any of the groups are not size 1.
The whole package is based on sidestepping the well known bounds on comparison-based sorts (and joins, and etc) by using algorithms based on "discrimination" rather than ones based on comparisons. As I understand it, the technique is somewhat like a radix sort, but generalised to ADTs.
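If I remember the package's API correctly (hedged: double-check the docs; Data.Discrimination exports a Grouping-based nub, and Grouping has an Int instance), the duplicate test is a two-liner:
import Data.Discrimination (nub)

hasDuplicates :: [Int] -> Bool
hasDuplicates xs = length (nub xs) /= length xs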
For integers (and other Ix-like types), you could use a mutable array, for example with the array package.
We can use an STUArray here, like:
import Control.Monad.ST
import Data.Array.ST

updateDups_ :: [Int] -> STUArray s Int Bool -> ST s Bool
updateDups_ [] _ = return False
updateDups_ (x:xs) arr = do
    contains <- readArray arr x
    if contains
        then return True
        else writeArray arr x True >> updateDups_ xs arr

withDups_ :: Int -> [Int] -> ST s Bool
withDups_ mx l = newArray (0, mx) False >>= updateDups_ l

withDups :: Int -> [Int] -> Bool
withDups mx ls = runST (withDups_ mx ls)
For example:
Prelude Control.Monad.ST Data.Array.ST> withDups 17 [1,4,2,5]
False
Prelude Control.Monad.ST Data.Array.ST> withDups 17 [1,4,2,1]
True
Prelude Control.Monad.ST Data.Array.ST> withDups 17 [1,4,2,16,2]
True
So here the first parameter is the maximum value that can be added in the list, and the second parameter the list of values we want to check.
So you have a list of size N, and you know that the elements in the list are within the range min .. min+N-1.
There is a simple linear time algorithm that requires O(1) space.
First, scan the list to find the minimum and maximum elements.
If (max - min + 1) < N then you know there's a duplicate. Otherwise ...
Because the range is N, the minimum item can go at a[0], and the max item at a[n-1]. You can map any item to its position in the array simply by subtracting min. You can do an in-place sort in O(n) because you know exactly where every item should go.
Starting at the beginning of the list, take the first element and subtract min to determine where it should go. Go to that position, and replace the item that's there. With the new item, compute where it should go, and replace the item in that position, etc.
If you ever get to a point where you're trying to place an item at a[x], and the value already there is the value that's supposed to be there (i.e. a[x] == x+min), then you've found a duplicate.
The code to do all this is pretty simple:
min, max = findMinMax()
currentIndex = 0
while currentIndex < N
    temp = a[currentIndex]
    targetIndex = temp - min
    // Do this until we wrap around to the current index.
    // If the item is already in place, then targetIndex == currentIndex,
    // and we won't enter the loop.
    while targetIndex != currentIndex
        if (a[targetIndex] == temp)
            // The item at a[targetIndex] is the item that's supposed to be there.
            // The only way that can happen is if the item we have in temp is a duplicate.
            found a duplicate
        end if
        save = a[targetIndex]
        a[targetIndex] = temp
        temp = save
        targetIndex = temp - min
    end while
    // At this point, targetIndex == currentIndex.
    // We've wrapped around and need to place the last item.
    // There's no need to check here if a[targetIndex] == temp, because if it did,
    // we would not have entered the loop.
    a[targetIndex] = temp
    ++currentIndex
end while
That's the basic idea.

Proving that there are no overlapping sub-problems?

I just got the following interview question:
Given a list of float numbers, insert “+”, “-”, “*” or “/” between each consecutive pair of numbers to find the maximum value you can get. For simplicity, assume that all operators are of equal precedence order and evaluation happens from left to right.
Example:
(1, 12, 3) -> 1 + 12 * 3 = 39
If we built a recursive solution, we would find that we would get an O(4^N) solution. I tried to find overlapping sub-problems (to increase the efficiency of this algorithm) and wasn't able to find any overlapping problems. The interviewer then told me that there wasn't any overlapping subsolutions.
How can we detect when there are overlapping solutions and when there isn't? I spent a lot of time trying to "force" subsolutions to appear and eventually the Interviewer told me that there wasn't any.
My current solution looks as follows:
def maximumNumber(array, current_value=None):
    if current_value is None:
        current_value = array[0]
        array = array[1:]

    if len(array) == 0:
        return current_value

    return max(
        maximumNumber(array[1:], current_value * array[0]),
        maximumNumber(array[1:], current_value - array[0]),
        maximumNumber(array[1:], current_value / array[0]),
        maximumNumber(array[1:], current_value + array[0])
    )
Looking for "overlapping subproblems" sounds like you're trying to do bottom up dynamic programming. Don't bother with that in an interview. Write the obvious recursive solution. Then memoize. That's the top down approach. It is a lot easier to get working.
You may get challenged on that. Here was my response the last time that I was asked about that.
There are two approaches to dynamic programming, top down and bottom up. The bottom up approach usually uses less memory but is harder to write. Therefore I do the top down recursive/memoize and only go for the bottom up approach if I need the last ounce of performance.
It is a perfectly true answer, and I got hired.
Now you may notice that tutorials about dynamic programming spend more time on bottom up. They often even skip the top down approach. They do that because bottom up is harder. You have to think differently. It does provide more efficient algorithms because you can throw away parts of that data structure that you know you won't use again.
Coming up with a working solution in an interview is hard enough already. Don't make it harder on yourself than you need to.
EDIT Here is the DP solution that the interviewer thought didn't exist.
def find_best(floats):
    current_answers = {floats[0]: ()}
    floats = floats[1:]
    for f in floats:
        next_answers = {}
        for v, path in current_answers.iteritems():
            next_answers[v + f] = (path, '+')
            next_answers[v * f] = (path, '*')
            next_answers[v - f] = (path, '-')
            if 0 != f:
                next_answers[v / f] = (path, '/')
        current_answers = next_answers
    best_val = max(current_answers.keys())
    return (best_val, current_answers[best_val])
Generally, the overlapping subproblem approach breaks the problem down into smaller subproblems whose solutions, when combined, solve the big problem. When these subproblems exhibit optimal substructure, DP is a good way to solve it.
The decision about what you do with a new number that you encounter has little to do with the numbers you have already processed. Other than accounting for signs, of course.
So I would say this has overlapping subproblems but is not a dynamic programming problem. You could use divide and conquer or even more straightforward recursive methods.
Initially let's forget about negative floats.
process each new float according to the following rules
If the new float is less than 1, insert a / before it
If the new float is more than 1 insert a * before it
If it is 1 then insert a +.
If you see a zero just don't divide or multiply
This would solve it for all positive floats.
Now let's handle the case of negative numbers thrown into the mix.
Scan the input once to figure out how many negative numbers you have.
Isolate all the negative numbers in a list, convert all the numbers whose absolute value is less than 1 to the multiplicative inverse. Then sort them by magnitude. If you have an even number of elements we are all good. If you have an odd number of elements, store the head of this list in a special var, say k, and associate a processed flag with it, setting the flag to False.
Proceed as before with some updated rules
If you see a negative number less than 0 but more than -1, insert a / before it
If you see a negative number less than -1, insert a * before it
If you see the special var and the processed flag is False, insert a - before it. Set processed to True.
There is one more optimization you can perform, which is removing pairs of negative ones as candidates for blanket subtraction from our initial negative numbers list, but this is just an edge case and I'm pretty sure your interviewer won't care.
Now the sum is only a function of the number you are adding and not the sum you are adding to :)
This computes max/min results for each operation from the previous step. Not sure about overall correctness.
Time complexity O(n), space complexity O(n)
const max_value = (nums) => {
    const ops = [(a, b) => a+b, (a, b) => a-b, (a, b) => a*b, (a, b) => a/b]
    const dp = Array.from({length: nums.length}, _ => [])
    dp[0] = Array.from({length: ops.length}, _ => [nums[0], nums[0]])
    for (let i = 1; i < nums.length; i++) {
        for (let j = 0; j < ops.length; j++) {
            let mx = -Infinity
            let mn = Infinity
            for (let k = 0; k < ops.length; k++) {
                if (nums[i] === 0 && k === 3) {
                    // If current number is zero, removing division
                    ops.splice(3, 1)
                    dp.splice(3, 1)
                    continue
                }
                const opMax = ops[j](dp[i-1][k][0], nums[i])
                const opMin = ops[j](dp[i-1][k][1], nums[i])
                mx = Math.max(opMax, opMin, mx)
                mn = Math.min(opMax, opMin, mn)
            }
            dp[i].push([mx, mn])
        }
    }
    return Math.max(...dp[nums.length-1].map(v => Math.max(...v)))
}
// Tests
console.log(max_value([1, 12, 3]))
console.log(max_value([1, 0, 3]))
console.log(max_value([17,-34,2,-1,3,-4,5,6,7,1,2,3,-5,-7]))
console.log(max_value([59, 60, -0.000001]))
console.log(max_value([0, 1, -0.0001, -1.00000001]))

Dynamic programming sum

How would you use dynamic programming to find the list of positive integers in an array whose sum is closest to but not equal to some positive integer K?
I'm a little stuck thinking about this.
The usual phrasing for this is that you're looking for the value closest to, but not exceeding K. If you mean "less than K", it just means that your value of K is one greater than the usual. If you truly mean just "not equal to K", then you'd basically run through the algorithm twice, once finding the largest sum less than K, then again finding the smallest sum greater than K, then picking the one of those whose absolute difference from K is the smallest.
For the moment I'm going to assume you really mean the largest sum less than or equal to K, since that's the most common formulation; the other possibilities don't really have much effect on the algorithm.
The basic idea is fairly simple, though it (at least potentially) uses a lot of storage. We build a table with K+1 columns and N+1 rows (where N = number of inputs). We initialize the first row in the table to 0's.
Then we start walking through the table, and building the best value we can for each possible maximum value up to the real maximum, going row by row so we start with only a single input, then two possible inputs, then three, and so on. At each spot in the table, there are only two possibilities for the best value: the previous best value that doesn't use the current input, or else the current input plus the previous best value for the maximum minus the current input (and since we compute the table values in order, we'll always already have that value).
We also usually want to keep track of which items were actually used to produce the result. To do that, we set a Boolean for a given spot in the table to true if and only if we compute a value for that spot in the table using the new input for that row (rather than just copying the previous row's best value). The best result is in the bottom, right-hand corner, so we start there, and walk backward through the table one row at a time. When we get to a row where the Boolean for that column was set to true, we know we found an input that was used. We print out that item, and then subtract that from the total to get the next column to the left where we'll find the next input that was used to produce this output.
Here's an implementation that's technically in C++, but written primarily in a C-like style to make each step as explicit as possible.
#include <iostream>
#include <functional>

#define elements(array) (sizeof(array)/sizeof(array[0]))

int main() {
    // Since we're assuming subscripts from 1..N, I've inserted a dummy value
    // for v[0].
    int v[] = {0, 7, 15, 2, 1};

    // For the moment I'm assuming a maximum <= MAX.
    const int MAX = 17;
    // ... but if you want to specify K as the question implies, where sum<K,
    // you can get rid of MAX and just specify K directly:
    const int K = MAX + 1;

    const int rows = elements(v);

    int table[rows][K] = {0};
    bool used[rows][K] = {false};

    for (int i=1; i<rows; i++)
        for (int c = 0; c<K; c++) {
            int prev_val = table[i-1][c];
            int new_val;
            // we compute new_val inside the if statement so we won't
            // accidentally try to use a negative column from the table if v[i]>c
            if (v[i] <= c && (new_val=v[i]+table[i-1][c-v[i]]) > prev_val) {
                table[i][c] = new_val;
                used[i][c] = true;
            }
            else
                table[i][c] = prev_val;
        }

    std::cout << "Result: " << table[rows-1][MAX] << "\n";
    std::cout << "Used items were:\n";

    int column = MAX;
    for (int i=rows-1; i>-1; i--)
        if (used[i][column]) {
            std::cout << "\tv[" << i << "] = " << v[i] << "\n";
            column -= v[i];
        }
    return 0;
}
There are a couple of things you'd normally optimize in this (that I haven't for the sake of readability). First, if you reach an optimum sum, you can stop searching, so in this case we could actually break out of the loop before considering the final input of 1 at all (since 15 and 2 give the desired result of 17).
Second, in the table itself we only really use two rows at any given time: one current row and one previous row. The rows before that (in the main table) are never used again (i.e., to compute row[n] we need the values from row[n-1], but not row[n-2], row[n-3], ... row[0]). To reduce storage, we can make the main table be only two rows, and we swap between the first and second rows. A very C-like trick to do that would be to use only the least significant bit of the row number, so you'd replace table[i] and table[i-1] with table[i&1] and table[(i-1)&1] respectively (but only for the main table -- not when addressing the used table).
Here is an example in python:
def closestSum(a, k):
    s = {0: []}
    for x in a:
        ns = dict(s)
        for j in s:
            ns[j + x] = s[j] + [x]
        s = ns
    if k in s:
        del s[k]
    return s[min(s, key=lambda i: abs(i - k))]
Example:
>>> print closestSum([1,2,5,6],10)
[1, 2, 6]
The idea is simply to keep track of what sums can be made from all previous elements as you go through the array, as well as one way to make that sum. At the end, you just pick the closest to what you want. It is a dynamic programming solution because it breaks the overall problem down into sub-problems, and uses a table to remember the results of the sub-problems instead of recalculating them.
Cato's idea in Racket:
#lang racket

(define (closest-sum xs k)
  (define h (make-hash '([0 . ()])))
  (for* ([x xs] [(s ys) (hash-copy h)])
    (hash-set! h (+ x s) (cons x ys))
    (hash-set! h x (list x)))
  (when (hash-ref h k #f) (hash-remove! h k))
  (cdr (argmin (λ (a) (abs (- k (car a)))) (hash->list h))))
To get an even terser program, one can grab terse-hash.rkt from GitHub and write:
(define (closest-sum xs k)
  (define h {make '([0 . ()])})
  (for* ([x xs] [(s ys) {copy h}])
    {! h (+ x s) (cons x ys)}
    {! h x (list x)})
  (when {h k #f} {remove! h k})
  (cdr (argmin (λ (a) (abs (- k (car a)))) {->list h})))

F# Efficiently removing n items from the end of a Set

I know I can remove the last element from a set:
s.Remove(s.MaximumElement)
But if I want to remove the n maximum elements... do I just execute the above n times, or is there a faster way to do that?
To be clear, this is an obvious solution:
let rec removeLastN (s : Set<'a>, num : int) : Set<'a> =
    match num with
    | 0 -> s
    | _ -> removeLastN (s.Remove(s.MaximumElement), num-1)
But it involves creating a new set n times. Is there a way to do it and only create a new set once?
But it involves creating a new set n times. Is there a way to do it and only create a new set once?
To the best of my knowledge, no. I'd say what you have is a perfectly fine implementation; each removal runs in O(lg n) -- and it's concise too :) Most heap implementations give you O(lg n) for delete min anyway, so what you have is about as good as you can get it.
You might be able to get a little better speed by rolling your own balanced tree, and implementing a function to drop a left or right branch for all values greater than a certain value. I don't think an AVL tree or RB tree are appropriate in this context, since you can't really maintain their invariants, but a randomized tree will give you the results you want.
A treap works awesome for this, because it uses randomization rather than tree invariants to keep itself relatively balanced. Unlike an AVL tree or a RB-tree, you can split a treap on a node without worrying about it being unbalanced. Here's a treap implementation I wrote a few months ago:
http://pastebin.com/j0aV3DJQ
I've added a split function, which allows you to take a tree and return two trees containing all values less than and all values greater than a given value. split runs in O(lg n) using a single pass through the tree, so you can prune entire branches of your tree in one shot -- provided that you know which value to split on.
But if I want to remove the n maximum elements... do I just execute the above n times, or is there a faster way to do that?
Using my Treap class:
open Treap

let nthLargest n t = Seq.nth n (Treap.toSeqBack t)

let removeTopN n t =
    let largest = nthLargest n t
    let smallerValues, wasFound, largerValues = t.Split(largest)
    smallerValues

let e = Treap.empty(fun (x : int) (y : int) -> x.CompareTo(y))
let t = [1 .. 100] |> Seq.fold (fun (acc : Treap<_>) x -> acc.Insert(x)) e
let t' = removeTopN 10 t
removeTopN runs in O(n + lg m) time, where n is the index into the tree sequence and m is the number of items in the tree.
I make no guarantees about the accuracy of my code, use at your own peril ;)
In F#, you can use Set.partition or Set.filter to create sub sets:
let s = Set([1;4;6;9;100;77])
let a, b = Set.partition (fun x -> x <= 10) s
let smallThan10 = Set.filter (fun x -> x < 10) s
In your question, maybe you don't know the value of the ith number of your set, so here is a handy function for that:
let nth (n:int) (s:'a Set) =
    s |> Set.toSeq |> Seq.nth n
Now, we can write the remove-top-n function:
let removeTopN n (s:'a Set) =
    let size = s.Count
    let m = size - n
    let mvalue = nth m s
    Set.filter (fun x -> x < mvalue) s
and test it:
removeTopN 3 s
and we get:
val it : Set<int> = set [1; 4; 6]
Notice that removeTopN does not work for a set containing multiple equal values.
That is already a pretty good solution. OCaml has a split function that can split a Set, so you can find the right element and then split the Set to remove a bunch of elements at a time. Alternatively, you can use Set.difference to extract another Set of elements.
