Most efficient data structure for finding most frequent items - sorting

I want to extract the most frequent words from the Google N-Grams dataset, which is about 20 GB in its uncompressed form. I don't want the whole data set re-sorted, just the 5000 most frequent entries. But if I write
take 5000 $ sortBy (flip $ comparing snd) dataset
-- dataset :: IO [(word::String, frequency::Int)]
it's going to take forever. What should I do instead?
I know there is the Data.Array.MArray package for in-place array computation, but I cannot see any function for modifying items on its documentation page. There is also Data.HashTable.IO, but it's an unordered data structure.
I'd like to use a simple Data.IntMap.Strict (with its convenient lookupLE function), but I don't think it would be very efficient because it produces a new map on each alteration. Could the ST monad improve that?
UPD: I've also posted the final version of the program on CodeReview.SX.

How about
using splitAt to divide the data set into the first 5000 items and the rest.
sort the first 5000 items by frequency (ascending)
go through the rest
if an item has greater frequency than the lowest freq in the sorted items
drop the lowest frequency item from the sorted items
insert the new item in its proper place in the sorted items
The process then becomes effectively linear, though the coefficient is improved if you use a data structure for the sorted 5000 elements that has sublinear min-delete and insertion.
For example, using Data.Heap from the heap package:
import Data.List (foldl')
import Data.Maybe (fromJust)
import Data.Heap hiding (splitAt)
mostFreq :: Int -> [(String, Int)] -> [(String, Int)]
mostFreq n dataset = final
  where
    -- change our pairs from (String,Int) to (Int,String)
    pairs = map swap dataset
    -- get the first `n` pairs in one list, and the rest of the pairs in another
    (first, rest) = splitAt n pairs
    -- put all the first `n` pairs into a MinHeap
    start = fromList first :: MinHeap (Int, String)
    -- then run through the rest of the pairs
    stop = foldl' step start rest
    -- modifying the heap to replace its least frequent pair
    -- with the new pair if the new pair is more frequent
    step heap pair = if viewHead heap < Just pair
                       then insert pair (fromJust $ viewTail heap)
                       else heap
    -- turn our heap of (Int, String) pairs into a list of (String,Int) pairs
    final = map swap (toList stop)
    swap ~(a,b) = (b,a)
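The same steps can also be sketched with Data.Set from the containers package standing in for the heap, for readers who don't have the heap package installed. This is only an illustration under some assumptions: the name topN is made up here, each word is assumed to appear once in the dataset (a Set silently drops identical pairs), and the result is returned in descending order of frequency.

```haskell
import Data.List (foldl')
import qualified Data.Set as Set

-- Keep only the n most frequent (word, count) pairs.
-- Assumption: every word occurs at most once in the input.
topN :: Int -> [(String, Int)] -> [(String, Int)]
topN n dataset
  | n <= 0    = []
  | otherwise = [ (w, c) | (c, w) <- Set.toDescList final ]
  where
    -- store pairs as (count, word) so the set is ordered by count
    (first, rest) = splitAt n [ (c, w) | (w, c) <- dataset ]
    final = foldl' step (Set.fromList first) rest
    -- replace the current minimum only when the new pair beats it
    step kept pair = case Set.minView kept of
      Just (smallest, others)
        | pair > smallest -> Set.insert pair others
      _ -> kept
```

Both Set.minView and Set.insert are O(log n) in the size of the set, which here never exceeds n, so each input element costs at most a logarithmic amount of work in the 5000 kept entries.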

Have you tried this or are you just guessing? Because many Haskell sort functions respect laziness and when you ask for only the top 5000 they'll happily avoid sorting the rest of those elements.
Similarly, be very careful with "it produces a new map on each alteration". Most insert operations are going to be O(log n) on this sort of data structure, with n bounded to 5000: so you might be allocating ~30 new cells in the heap on each alteration, but that's not a particularly huge cost, certainly not as huge as 5000.
What you'd want instead, if Data.List.sort doesn't work well enough, is something like:
import Data.List (foldl')
import Data.IntMap.Strict (IntMap)
import qualified Data.IntMap.Strict as IM
type Freq = Int
type Count = Int
-- tracking maps a frequency to all strings seen with that frequency
data Summarizer x = Summ { tracking :: !(IntMap [x]), least :: !Freq,
                           size :: !Count, size_of_least :: !Count }
inserting :: x -> Maybe [x] -> Maybe [x]
inserting x Nothing = Just [x]
inserting x (Just xs) = Just (x:xs)
-- drop the least-frequent bucket once the other buckets already hold 5000 items
sizeLimit :: Summarizer x -> Summarizer x
sizeLimit skip@(Summ strs f_l tot lst)
    | tot - lst < 5000 = skip
    | otherwise = Summ strs' f_l' tot' lst'
  where ((_, discarded), strs') = IM.deleteFindMin strs
        (f_l', new_least) = IM.findMin strs'
        tot' = tot - length discarded
        lst' = length new_least
-- add one (string, frequency) pair, skipping it if it cannot make the top 5000
addEl :: (x, Freq) -> Summarizer x -> Summarizer x
addEl (str, f) skip@(Summ strs f_l tot lst)
    | f < f_l && tot >= 5000 = skip
    | otherwise = sizeLimit $ Summ strs' f_l' tot' lst'
  where strs' = IM.alter (inserting str) f strs
        tot' = tot + 1
        f_l' = min f_l f
        lst' = case compare f_l f of LT -> lst; EQ -> lst + 1; GT -> 1
Notice that we store lists of strings to handle duplicate frequencies; we mostly skip updating, and when we do update it's an O(log n) operation to put the new element in and sometimes (depending on duplication again) an O(log n) operation to prune out the smallest elements, and an O(log n) operation to find the new smallest ones.

Related

Random number generation in OCaml

Writing in a strict functional language constrains how you structure a program. I need to generate a large quantity of pseudo-random numbers in OCaml, and I'm not sure I'm going about it the best way for such a language.
What I did was create a module with a function (gen) that takes an integer size and an empty list and returns a list of size pseudo-random numbers. The problem is that when size is too large, it fails with a stack overflow, which is to be expected.
Should I use tail recursion? Is there a better method that I'm not aware of?
module RNG =
struct
(* Append a number n in the end of the list l *)
let rec append l n =
match l with
| [] -> [n]
| h :: t -> h :: (append t n)
(* Generate a list l with size random numbers *)
let rec gen size l =
if size = 0 then
l
else
let n = Random.int 1000000 in
let list = append l n in
gen (size - 1) list
end
Testing the code to generate a billion pseudo random numbers returns:
# let l = RNG.gen 1000000000 [];;
Stack overflow during evaluation (looping recursion?).
The problem is that the append function is not tail recursive. Each recursive call uses a bit of stack space to store its state, and as the list gets longer the append function takes more and more stack space. At some point the stack simply isn't big enough and the code fails.
As you suggested in the question, the way to avoid that is to use tail recursion. When working with lists, that usually means constructing the list in reverse order. The append function then becomes simply ::.
If the order of the resulting list is important, the list needs to be reversed at the end. So it is not uncommon to see code returning List.rev acc. This takes O(n) time but only constant stack space, because List.rev is tail recursive. So the stack is no limit there.
So your code would become:
let rec gen size l =
if size = 0 then
List.rev l
else
let n = Random.int 1000000 in
let list = n :: l in
gen (size - 1) list
A few more things to optimize:
When building a result bit by bit through recursion, the result is usually named acc, short for accumulator, and passed first:
let rec gen acc size =
if size = 0 then
List.rev acc
else
let n = Random.int 1000000 in
let list = n :: acc in
gen list (size - 1)
This then allows the use of function and pattern matching instead of the size argument and if construct:
let rec gen acc = function
| 0 -> List.rev acc
| size ->
let n = Random.int 1000000 in
let list = n :: acc in
gen list (size - 1)
A list of random numbers is usually just as good reversed. Unless you want lists of different sizes generated from the same seed to begin with the same sequence of numbers, you can skip the List.rev. And n :: acc is such a simple construct that one usually doesn't bind it to a variable.
let rec gen acc = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
gen (n :: acc) (size - 1)
And last, you can take advantage of optional arguments. While that makes the code a bit more complex to read, it greatly simplifies its use:
let rec gen ?(acc=[]) = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
gen ~acc:(n :: acc) (size - 1)
# gen 5;;
- : int list = [180439; 831641; 180182; 326685; 809344]
You no longer need to specify the empty list to generate a list of random numbers.
Note: An alternative way is to use a wrapper function:
let gen size =
let rec loop acc = function
| 0 -> acc
| size ->
let n = Random.int 1000000 in
loop (n :: acc) (size - 1)
in loop [] size
It would be a big improvement to generate your list in reverse order, then reverse it once at the end. Adding successive values to the end of a list is very slow. Adding to the front of a list can be done in constant time.
Even better, just generate the list in reverse order and return it that way. Do you care that the list is in the same order that the values were generated?
Why do you need to compute the full list explicitly? Another option might be to generate the elements lazily (and deterministically) using the new sequence module:
let rec random_seq state () =
let state' = Random.State.copy state in
Seq.Cons(Random.State.int state' 10, random_seq state')
Then the random sequence random_seq state is fully determined by the initial state state: it can be both reused without troubles and only generate new elements as needed.
The standard List module has an init function you can use to write all this in one line:
let upperbound = 10
let gen size =
List.init size (fun _ -> Random.int upperbound)

Haskell foldl' not saving the space it was expected to

I am trying to implement the straightforward dynamic programming algorithm for the Knapsack problem. Obviously this approach uses a lot of memory, so I am trying to optimize the memory used. I am simply trying to store only the previous row of my table in memory just long enough to compute the next row, and so on. At first I thought my implementation was solid, but it still ran out of memory, just like an implementation designed to store the whole table. So next I thought maybe I needed foldl' instead of foldr, but it did not make any difference. My program continues to eat memory until my system runs out.
So I have 2 specific questions:
What is it about my code that is using up all the memory? I thought I was being clever by using a fold, because I assumed only the current value of the accumulator would be stored in memory.
What is the proper approach for achieving my goal; that is, storing only the most recent row in memory? I don't necessarily need code, maybe just some helpful functions and data types. More generally, what are some tips and techniques for understanding memory usage in Haskell?
Here is my implementation
data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)
dynapack5 size items = finalR ! size
  where
    noItems = length items
    itemsArr = listArray(1,noItems) items
    row = listArray(1,size) (replicate size (0,[]))
    computeRow row item =
      let w = ksWeight item
          v = ksValue item
          idx = ksItem item
          pivot = let (lastVal, selections) = row ! w
                  in if v > lastVal
                       then (v, [idx])
                       else (lastVal, selections)
          figure r c =
              if (prevVal + v) > lastVal
                then (prevVal + v, prevItems ++ [idx])
                else (lastVal, lastItems)
            where (lastVal, lastItems) = (r ! c)
                  (prevVal, prevItems) = (r ! (c - w))
          theRest = [ (figure row cw) | cw <- [(w+1)..size] ]
          newRow = (map (row!) [1..(w-1)]) ++
                   [pivot] ++
                   theRest
      in listArray (1,size) newRow
    finalR = foldl' computeRow row items
In my head, what I think this is doing is initializing the first row to (0,[])... repeated as necessary, then kicking off the fold where the next row is calculated based on the supplied row, and this value then becomes the accumulator. I'm not seeing where more and more memory is being consumed...
Random thought: what if i used the \\ operator on the accumulator instead?
As Tom Ellis said, using force on the array solves the space issues. However, it is extremely slow, because force traverses all the lists in the array from start to end each time it is invoked. So we should only force as needed:
let res = listArray (1,size) newRow in force (map fst $ elems res) `seq` res
This fixes the space leak and it's also pretty fast.
If you want to take space efficiency to the logical next step, you could use bitsets of the indices of the items instead of lists of items. Integers are good for the job here since they automatically resize themselves to accommodate the highest set bit. Also, with Integer-s forcing is straightforward:
import qualified Data.Vector as V -- using this instead of Array cause I like it more
import Data.List
import Control.Arrow
import Data.Bits
import Control.DeepSeq
data KSItem a = KSItem { ksItem :: a, ksValue :: Int, ksWeight :: Int} deriving (Eq, Show, Ord)
dynapack5' :: Int -> [KSItem a] -> (Int, Integer)
dynapack5' size items = V.last solutions where
items' = [KSItem i v w | (i, KSItem _ v w) <- zip [0..] items]
solutions = foldl' add (V.replicate (size + 1) (0, 0::Integer)) items'
add arr (KSItem item currVal w) = force $ V.imap go arr where
go i (v, is) | w < i && v' > v = (v', is')
| otherwise = (v, is)
where (v', is') = (+currVal) *** (`setBit` item) $ arr V.! (i - w)
Data.Array is non-strict in its elements so even though foldl' forces it to WHNF each time around the loop the contents don't get evaluated. The simplest fix would be to import Control.DeepSeq and change
in listArray (1,size) newRow
to
in force (listArray (1,size) newRow)
This is doing more work than strictly necessary each time around the loop, but will do the job.
Unfortunately you can't just substitute unboxed arrays here, since your arrays contain a tuple containing a list.
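To make the point about non-strict array elements concrete, here is a minimal sketch that keeps only the latest row and forces every cell before the row becomes the next accumulator. It is deliberately simplified and is not the code from the question: it tracks only the best value (no item lists), items are plain (value, weight) pairs, and the names forceElems and knapsackValue are invented for this example. Because the cells are plain Ints, forcing each one to weak head normal form with seq is enough, so Control.DeepSeq is not needed here.

```haskell
import Data.Array (Array, listArray, elems, (!))
import Data.List (foldl')

-- Evaluate every element of the array before handing the array back.
forceElems :: Array Int Int -> Array Int Int
forceElems arr = foldr seq arr (elems arr)

-- Best total value for a 0/1 knapsack of the given capacity.
-- Items are (value, weight) pairs; only one row is alive at a time.
knapsackValue :: Int -> [(Int, Int)] -> Int
knapsackValue size items = foldl' step row0 items ! size
  where
    row0 = listArray (0, size) (replicate (size + 1) 0)
    -- each new row is built from the previous one, then fully forced
    step row (v, w) = forceElems $ listArray (0, size)
      [ if c >= w then max (row ! c) (row ! (c - w) + v) else row ! c
      | c <- [0 .. size] ]
```

Without forceElems, foldl' would only evaluate each new array to its outermost constructor, and every cell would remain a thunk holding on to the previous row.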

Finding unique (as in only occurring once) element haskell

I need a function which takes a list and return unique element if it exists or [] if it doesn't. If many unique elements exists it should return the first one (without wasting time to find others).
Additionally I know that all elements in the list come from (small and known) set A.
For example this function does the job for Ints:
unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
where first [] = []
first (x:xs) = x
ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
[1]
This is however not good enough because it involves sorting (n log n) while it could be done in linear time (because A is small).
Additionally, it requires the type of the list elements to be Ord, while all that should be needed is Eq. It would also be nice if the number of comparisons were as small as possible (i.e. if we traverse a list and encounter element el twice, we don't test subsequent elements for equality with el).
This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.
The question is: how to do it correctly and efficiently in Haskell ?
Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.
import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative
Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).
counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
where
insert t x = IntTrie.modify' (fromEnum x) (+1) t
We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).
And then look for ones that are unique.
uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
where
cts = counts xs
This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.
uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).
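For readers without the IntTrie package, the same counting plan can be sketched with Data.IntMap.Strict from containers. This is a stand-in, not the code above: the name uniquesIM is made up for the example, and an IntMap only stores keys that actually occur, so missing keys are treated as count 0.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Elements of the domain `dom` that occur exactly once in `xs`,
-- reported in the order in which `dom` lists them.
uniquesIM :: Enum a => [a] -> [a] -> [a]
uniquesIM dom xs = filter occursOnce dom
  where
    -- one pass over the list, counting by fromEnum key
    cts = foldl' (\m x -> IM.insertWith (+) (fromEnum x) 1 m) IM.empty xs
    occursOnce x = IM.findWithDefault 0 (fromEnum x) cts == 1
```

As with uniques, taking the head of the result stops scanning the domain at the first hit, but the whole input list has already been consumed to build the counts.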
There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.
Also, note that to avoid useless comparisons you'd need a way of checking to see if an element has been encountered before, and the only way to do that would be to have a list of elements known to have multiple occurrences, and the only way to check if the current element is in that list is... to compare it for equality with each.
If you want this to work faster than O(something really horrible) you need that Ord constraint.
Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:
unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
| x `notElem` candidates = unique candidates results xs
| x `elem` results = unique (delete x candidates) (delete x results) xs
| otherwise = unique candidates (x:results) xs
The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.
If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.
Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.
Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.
First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.
Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.
If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:
firstUnique (x:xs)
| elem x xs = firstUnique (filter (/= x) xs)
| otherwise = Just x
firstUnique [] = Nothing
Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.
Edit:
The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...
However, an implementation that provides an early out in case of set exhaustion:
firstUnique = f [] [<small/known set of possible elements>] where
f [] [] _ = Nothing -- early out
f uniques noshows (x:xs)
| elem x uniques = f (delete x uniques) noshows xs
| elem x noshows = f (x:uniques) (delete x noshows) xs
| otherwise = f uniques noshows xs
f [] _ [] = Nothing
f (u:_) _ [] = Just u
Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...
As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.
If we are able to compare elements, an obvious O(n log n) solution is to compute the count of each element first and then find the first one with a count equal to 1:
import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)
count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m
add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m
uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
where
cs = foldl' add Map.empty xs
Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k unique elements then the size of our map will be at most k, so the overall complexity will be just O(n log k).
However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:
import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)
count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)
add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)
uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
-- Count all elements into a hash table:
ht <- HT.newSized (length xs)
forM_ xs (add ht)
-- Find the first one with count 1
first (\x -> liftM (== 1) (count ht x)) xs
-- Monadic variant of find which exits once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
where
f [] = return Nothing
f (x:xs') = do
b <- p x
if b then return (Just x)
else f xs'
Notes:
If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs, but with many distinct elements the hash table will have to be resized several times.
Here is a version that does the trick:
unique :: Eq a => [a] -> [a]
unique = select . collect []
  where
    collect acc [] = acc
    collect acc (x : xs) = collect (insert x acc) xs
    insert x [] = [[x]]
    insert x (ys@(y : _) : yss)
      | x == y = (x : ys) : yss
      | otherwise = ys : insert x yss
    select [] = []
    select ([x] : _) = [x]
    select ((_ : _) : xss) = select xss
So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).
The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.
Something like this looks pretty good.
unique = fst . foldl' (\(a, b) c -> if (c `elem` b)
then (a, b)
else if (c `elem` a)
then (delete c a, c:b)
else (c:a, b)) ([],[])
The first component of the tuple produced by the fold is what you are after: a list of the elements that occur exactly once. The second component is the fold's memory: the elements that have already been discarded because they occurred more than once.
About space performance.
As your problem is stated, every element of the list has to be visited at least once before a result can be returned. The algorithm must also keep track of the discarded values in addition to the good ones, but each discarded value is stored only once. So in the worst case the memory required is proportional to the size of the input list. That sounds acceptable, since you said the set of possible elements is small.
About time performance.
As the expected input is small and not sorted to begin with, sorting the list, either inside the algorithm or before applying it, is not worthwhile. Roughly speaking, placing an element at its ordered position in the sublists a and b of the tuple (a, b) costs about as much as checking whether the element already appears in them.
Below is a nicer and more explicit version of the foldl' one.
import Data.List (foldl', delete, elem)
unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], [])
where
algorithm (result0, memory0) current =
if (current `elem` memory0)
then (result0, memory0)
else if (current`elem` result0)
then (delete current result0, memory)
else (result, memory0)
where
result = current : result0
memory = current : memory0
In the nested if ... then ... else ... expression the list result is traversed twice in the worst case; this can be avoided using the following helper function.
unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], [])
where
algorithm (result, memory) current =
if (current `elem` memory)
then (result, memory)
else helper current result memory []
where
helper current [] [] acc = ([current], [])
helper current [] memory acc = (acc, memory)
helper current (r:rs) memory acc
| current == r = (acc ++ rs, current:memory)
| otherwise = helper current rs memory (r:acc)
But the helper can be rewritten using a fold as follows, which is definitely nicer.
helper current [] _ = ([current],[])
helper current memory result =
foldl' (\(r, m) x -> if x==current
then (r, current:m)
else (current:r, m)) ([], memory) $ result

sorting integers fast in haskell

Is there any function in the Haskell libraries that sorts integers in O(n) time? [By O(n) I mean faster than a comparison sort and specific to integers.]
Basically I find that the following code takes a lot of time with the sort (as compared to summing the list without sorting) :
import System.Random
import Control.DeepSeq
import Data.List (sort)
genlist gen = id $!! sort $!! take (2^22) ((randoms gen)::[Int])
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
Summing a list doesn't require deepseq but what I am trying for does, but the above code is good enough for the pointers I am seeking.
Time : 6 seconds (without sort); about 35 seconds (with sort)
Memory : about 80 MB (without sort); about 310 MB (with sort)
Note 1 : memory is a bigger issue than time for me here, as for the task at hand I am getting out-of-memory errors (memory usage reaches 3 GB after 30 minutes of run time).
I am assuming faster algorithms will have a better memory footprint too, hence I am looking for O(n) time.
Note 2 : I am looking for fast algorithms for Int64, though fast algorithms for other specific types will also be helpful.
Solution Used : IntroSort with unboxed vectors was good enough for my task:
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as I
sort :: [Int] -> [Int]
sort = V.toList . V.modify I.sort . V.fromList
I would consider using vectors instead of lists for this, as lists have a lot of overhead per-element while an unboxed vector is essentially just a contiguous block of bytes. The vector-algorithms package contains various sorting algorithms you can use for this, including radix sort, which I expect should do well in your case.
Here's a simple example, though it might be a good idea to keep the result in vector form if you plan on doing further processing on it.
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Radix as R
sort :: [Int] -> [Int]
sort = V.toList . V.modify R.sort . V.fromList
Also, I suspect that a significant portion of the run time of your example is coming from the random number generator, as the standard one isn't exactly known for its performance. You should make sure that you're timing only the sorting part, and if you need a lot of random numbers in your program, there are faster generators available on Hackage.
The idea to sort the numbers using an array is the right one for reducing the memory usage.
However, using the maximum and minimum of the list as bounds may cause excessive memory usage or even a runtime failure when maximum xs - minimum xs > (maxBound :: Int).
So I suggest writing the list contents to an unboxed mutable array, sorting that in place (e.g. with quicksort), and then building a list from that again.
import System.Random
import Control.DeepSeq
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST
import Control.Monad.ST
myqsort :: STUArray s Int Int -> Int -> Int -> ST s ()
myqsort a lo hi
| lo < hi = do
let lscan p h i
| i < h = do
v <- unsafeRead a i
if p < v then return i else lscan p h (i+1)
| otherwise = return i
rscan p l i
| l < i = do
v <- unsafeRead a i
if v < p then return i else rscan p l (i-1)
| otherwise = return i
swap i j = do
v <- unsafeRead a i
unsafeRead a j >>= unsafeWrite a i
unsafeWrite a j v
sloop p l h
| l < h = do
l1 <- lscan p h l
h1 <- rscan p l1 h
if (l1 < h1) then (swap l1 h1 >> sloop p l1 h1) else return l1
| otherwise = return l
piv <- unsafeRead a hi
i <- sloop piv lo hi
swap i hi
myqsort a lo (i-1)
myqsort a (i+1) hi
| otherwise = return ()
genlist gen = runST $ do
arr <- newListArray (0,2^22-1) $ take (2^22) (randoms gen)
myqsort arr 0 (2^22-1)
let collect acc 0 = do
v <- unsafeRead arr 0
return (v:acc)
collect acc i = do
v <- unsafeRead arr i
collect (v:acc) (i-1)
collect [] (2^22-1)
main = do
gen <- newStdGen
putStrLn $ show $ sum $ genlist gen
is reasonably fast and uses less memory. It still uses a lot of memory for the list: 2^22 Ints take 32 MB of raw storage (with 64-bit Ints), and with the list overhead of, IIRC, five words per element, that adds up to ~200 MB, but that is less than half of the original.
This is taken from Richard Bird's book, Pearls of Functional Algorithm Design, (though I had to edit it a little, as the code in the book didn't compile exactly as written).
import Data.Array(Array,accumArray,assocs)
sort :: [Int] -> [Int]
sort xs = concat [replicate k x | (x,k) <- assocs count]
where count :: Array Int Int
count = accumArray (+) 0 range (zip xs (repeat 1))
range = (0, maximum xs)
It works by creating an Array indexed by integers where the values are the number of times each integer occurs in the list. Then it creates a list of the indexes, repeating them the same number of times they occurred in the original list according to the counts.
You should note that it is linear with the maximum value in the list, not the length of the list, so a list like [ 2^x | x <- [0..n] ] would not be sorted linearly.
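If the values are sparse or negative, the counting idea can still be sketched with Data.IntMap.Strict from containers in place of the Array, so that memory grows with the number of distinct values rather than with maximum xs. This is a swapped-in variant, not Bird's code, and the name countingSort is chosen just for this example. IntMap operations cost O(min(n, W)) with W the number of bits in an Int, so whether this actually beats a good comparison sort is something to measure, not assume.

```haskell
import qualified Data.IntMap.Strict as IM

-- Counting sort over an IntMap: count each value, then emit
-- the keys in ascending order, repeated according to their counts.
countingSort :: [Int] -> [Int]
countingSort xs = concat [ replicate k x | (x, k) <- IM.toAscList counts ]
  where
    counts = IM.fromListWith (+) [ (x, 1) | x <- xs ]
```

A list such as [2^x | x <- [0..n]] is no longer a problem for memory here, since only n + 1 keys are stored.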

Select random element from a set, faster than linear time (Haskell)

I'd like to create this function, which selects a random element from a Set:
randElem :: (RandomGen g) => Set a -> g -> (a, g)
Simple listy implementations can be written. For example (code updated, verified working):
import Data.Set as Set
import System.Random (getStdGen, randomR, RandomGen)
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem s g = (Set.toList s !! n, g')
where (n, g') = randomR (0, Set.size s - 1) g
-- simple test drive
main = do g <- getStdGen
print . fst $ randElem s g
where s = Set.fromList [1,3,5,7,9]
But using !! incurs a linear lookup cost for large (randomly selected) n. Is there a faster way to select a random element in a Set? Ideally, repeated random selections should produce a uniform distribution over all options, meaning it does not prefer some elements over others.
Edit: some great ideas are popping up in the answers, so I just wanted to throw a couple more clarifications on what exactly I'm looking for. I asked this question with Sets as the solution to this situation in mind. I'll prefer answers that both
avoid using any outside-the-function bookkeeping beyond the Set's internals, and
maintain good performance (better than O(n) on average) even though the function is only used once per unique set.
I also have this love of working code, so expect (at minimum) a +1 from me if your answer includes a working solution.
Data.Map has an indexing function (elemAt), so use this:
import qualified Data.Map as M
import Data.Map(member, size, empty)
import System.Random
type Set a = M.Map a ()
insert :: (Ord a) => a -> Set a -> Set a
insert a = M.insert a ()
fromList :: Ord a => [a] -> Set a
fromList = M.fromList . flip zip (repeat ())
elemAt i = fst . M.elemAt i
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem s g = (elemAt n s, g')
  where (n, g') = randomR (0, size s - 1) g
And you have something quite compatible with Data.Set (with respect to interface and performance) that also has an O(log n) indexing function and the randElem function you requested.
Note that randElem is O(log n) (and it's probably the fastest implementation you can get with this complexity), and all the other functions have the same complexity as in Data.Set. Let me know if you need any other specific functions from the Set API and I will add them.
As far as I know, the proper solution would be to use an indexed set -- i.e. an IntMap. You just need to store the total number of elements added along with the map. Every time you add an element, you add it with a key one higher than previously. Deleting an element is fine -- just don't alter the total elements counter. If, on looking up a keyed element, that element no longer exists, then generate a new random number and try again. This works until the total number of deletions dominates the number of active elements in the set. If that's a problem, you can keep a separate set of deleted keys to draw from when inserting new elements.
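A minimal sketch of that idea, under some assumptions: the Indexed type and its functions are hypothetical names, and draw stands in for System.Random's randomR (inclusive bounds in, value and next generator out), passed as an argument so the sketch needs nothing beyond containers.

```haskell
import qualified Data.IntMap.Strict as IM

-- Elements stored under consecutive integer keys, plus the next unused key.
-- The counter only ever grows, so deleted keys leave holes.
data Indexed a = Indexed !Int !(IM.IntMap a)

emptyIndexed :: Indexed a
emptyIndexed = Indexed 0 IM.empty

-- Insert under the next unused key; the key is returned so it can be deleted later.
insertIndexed :: a -> Indexed a -> (Int, Indexed a)
insertIndexed x (Indexed k m) = (k, Indexed (k + 1) (IM.insert k x m))

-- Delete by key; the counter is deliberately left alone.
deleteIndexed :: Int -> Indexed a -> Indexed a
deleteIndexed k (Indexed n m) = Indexed n (IM.delete k m)

-- Draw keys until one is still present (assumption: draw behaves like randomR).
randElemIndexed :: ((Int, Int) -> g -> (Int, g)) -> Indexed a -> g -> Maybe (a, g)
randElemIndexed draw (Indexed n m) g0
  | IM.null m = Nothing
  | otherwise = go g0
  where
    go g = case draw (0, n - 1) g of
      (k, g') -> maybe (go g') (\x -> Just (x, g')) (IM.lookup k m)
```

With the random package you would call it as randElemIndexed randomR s gen. The expected number of retries is the ratio of keys ever issued to keys still present, which is why heavy deletion hurts.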
Here's an idea: You could do interval bisection.
size s is constant time. Use randomR to get how far into the set you are selecting.
Do split with various values between the original findMin and findMax until you get the element at the position you want. If you really fear that the set is made up say of reals and is extremely tightly clustered, you can recompute findMin and findMax each time to guarantee knocking off some elements each time.
The worst-case performance would be O(n log n), basically no worse than your current solution, but under rather weak conditions (the set must not be entirely clustered around some accumulation point) the average performance should be about O((log n)^2), which is close to constant in practice. If it's a set of integers, you get O(log n * log m), where m is the initial range of the set; it's only reals that could cause really nasty performance in an interval bisection (or other data types whose order-type has accumulation points).
PS. This produces a perfectly even distribution, as long as you watch for off-by-ones so that the elements at the very top and bottom can also be selected.
Edit: added 'code'
Some inelegant, unchecked code, specialised here to sets of integers. No compiler on my current machine to smoke test, so there is a possibility of off-by-ones. One thing: check out how mid is generated; it'll need some tweaking depending on whether you are looking for something that works with sets of ints or reals (interval bisection is inherently topological, and oughtn't to work quite the same for sets with different topologies).
import qualified Data.Set as Set
import System.Random (RandomGen, randomR)
-- n-th smallest element (0-based); requires 0 <= n < Set.size s
getNth :: Integral a => Set.Set a -> Int -> a
getNth s n
  | n == 0              = lo
  | n + 1 == Set.size s = hi
  | n < nb              = getNth bott n
  | pres && n == nb     = mid
  | pres                = getNth top (n - nb - 1)
  | otherwise           = getNth top (n - nb)
  where
    (lo, hi) = (Set.findMin s, Set.findMax s)
    mid = lo + (hi - lo) `div` 2   -- integer midpoint; for reals use (hi - lo) / 2
    (bott, pres, top) = Set.splitMember mid s
    nb = Set.size bott
randElem s g = (getNth s n, g')
  where (n, g') = randomR (0, Set.size s - 1) g
As of containers-0.5.2.0 the Data.Set module has an elemAt function, which retrieves values by their zero-based index in the sorted sequence of elements. So it is now trivial to write this function:
import Control.Monad.Random
import Data.Set (Set)
import qualified Data.Set as Set
randElem :: (MonadRandom m, Ord a) => Set a -> m (a, Set a)
randElem xs = do
  n <- getRandomR (0, Set.size xs - 1)
  return (Set.elemAt n xs, Set.deleteAt n xs)
Since both Set.elemAt and Set.deleteAt are O(log n), where n is the number of elements in the set, the entire operation is O(log n).
If you had access to the internals of Data.Set, which is just a binary tree, you could recurse over the tree, at each node selecting one of the branches with probability according to their respective sizes. This is quite straightforward and gives you very good performance in terms of memory management and allocations, as you have no extra bookkeeping to do. On the other hand, you have to invoke the RNG O(log n) times.
A variant is to use Jonas’ suggestion: first take the size and select the index of the random element based on that, and then look it up with an elemAt function (yet to be added to Data.Set).
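The tree descent can be sketched without touching Data.Set itself. The Tree type, bin, fromAscList and randElemTree below are hypothetical stand-ins (Data.Set's real constructors are internal), and draw is assumed to behave like System.Random's randomR on an inclusive Int range. Note that the element stored at the node itself also needs its 1/size share, which the middle guard handles.

```haskell
-- A hypothetical size-annotated search tree, shaped like (but not) Data.Set's internals.
data Tree a = Tip | Bin !Int (Tree a) a (Tree a)

size :: Tree a -> Int
size Tip           = 0
size (Bin n _ _ _) = n

-- Smart constructor that keeps the cached size correct.
bin :: Tree a -> a -> Tree a -> Tree a
bin l x r = Bin (size l + size r + 1) l x r

-- Build a balanced tree from an ascending list.
fromAscList :: [a] -> Tree a
fromAscList [] = Tip
fromAscList xs = bin (fromAscList l) x (fromAscList r)
  where (l, x : r) = splitAt (length xs `div` 2) xs

-- At each node draw i in [0, size - 1]: go left with probability size l / size,
-- stop at this node with probability 1 / size, otherwise go right.
randElemTree :: ((Int, Int) -> g -> (Int, g)) -> Tree a -> g -> Maybe (a, g)
randElemTree _ Tip _ = Nothing
randElemTree draw (Bin n l x r) g
  | i < size l  = randElemTree draw l g'
  | i == size l = Just (x, g')
  | otherwise   = randElemTree draw r g'
  where (i, g') = draw (0, n - 1) g
```

Passing randomR as draw should give a uniform choice, at the cost of one RNG call per level of the tree.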
If you don't need to modify your set or need to modify it infrequently you can use arrays as lookup table with O(1) access time.
import qualified Data.Set as Set
import qualified Data.Vector as V
import System.Random (RandomGen, randomR)
newtype RandSet a = RandSet (V.Vector a)
randElem :: RandomGen g => RandSet a -> g -> (a, g)
randElem (RandSet v) g
  | V.null v  = error "Cannot select from empty set"
  | otherwise =
      let (i, g') = randomR (0, V.length v - 1) g
      in  (v V.! i, g')
-- Of course you have to rebuild the array on insertion/deletion, which is O(n)
insert :: Ord a => a -> RandSet a -> RandSet a
insert x (RandSet v) = RandSet . V.fromList . Set.toList . Set.insert x . Set.fromList $ V.toList v
This problem can be finessed a bit if you don't mind completely consuming your RandomGen. With splittable generators, this is an A-OK thing to do. The basic idea is to make a lookup table for the set:
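import Data.Array (listArray, (!))
import Data.Set (Set, size, toList)
import System.Random (RandomGen, randomRs)
randomElems :: RandomGen g => Set a -> g -> [a]
randomElems set = map (table !) . randomRs bounds where
  bounds = (1, size set)
  table = listArray bounds (toList set)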
This will have very good performance: it will cost you O(n+m) time, where n is the size of the set and m is the number of elements of the resulting list you evaluate. (Plus the time it takes to randomly choose m numbers in bounds, of course.)
Another way to achieve this might be to use Data.Sequence instead of Data.Set. This would allow you to add elements to the end in O(1) time and index elements in O(log n) time. If you also need to be able to do membership tests or deletions, you would have to use the more general fingertree package and use something like FingerTree (Sum 1, Max a) a. To insert an element, use the Max a annotation to find the right place to insert; this basically takes O(log n) time (for some usage patterns it might be a bit less). To do a membership test, do basically the same thing, so it's O(log n) time (again, for some usage patterns this might be a bit less). To pick a random element, use the Sum 1 annotation to do your indexing, taking O(log n) time (this will be the average case for uniformly random indices).
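For the plain Data.Sequence case, a minimal sketch might look like the following. The names addElem and randElemSeq are hypothetical, draw is assumed to behave like System.Random's randomR on an inclusive Int range, and nothing here removes duplicates, so keeping the sequence set-like is left to the caller.

```haskell
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Append at the right end, which is O(1) for Data.Sequence.
addElem :: a -> Seq a -> Seq a
addElem x s = s |> x

-- Pick an element by a drawn index; Seq.index is O(log(min(i, n - i))).
randElemSeq :: ((Int, Int) -> g -> (Int, g)) -> Seq a -> g -> Maybe (a, g)
randElemSeq draw s g
  | Seq.null s = Nothing
  | otherwise  = Just (Seq.index s i, g')
  where (i, g') = draw (0, Seq.length s - 1) g
```

With the random package this would be called as randElemSeq randomR s gen.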
I think maybe a lot of the answers are old here because now the set module has its own elemAt function to do this sort of thing very easily. Here's my implementation:
import Data.Set(Set)
import qualified Data.Set as S
import System.Random
randomMember :: RandomGen g => Set a -> g -> (Maybe a, g)
randomMember s g | null s = (Nothing, g)
                 | otherwise = let (a, g') = randomR (0, length s - 1) g
                               in (Just (S.elemAt a s), g')
Prelude functions null and length work on sets as Set is an instance of Foldable.
hskl> newStdGen >>= return . fst . randomMember (S.fromList [1..10])
Just 7
hskl> newStdGen >>= return . fst . randomMember (S.fromList [1..10])
Just 4
hskl> newStdGen >>= return . fst . randomMember (S.fromList [1..10])
Just 4
hskl> newStdGen >>= return . fst . randomMember (S.fromList [1..10])
Just 9
hskl> newStdGen >>= return . fst . randomMember (S.fromList [1..10])
Just 8
hskl> newStdGen >>= return . fst . randomMember S.empty
Nothing
hskl> newStdGen >>= return . fst . randomMember S.empty
Nothing

Resources