fast and concurrent algorithm of frequency calculation in elixir - algorithm

I have two big lists that their item's lengths isn't constant. Each list include millions items.
And I want to count frequency of items of first list in second list!
For example:
a = [[c, d], [a, b, e]]
b = [[a, d, c], [e, a, b], [a, d], [c, d, a]]
# expected result of calculate_frequency(a, b) is %{[c, d] => 2, [a, b, e] => 1} Or [{[c, d], 2}, {[a, b, e], 1}]
Due to the large size of the lists, I would like this process to be done concurrently.
So I wrote this function:
def calculate_frequency(items, data_list) do
items
|> Task.async_stream(
fn item ->
frequency =
data_list
|> Enum.reduce(0, fn data_row, acc ->
if item -- data_row == [] do
acc + 1
else
acc
end
end)
{item, frequency}
end,
ordered: false
)
|> Enum.reduce([], fn {:ok, merged}, merged_list -> [merged | merged_list] end)
end
But this algorithm is slow. What should I do to make it fast?
PS: Please do not consider the type of inputs and outputs, the speed of execution is important.

Not sure if this fast enough and certainly it's not concurrent. It's O(m + n) where m is the size of items and n is the size of data_list. I can't find a faster concurrent way because combining the result of all the sub-processes also takes time.
data_list
|> Enum.reduce(%{}, fn(item, counts)->
Map.update(counts, item, 1, &(&1 + 1))
end)
|> Map.take(items)
FYI, doing things concurrently does not necessarily mean doing things in parallel. If you have only one CPU core, concurrency actually slows things down because one CPU core can only do one thing at a time.

Put one list into a MapSet.
Go through the second list and see whether or not each element is in the MapSet.
This is linear in the lengths of the lists, and both operations should be able to be parallelized.

I would start by normalizing the data you want to compare so a simple equality check can tell if two items are "equal" as you would define it. Based on your code, I would guess Enum.sort/1 would do the trick, though MapSet.new/1 or a function returning a map may compare faster if it matches your use case.
defp normalize(item) do
Enum.sort(item)
end
def calculate_frequency(items, data_list) do
data_list = Enum.map(data_list, &normalize/1)
items = Enum.map(items, &normalize/1)
end
If you're going to get most frequencies from data list, I would then calculate all frequencies for data list. Elixir 1.10 introduced Enum.frequencies/1 and Enum.frequencies_by/2, but you could do this with a reduce if desired.
def calculate_frequency(items, data_list) do
data_frequencies = Enum.frequencies_by(data_list, &normalize/1) # does map for you
Map.new(items, &Map.get(data_frequencies, normalize(&1), 0)) # if you want result as map
end
I haven't done any benchmarks on my code or yours. If you were looking to do more asynchronous stuff, you could replace your mapping with Task.async_stream/3, and you could replace your frequencies call with a combination of Stream.chunk_every/2, Task.async_stream/3 (with Enum.frequencies/1 being the function), and Map.merge/3.

Related

Algorithm to find list given dot product and another list

I need to write a function findL that takes a list L1 of integers and a desired dot product n, and returns a list L2 of nonnegative integers such that L1 · L2 = n. (By "dot product" I mean the sum of the pairwise products; for example, [1,2] · [3,4] = 1·3+2·4 = 11.)
So, for example, findL(11, [1,2]) might return SOME [3,4]. If there's no possible list, I return NONE.
I'm using a functional language. (Specifically Standard ML, but the exact language isn't so important, I'm just trying to think of an FP algorithm.) What I have written so far:
Let's say I have findL(n, L1):
if L1 = [], I return NONE.
if L1 = [x] (list of length 1)
if (n >= 0 and x > 0 and n mod x = 0), return SOME [n div x]
else return NONE
If L1 has length greater than 1, I recurse on findL (n, L[1:]). If that returns a list L2, I return [1] concatenated to L2. If the recursive call returns NONE, I did another recursive call on findL (0, L[1:]) and prepended [n div x] to the result if it wasn't NONE. This works on many inputs but are failing on others.
I need to change part 3, but I'm not sure if I have the right idea. I would appreciate any tips!
Unless you need to say that empty lists in the input are always bad (even n = 0 with the list []), I'd recommend returning something different for an empty list based on whether you've reached 0 at the end (everything has been subtracted away) or not, then recurse when receiving any nonempty list rather than special-casing a one-element list.
As far as step three, you need to test every possible positive integer multiple of the first element of your input list until they exceed n, not just the first and last. The first non-None value you get is good enough, so you just prepend the multiplier (not the multiple) to the return list. If everything gives you Nones, you return None.
I don't know SML, but here's how I'd do it in Haskell:
import Data.Maybe (isJust, listToMaybe)
-- Find linear combinations of positive integers
solve :: Integer -> [Integer] -> Maybe [Integer]
-- If we've made it to the end with zero left, good!
solve 0 [] = Just []
-- Otherwise, this way isn't the way to go.
solve _ [] = Nothing
-- If one of the elements of the input list is zero, just multiply that element by one.
solve n (0:xs) = case solve n xs of
Nothing -> Nothing
Just ys -> Just (1:ys)
solve n (x:xs) = listToMaybe -- take first solution if it exists
. map (\ (m, Just ys) -> m:ys) -- put multiplier at front of list
. filter (isJust . snd) -- remove nonsolutions
. zip [1 ..] -- tuple in the multiplier
. map (\ m -> solve (n - m) xs) -- use each multiple
$ [x, x + x .. n] -- the multiples of x up to n
Here it is solving 11 with [1, 2] and 1 with [1, 2].

Is `[<var> in <distributed variable>]` equivalent to `forall`?

I noticed something in a snippet of code I was given:
var D: domain(2) dmapped Block(boundingBox=Space) = Space;
var A: [D] int;
[a in A] a = a.locale.id;
Is [a in A] equivalent to forall a in A a = a.locale.id?
For the most part, yes. In Chapel, [a in A] expr can be thought of as a shorthand for forall a in A do expr. However, there is a slight difference in that if A does not support parallel iteration, the forall form will generate a compile-time error whereas the [a in A] form will fall back to serial iteration.
With respect to the title of this question, note that this behavior is independent of whether or not A is distributed. For example, you could also write [i in 1..n] rather than forall i in 1..n do even though ranges like 1..n are never distributed in Chapel.
Array types in Chapel, like [D] real can similarly be read as "for all indices in D, allocate an element of type real."

Adding 2 Int Lists Together F#

I am working on homework and the problem is where we get 2 int lists of the same size, and then add the numbers together. Example as follows.
vecadd [1;2;3] [4;5;6];; would return [5;7;9]
I am new to this and I need to keep my code pretty simple so I can learn from it. I have this so far. (Not working)
let rec vecadd L K =
if L <> [] then vecadd ((L.Head+K.Head)::L) K else [];;
I essentially want to just replace the first list (L) with the added numbers. Also I have tried to code it a different way using the match cases.
let rec vecadd L K =
match L with
|[]->[]
|h::[]-> L
|h::t -> vecadd ((h+K.Head)::[]) K
Neither of them are working and I would appreciate any help I can get.
First, your idea about modifying the first list instead of returning a new one is misguided. Mutation (i.e. modifying data in place) is the number one reason for bugs today (used to be goto, but that's been banned for a long time now). Making every operation produce a new datum rather than modify existing ones is much, much safer. And in some cases it may be even more performant, quite counterintuitively (see below).
Second, the way you're trying to do it, you're not doing what you think you're doing. The double-colon doesn't mean "modify the first item". It means "attach an item in front". For example:
let a = [1; 2; 3]
let b = 4 :: a // b = [4; 1; 2; 3]
let c = 5 :: b // c = [5; 4; 1; 2; 3]
That's how lists are actually built: you start with a empty list and prepend items to it. The [1; 2; 3] syntax you're using is just a syntactic sugar for that. That is, [1; 2; 3] === 1::2::3::[].
So how do I modify a list, you ask? The answer is, you don't! F# lists are immutable data structures. Once you've created a list, you can't modify it.
This immutability allows for an interesting optimization. Take another look at the example I posted above, the one with three lists a, b, and c. How many cells of memory do you think these three lists occupy? The first list has 3 items, second - 4, and third - 5, so the total amount of memory taken must be 12, right? Wrong! The total amount of memory taken up by these three lists is actually just 5 cells. This is because list b is not a block of memory of length 4, but rather just the number 4 paired with a pointer to the list a. The number 4 is called "head" of the list, and the pointer is called its "tail". Similarly, the list c consists of one number 5 (its "head") and a pointer to list b, which is its "tail".
If lists were not immutable, one couldn't organize them like this: what if somebody modifies my tail? Lists would have to be copied every time (google "defensive copy").
So the only way to do with lists is to return a new one. What you're trying to do can be described like this: if the input lists are empty, the result is an empty list; otherwise, the result is the sum of tails prepended with the sum of heads. You can write this down in F# almost verbatim:
let rec add a b =
match a, b with
| [], [] -> [] // sum of two empty lists is an empty list
| a::atail, b::btail -> (a + b) :: (add atail btail) // sum of non-empty lists is sum of their tails prepended with sum of their heads
Note that this program is incomplete: it doesn't specify what the result should be when one input is empty and the other is not. The compiler will generate a warning about this. I'll leave the solution as an exercise for the reader.
You can map over both lists together with List.map2 (see the docs)
It goes over both lists pairwise and you can give it a function (the first parameter of List.map2) to apply to every pair of elements from the lists. And that generates the new list.
let a = [1;2;3]
let b = [4;5;6]
let vecadd = List.map2 (+)
let result = vecadd a b
printfn "%A" result
And if you want't to do more work 'yourself' something like this?
let a = [1;2;3]
let b = [4;5;6]
let vecadd l1 l2 =
let rec step l1 l2 acc =
match l1, l2 with
| [], [] -> acc
| [], _ | _, [] -> failwithf "one list is bigger than the other"
| h1 :: t1, h2 :: t2 -> step t1 t2 (List.append acc [(h1 + h2)])
step l1 l2 []
let result = vecadd a b
printfn "%A" result
The step function is a recursive function that takes two lists and an accumulator to carry the result.
In the last match statement it does three things
Sum the head of both lists
Add the result to the accumulator
Recursively call itself with the new accumulator and the tails of the lists
The first match returns the accumulator when the remaining lists are empty
The second match returns an error when one of the lists is longer than the other.
The accumulator is returned as the result when the remaining lists are empty.
The call step l1 l2 [] kicks it off with the two supplied lists and an empty accumulator.
I have done this for crossing two lists (multiply items with same index together):
let items = [1I..50_000I]
let another = [1I..50_000I]
let rec cross a b =
let rec cross_internal = function
| r, [], [] -> r
| r, [], t -> r#t
| r, t, [] -> r#t
| r, head::t1, head2::t2 -> cross_internal(r#[head*head2], t1, t2)
cross_internal([], a, b)
let result = cross items another
result |> printf "%A,"
Note: not really performant. There are list object creations at each step which is horrible. Ideally the inner function cross_internal must create a mutable list and keep updating it.
Note2: my ranges were larger initially and using bigint (hence the I suffix in 50_000) but then reduced the sample code above to just 50,500 elements.

How to reverse a singly linked list recursively without modifying the pointers?

Recently I had an interview and the interviewer asked me to reverse a singly linked list without modifying the pointers(change the values only).
At the beginning I came up with a solution using a stack. He said that was OK and wanted me to do it recursively. Then I gave him a O(n^2) solution. But he said he needs a O(n) solution.
Anyone can help me?
Pseudocode
reverse (list):
reverse2 (list, list)
reverse2 (a, b):
if b != nil:
a = reverse2 (a, b.next)
if a != nil:
swap (a.data, b.data)
if a == b || a.next == b:
# we've reached the middle of the list, tell the caller to stop
return nil
else:
return a.next
else:
# the recursive step has returned nil, they don't want us to do anything
return nil
else:
# we've reached the end of the list, start working!
return a
One way I can think of doing it is recursing to the end accumulating the values in another list as you resurse to the end, then on the way out of the recursion writing the values back starting with the 1st value in the list. It would be O(2n). It's not much different from using a stack...
list = { a => b => c => d }
def recursive(acc, x)
if !x
return acc
end
acc.preprend(x)
return recursive(acc, x.next)
end
result = recursive([], list.first)
So first call is recursive([], a). result is now [a].
Second call is recursive([a], b). result turns into [b, a].
Third call is recursive([b, a], c). result is [c, b, a].
Fourth call is recursive([c, b, a], d), and result [d, c, b, a].
Fifth call gets caught by the if !x.
Tell your interviewer you need an additional structure, like someone else said above.

Finding unique (as in only occurring once) element haskell

I need a function which takes a list and return unique element if it exists or [] if it doesn't. If many unique elements exists it should return the first one (without wasting time to find others).
Additionally I know that all elements in the list come from (small and known) set A.
For example this function does the job for Ints:
unique :: Ord a => [a] -> [a]
unique li = first $ filter ((==1).length) ((group.sort) li)
where first [] = []
first (x:xs) = x
ghci> unique [3,5,6,8,3,9,3,5,6,9,3,5,6,9,1,5,6,8,9,5,6,8,9]
ghci> [1]
This is however not good enough because it involves sorting (n log n) while it could be done in linear time (because A is small).
Additionally it requires the type of list elements to be Ord while all which should be needed is Eq. It would also be nice if amount of comparisons was as small as possible (ie if we traverse a list and encounter element el twice we don't test subsequent elements for equality with el)
This is why for example this: Counting unique elements in a list doesn't solve the problem - all answers involve either sorting or traversing the whole list to find count of all elements.
The question is: how to do it correctly and efficiently in Haskell ?
Okay, linear time, from a finite domain. The running time will be O((m + d) log d), where m is the size of the list and d is the size of the domain, which is linear when d is fixed. My plan is to use the elements of the set as the keys of a trie, with the counts as values, then look through the trie for elements with count 1.
import qualified Data.IntTrie as IntTrie
import Data.List (foldl')
import Control.Applicative
Count each of the elements. This traverses the list once, builds a trie with the results (O(m log d)), then returns a function which looks up the result in the trie (with running time O(log d)).
counts :: (Enum a) => [a] -> (a -> Int)
counts xs = IntTrie.apply (foldl' insert (pure 0) xs) . fromEnum
where
insert t x = IntTrie.modify' (fromEnum x) (+1) t
We use the Enum constraint to convert values of type a to integers in order to index them in the trie. An Enum instance is part of the witness of your assumption that a is a small, finite set (Bounded would be the other part, but see below).
And then look for ones that are unique.
uniques :: (Eq a, Enum a) => [a] -> [a] -> [a]
uniques dom xs = filter (\x -> cts x == 1) dom
where
cts = counts xs
This function takes as its first parameter an enumeration of the entire domain. We could have required a Bounded a constraint and used [minBound..maxBound] instead, which is semantically appealing to me since finite is essentially Enum+Bounded, but quite inflexible since now the domain needs to be known at compile time. So I would choose this slightly uglier but more flexible variant.
uniques traverses the domain once (lazily, so head . uniques dom will only traverse as far as it needs to to find the first unique element -- not in the list, but in dom), for each element running the lookup function which we have established is O(log d), so the filter takes O(d log d), and building the table of counts takes O(m log d). So uniques runs in O((m + d) log d), which is linear when d is fixed. It will take at least Ω(m log d) to get any information from it, because it has to traverse the whole list to build the table (you have to get all the way to the end of the list to see if an element was repeated, so you can't do better than this).
There really isn't any way to do this efficiently with just Eq. You'd need to use some much less efficient way to build the groups of equal elements, and you can't know that only one of a particular element exists without scanning the whole list.
Also, note that to avoid useless comparisons you'd need a way of checking to see if an element has been encountered before, and the only way to do that would be to have a list of elements known to have multiple occurrences, and the only way to check if the current element is in that list is... to compare it for equality with each.
If you want this to work faster than O(something really horrible) you need that Ord constraint.
Ok, based on the clarifications in comments, here's a quick and dirty example of what I think you're looking for:
unique [] _ _ = Nothing
unique _ [] [] = Nothing
unique _ (r:_) [] = Just r
unique candidates results (x:xs)
| x `notElem` candidates = unique candidates results xs
| x `elem` results = unique (delete x candidates) (delete x results) xs
| otherwise = unique candidates (x:results) xs
The first argument is a list of candidates, which should initially be all possible elements. The second argument is the list of possible results, which should initially be empty. The third argument is the list to examine.
If it runs out of candidates, or reaches the end of the list with no results, it returns Nothing. If it reaches the end of the list with results, it returns the one at the front of the result list.
Otherwise, it examines the next input element: If it's not a candidate, it ignores it and continues. If it's in the result list we've seen it twice, so remove it from the result and candidate lists and continue. Otherwise, add it to the results and continue.
Unfortunately, this still has to scan the entire list for even a single result, since that's the only way to be sure it's actually unique.
First off, if your function is intended to return at most one element, you should almost certainly use Maybe a instead of [a] to return your result.
Second, at minimum, you have no choice but to traverse the entire list: you can't tell for sure if any given element is actually unique until you've looked at all the others.
If your elements are not Ordered, but can only be tested for Equality, you really have no better option than something like:
firstUnique (x:xs)
| elem x xs = firstUnique (filter (/= x) xs)
| otherwise = Just x
firstUnique [] = Nothing
Note that you don't need to filter out the duplicated elements if you don't want to -- the worst case is quadratic either way.
Edit:
The above misses the possibility of early exit due to the above-mentioned small/known set of possible elements. However, note that the worst case will still require traversing the entire list: all that is necessary is for at least one of these possible elements to be missing from the list...
However, an implementation that provides an early out in case of set exhaustion:
firstUnique = f [] [<small/known set of possible elements>] where
f [] [] _ = Nothing -- early out
f uniques noshows (x:xs)
| elem x uniques = f (delete x uniques) noshows xs
| elem x noshows = f (x:uniques) (delete x noshows) xs
| otherwise = f uniques noshows xs
f [] _ [] = Nothing
f (u:_) _ [] = Just u
Note that if your list has elements which shouldn't be there (because they aren't in the small/known set), they will be pointedly ignored by the above code...
As others have said, without any additional constraints, you can't do this in less than quadratic time, because without knowing something about the elements, you can't keep them in some reasonable data structure.
If we are able to compare elements, an obvious O(n log n) solution to compute the count of elements first and then find the first one with count equal to 1:
import Data.List (foldl', find)
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Maybe (fromMaybe)
count :: (Ord a) => Map a Int -> a -> Int
count m x = fromMaybe 0 $ Map.lookup x m
add :: (Ord a) => Map a Int -> a -> Map a Int
add m x = Map.insertWith (+) x 1 m
uniq :: (Ord a) => [a] -> Maybe a
uniq xs = find (\x -> count cs x == 1) xs
where
cs = foldl' add Map.empty xs
Note that the log n factor comes from the fact that we need to operate on a Map of size n. If the list has only k unique elements then the size of our map will be at most k, so the overall complexity will be just O(n log k).
However, we can do even better - we can use a hash table instead of a map to get an O(n) solution. For this we'll need the ST monad to perform mutable operations on the hash map, and our elements will have to be Hashable. The solution is basically the same as before, just a little bit more complex due to working within the ST monad:
import Control.Monad
import Control.Monad.ST
import Data.Hashable
import qualified Data.HashTable.ST.Basic as HT
import Data.Maybe (fromMaybe)
count :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s Int
count ht x = liftM (fromMaybe 0) (HT.lookup ht x)
add :: (Eq a, Hashable a) => HT.HashTable s a Int -> a -> ST s ()
add ht x = count ht x >>= HT.insert ht x . (+ 1)
uniq :: (Eq a, Hashable a) => [a] -> Maybe a
uniq xs = runST $ do
-- Count all elements into a hash table:
ht <- HT.newSized (length xs)
forM_ xs (add ht)
-- Find the first one with count 1
first (\x -> liftM (== 1) (count ht x)) xs
-- Monadic variant of find which exists once an element is found.
first :: (Monad m) => (a -> m Bool) -> [a] -> m (Maybe a)
first p = f
where
f [] = return Nothing
f (x:xs') = do
b <- p x
if b then return (Just x)
else f xs'
Notes:
If you know that there will be only a small number of distinct elements in the list, you could use HT.new instead of HT.newSized (length xs). This will save you some memory and one pass over xs but in the case of many distinct elements the hash table will be have to resized several times.
Here is a version that does the trick:
unique :: Eq a => [a] -> [a]
unique = select . collect []
where
collect acc [] = acc
collect acc (x : xs) = collect (insert x acc) xs
insert x [] = [[x]]
insert x (ys#(y : _) : yss)
| x == y = (x : ys) : yss
| otherwise = ys : insert x yss
select [] = []
select ([x] : _) = [x]
select ((_ : _) : xss) = select xss
So, first we traverse the input list (collect) while maintaining a list of buckets of equal elements that we update with insert. Then we simply select the first element that appears in a singleton bucket (select).
The bad news is that this takes quadratic time: for every visited element in collect we need to go over the list of buckets. I am afraid that is the price you will have to pay for only being able to constrain the element type to be in Eq.
Something like this look pretty good.
unique = fst . foldl' (\(a, b) c -> if (c `elem` b)
then (a, b)
else if (c `elem` a)
then (delete c a, c:b)
else (c:a, b)) ([],[])
The first element of the resulted tuple of the fold, contain what you are expecting, a list containing unique element. The second element of the tuple is the memory of the process remembered if an element has already been discarded or not.
About space performance.
As your problem is design, all the element of the list should be traversed at least one time, before a result can be display. And the internal algorithm must keep trace of discarded value in addition to the good one, but discarded value will appears only one time. Then in the worst case the required amount of memory is equal to the size of the inputted list. This sound goods as you said that expected input are small.
About time performance.
As the expected input are small and not sorted by default, trying to sort the list into the algorithm is useless, or before to apply it is useless. In fact statically we can almost said, that the extra operation to place an element at its ordered place (into the sub list a and b of the tuple (a,b)) will cost the same amount of time than to check if this element appear into the list or not.
Below a nicer and more explicit version of the foldl' one.
import Data.List (foldl', delete, elem)
unique :: Eq a => [a] -> [a]
unique = fst . foldl' algorithm ([], [])
where
algorithm (result0, memory0) current =
if (current `elem` memory0)
then (result0, memory0)
else if (current`elem` result0)
then (delete current result0, memory)
else (result, memory0)
where
result = current : result0
memory = current : memory0
Into the nested if ... then ... else ... instruction the list result is traversed twice in the worst case, this can be avoid using the following helper function.
unique' :: Eq a => [a] -> [a]
unique' = fst . foldl' algorithm ([], [])
where
algorithm (result, memory) current =
if (current `elem` memory)
then (result, memory)
else helper current result memory []
where
helper current [] [] acc = ([current], [])
helper current [] memory acc = (acc, memory)
helper current (r:rs) memory acc
| current == r = (acc ++ rs, current:memory)
| otherwise = helper current rs memory (r:acc)
But the helper can be rewrite using fold as follow, which is definitely nicer.
helper current [] _ = ([current],[])
helper current memory result =
foldl' (\(r, m) x -> if x==current
then (r, current:m)
else (current:r, m)) ([], memory) $ result

Resources