Most commonly occurring combination - algorithm

I have a list of integer arrays, where each array holds some numbers in sorted order. I want to find the most commonly occurring combination (sequence of integers) across all the arrays. For example, if the list of arrays is as follows
A1 - 1 2 3 5 7 8
A2 - 2 3 5 6 7
A3 - 3 5 7 9
A4 - 1 2 3 7 9
A5 - 3 5 7 10
Here
{3,5,7} - {A1,A3,A5}
{2,3} - {A1,A2,A4}
So we can take {3,5,7} or {2,3} as the most commonly occurring combinations.
Now the algorithm I used is as follows:
Find the intersection of each set with every other set, and store each resulting set. If a resulting set already exists, increment its occurrence count.
For example:
Find the intersections of all the pairs below:
A1 intersection A2
A1 intersection A3
A1 intersection A4
A1 intersection A5
A2 intersection A3
A2 intersection A4
A2 intersection A5
A3 intersection A4
A3 intersection A5
A4 intersection A5
Here A1 intersection A3 is the same as A3 intersection A5, hence the occurrence count of the set {3,5,7} can be set to 2.
Similarly, the occurrence count of each resulting set can be determined.
But this algorithm demands O(n^2) complexity.
Assuming each set is sorted, I am pretty sure that we can find a better, O(n) algorithm, which I am not able to pen down.
Can anyone suggest an O(n) algorithm for this?
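For reference, a minimal Haskell sketch of this quadratic baseline (the helper names pairIntersections and countOccurrences are mine, purely illustrative):
import qualified Data.Map as M
import Data.List (intersect, tails)

-- All pairwise intersections Ai `intersect` Aj for i < j.
pairIntersections :: Eq a => [[a]] -> [[a]]
pairIntersections xss = [ xs `intersect` ys | (xs:rest) <- tails xss, ys <- rest ]

-- Count how often each resulting set occurs, as described above.
countOccurrences :: Ord a => [[a]] -> M.Map [a] Int
countOccurrences = M.fromListWith (+) . map (\s -> (s, 1)) . pairIntersections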

If you have a sequence of length n, then its prefix of length n-1 occurs at least as often - a degenerate case is the most common character, which is a sequence of length 1 that occurs at least as often as any longer sequence. Do you have a minimum sequence length you are interested in?
Regardless of this, one idea is to concatenate all of the sequences, separating them by distinct integers which appear nowhere else, and then compute the suffix array (http://en.wikipedia.org/wiki/Suffix_array) in linear time. One pass through the suffix array should allow you to find the most common subsequence of any given length - and it cannot cross the gap between two different arrays, because the separating integers are unique, so any window containing one occurs exactly once. (See also the LCP array: http://en.wikipedia.org/wiki/LCP_array)
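To make the separator idea concrete, here is a small Haskell sketch of my own - deliberately naive rather than linear-time, counting fixed-length windows directly instead of building a suffix array. Each array gets a unique negative sentinel appended, so no window crossing an array boundary survives the filter:
import Data.List (tails, sort, group, sortOn)
import Data.Ord (Down(..))

-- Most common contiguous run of the given length across all arrays.
mostCommonOfLength :: Int -> [[Int]] -> ([Int], Int)
mostCommonOfLength len xss =
    head
  . sortOn (Down . snd)
  . map (\g -> (head g, length g))
  . group
  . sort
  . filter (all (>= 0))          -- drop windows containing a sentinel
  . filter ((== len) . length)
  . map (take len)
  . tails
  $ concat (zipWith (\i xs -> xs ++ [negate i]) [1 ..] xss)
With the question's data, mostCommonOfLength 3 [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]] evaluates to ([3,5,7],3).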

This example in Haskell does not scan intersections. Rather, it lists the contiguous sub-sequences of each list and aggregates them into a map indexed by sub-sequence. To look up the most commonly occurring sub-sequence, simply take the entry with the longest list of indexes. The output is filtered to keep only sub-sequences of length greater than 1 that appear in more than one list; it is a list of tuples pairing each sub-sequence with the indexes of the lists where it appears:
*Main> combFreq [[1,2,3,5,7,8],[2,3,5,6,7],[3,5,7,9],[1,2,3,7,9],[3,5,7,10]]
[([3,5],[4,2,1,0]),([5,7],[4,2,0]),([3,5,7],[4,2,0]),([2,3],[3,1,0]),([7,9],[3,2]),([2,3,5],[1,0]),([1,2,3],[3,0]),([1,2],[3,0])]
import Data.List
import qualified Data.Map as M
import Data.Function (on)

-- Pair every contiguous sub-sequence of every list with its list index.
sInt xs = concat $ zipWith (\x y -> zip (subs x) (repeat y)) xs [0..]
  where subs = filter (not . null) . concatMap inits . tails

-- Aggregate into a map from sub-sequence to the indexes it appears at.
res xs = foldl' (\x y -> M.insertWith (++) (fst y) [snd y] x) M.empty (sInt xs)

-- Keep sub-sequences of length > 1 appearing in more than one list,
-- most frequent first.
combFreq xs = reverse $ sortBy (compare `on` (length . snd))
                      . filter (not . null . drop 1 . snd)
                      . filter (not . null . drop 1 . fst)
                      . M.toList
                      . res $ xs

Modular run-length encoding

Question
How can one implement run-length encoding with modulus n >= 1? For n = 4, given the input AAABBBBABCCCCBBBDAAA, we want the output [('D', 1), ('A', 3)]. Note the long-distance merging due to the modulus operation. See Explanation.
Explanation
The first occurrence of BBBB encodes to (B, 4), whose modulus 4 is (B, 0), thus canceling itself out. See the diagram (ignore spaces; they are simply for illustrative purposes):
AAABBBBABCCCCBBBDAAA
A3 B4 ABCCCCBBBDAAA
A3 B0 ABCCCCBBBDAAA
A3 ABCCCCBBBDAAA
A4 BCCCCBBBDAAA
A0 BCCCCBBBDAAA
BCCCCBBBDAAA
...
DA3
A simpler example where no merging happens, since nothing gets canceled by modulus 4: the input AAABABBBC produces the output [('A',3),('B',1),('A',1),('B',3),('C',1)].
Requirements
Haskell implementations are preferred but others are welcome too!
Prefer standard/common library functions over 3rd party libraries.
Prefer readable and succinct programs utilizing higher-order functions.
Prefer efficiency (do not loop over the whole list when unnecessary).
My program
I implemented this in Haskell, but it looks too verbose and is awful to read. The key idea is to examine three tuples at a time, and only advance one tuple forward if we can neither drop a zero-count tuple nor merge a pair of tuples among the three at hand.
import Data.List (group)

test = [('A', 1), ('A', 2), ('B', 2), ('B', 2), ('A', 1), ('B', 1), ('C', 1), ('C', 3), ('B', 3), ('D', 1), ('A', 3)] :: [(Char, Int)]
expected = [('D', 1), ('A', 3)] :: [(Char, Int)]

reduce' :: [(Char, Int)] -> [(Char, Int)]
reduce' []            = []                  -- exit
reduce' ((_,0):xs)    = reduce' xs
reduce' (x1:(_,0):xs) = reduce' (x1:xs)
reduce' ((x,n):[])    = (x,n):[]            -- exit
reduce' ((x1,n1):(x2,n2):[])                -- [previous,current,NONE]
  | x1 == x2  = reduce' ((x1, d4 (n1+n2)):[])
  | otherwise = (x1,n1) :                   -- advance
                reduce' ((x2, d4 n2):[])
reduce' ((x1,n1):(x2,n2):(x3,n3):xs)        -- [previous,current,next]
  | n3 == 0   = reduce' ((x1, d4 n1):(x2, d4 n2):xs)
  | n2 == 0   = reduce' ((x1, d4 n1):(x3, d4 n3):xs)
  | x2 == x3  = reduce' ((x1, d4 n1):(x2, d4 (n2+n3)):xs)
  | x1 == x2  = reduce' ((x2, d4 (n1+n2)):(x3, d4 n3):xs)
  | otherwise = (x1,n1) :                   -- advance
                reduce' ((x2, d4 n2):(x3, d4 n3):xs)

-- Helpers
flatten :: [(Char, Int)] -> String
flatten nested = concat $ (\(x, n) -> replicate n x) <$> nested

nest :: String -> [(Char, Int)]
nest flat = zip (head <$> xg) (d4 . length <$> xg)
  where xg = group flat

reduce = reduce' . nest
d4 = (`rem` 4)
Thoughts
My inputs are like the test variable in the snippet above. We could keep applying flatten then nest until the result stops changing, which would definitely look simpler. But it feels like that scans the whole list many times, while my 3-pointer implementation scans the whole list only once. Maybe we could pop an element from the left and push it onto a new stack while merging identical consecutive items? Or maybe use applicative functors? E.g. this works, but I am not sure about its efficiency/performance: reduce = (until =<< ((==) =<<)) (nest . flatten).
Algorithm
I think you are making this problem much harder by thinking of it in terms of character strings at all. Instead, do a preliminary pass that just does the boring RLE part. This way, a second pass is comparatively easy, because you can work in "tokens" that represent a run of a certain length, instead of having to work one character at a time.
The only data structure we need to maintain as we do the second pass through the list is a stack, and we only ever need to look at its top element. We compare each token that we're examining with the top of the stack. If they're the same, we blend them into a single token representing their concatenation; otherwise, we simply push the next token onto the stack. In either case, we reduce token sizes mod N and drop tokens with size 0.
Performance
This algorithm runs in linear time: it processes each input token exactly once, and does a constant amount of work for each token.
It cannot produce output lazily. There is no prefix of the input that is sufficient to confidently produce a prefix of the output, so we have to wait till we have consumed the entire input to produce any output. Even something that "looks bad" like ABCABCABCABCABC can eventually be cancelled out if the rest of the string is CCCBBBAAA....
The reverse at the end is a bummer, but amortized over all the tokens it is quite cheap, and in any case does not worsen our linear-time guarantee. It likewise does not change our space requirements, since we already require O(N) space to buffer the output (since as the previous note says, it's never possible to emit a partial result).
Correctness
Writing down my remark about laziness made me think of your reduce solution, which appears to produce output lazily, which I thought was impossible. The explanation, it turns out, is that your implementation is not just inelegant, as you say, but also incorrect. It produces output too soon, missing chances to cancel with later elements. The simplest test case I can find that you fail is reduce "ABABBBBAAABBBAAA" == [('A',1),('A',3)], where the correct result is []. We can confirm that this is due to yielding results too early by noting that take 1 $ reduce ("ABAB" ++ undefined) yields [('A',1)] even though elements might come later that cancel with that first A.
Minutiae
Finally note that I use a custom data type Run just to give a name to the concept; of course you can convert this to a tuple cheaply, or rewrite the function to use tuples internally if you prefer.
Implementation
import Data.List (group)

data Run a = Run Int a deriving Show

modularRLE :: Eq a => Int -> [a] -> [Run a]
modularRLE groupSize = go [] . tokenize
  where
    go stack [] = reverse stack
    go stack (Run n x : remainder) = case stack of
        []               -> go (blend n []) remainder
        (Run m y : prev) | x == y    -> go (blend (n + m) prev) remainder
                         | otherwise -> go (blend n stack) remainder
      where blend i s = case i `mod` groupSize of
              0 -> s
              j -> Run j x : s
    tokenize xs = [Run (length run) x | run@(x:_) <- group xs]
λ> modularRLE 4 "AAABBBBABCCCCBBBDAAA"
[Run 1 'D',Run 3 'A']
λ> modularRLE 4 "ABABBBBAAABBBAAA"
[]
My first observation will be that you only need to code one step of the resolution, since you can pass that step to a function that will feed it its own output until it stabilizes. This function was discussed in this SO question and was given a clever answer by @galva:
--from https://stackoverflow.com/a/23924238/7096763
converge :: Eq a => (a -> a) -> a -> a
converge = until =<< ((==) =<<)
This is the entrypoint of the algorithm (group comes from Data.List, and liftA2 from Control.Applicative on GHCs where it is not yet in the Prelude):
-- |-------------step----------------------| |---------------process------|
solve = converge (filter (not . isFullTuple) . collapseOne) . fmap (liftA2 (,) head length) . group
Starting from the end of the line and moving backwards (following the order of execution), we first process a String into a [(Char, Int)] using fmap (liftA2 (,) head length) . group. Then we get to a bracketed block that contains our step function. The collapseOne takes a list of tuples and collapses at most one pair of tuples, deleting the resulting tuple if necessary (if mod 4 == 0) ([('A', 1), ('A', 2), ('B', 2)] ==> [('A', 3), ('B', 2)]):
collapseOne [x] = [x]
collapseOne []  = []
collapseOne (l:r:xs)
  | fst l == fst r = (fst l, (snd l + snd r) `mod` 4) : xs
  | otherwise      = l : collapseOne (r:xs)
You also need to know if tuples are "ripe" and need to be filtered out:
isFullTuple = (==0) . (`mod` 4) . snd
I would argue that these 8 lines of code are significantly easier to read.
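As a quick check, putting converge, solve, collapseOne and isFullTuple together in GHCi should reproduce the examples from the question:
λ> solve "AAABBBBABCCCCBBBDAAA"
[('D',1),('A',3)]
λ> solve "ABABBBBAAABBBAAA"
[]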

What do you call the property of a list that describes the degree to which it contains duplicates?

I have a function that selects, from the Cartesian product of a list of lists, the tuples whose number of duplicate elements is highest (i.e. whose number of distinct elements is lowest):
import Data.List (nub)

f :: Eq a => [[a]] -> [[a]]
f xss = filter ((==) minLength . length . nub) cartProd
  where
    minLength = minimum (map (length . nub) cartProd)
    cartProd  = sequence xss
For example:
*Main> f [[1,2,3],[4],[1,5]]
[[1,4,1]]
But:
*Main> sequence [[1,2,3],[4],[1,5]]
[[1,4,1],[1,4,5],[2,4,1],[2,4,5],[3,4,1],[3,4,5]]
Is there a name for the property that the result of my function f has?
I believe your function is computing a minimum set cover:
Given a set of elements {1, 2, ..., n} (called the universe) and a collection S of sets whose union equals the universe, the set cover problem is to identify the smallest sub-collection of S whose union equals the universe.
In your case, n is length xss. There is one set in S for each distinct element x of concat xss, namely the set { i | x `elem` (xss !! i) } of all indices that x occurs in. The minimum set cover then tells you which x to choose from each list in xss (sometimes giving you multiple choices; any choice will produce the same final nubbed length).
Here is a worked example for your [[1,2,3],[4],[1,5]]:
The universe is {1,2,3}.
There are five sets in the collection S; I'll name them S_1 through S_5:
S_1 = {1,3} because the first and third lists contain 1.
S_2 = {1} because the first list contains 2.
S_3 = {1} because the first list contains 3.
S_4 = {2} because the second list contains 4.
S_5 = {3} because the third list contains 5.
A minimum set cover for this is {S_1, S_4}. Because this is a set cover, this means every list contains either 1 or 4. Because it is minimal, no other choice of sets produces a smaller final collection of values. So, we can choose either 1 or 4 from each list to produce a final answer. As it happens, no list contains both 1 and 4 so there is only one choice, namely [1,4,1].
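If you want to experiment with this reduction, here is a small Haskell sketch of my own (indexSets is a made-up name, not a standard function) that builds the collection S from the input:
import Data.List (nub)

-- For each distinct value x in the input, the set of (1-based) indices
-- of the lists that contain x.
indexSets :: Eq a => [[a]] -> [(a, [Int])]
indexSets xss = [ (x, [ i | (i, xs) <- zip [1 ..] xss, x `elem` xs ])
                | x <- nub (concat xss) ]
For [[1,2,3],[4],[1,5]] it yields [(1,[1,3]),(2,[1]),(3,[1]),(4,[2]),(5,[3])] - exactly S_1 through S_5 above.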

Proposing an algorithm for arbitrary shape Bit Matrix Transposition with BDD-like structure

We consider an n x m bit matrix to be a regular array containing n lines, each line being an integer of m bits.
I have looked in Hacker's Delight and in other sources and the algorithms I found for this were rather specialized: square matrices with powers of two sizes like 8x8, 32x32, 64x64, etc. (which is normal because the machine is built that way).
I thought of a more general algorithm (for arbitrary n and m) which is, in the worst case, of the expected complexity (I think), but for matrices containing mostly similar columns, or more zeros than ones, the algorithm seems a bit more interesting (in the extreme, it is linear if the matrix contains the same line over and over). It follows a sort of Binary Decision Diagram manipulation.
The output is not a transposed matrix but a compressed transposed matrix: a list of pairs (V,L) where L is an int_m that indicates the lines of the transposed matrix (by setting the bits of the corresponding position) that should contain the int_n V. The lines of the transposed matrix not appearing in any of the pairs are filled with 0.
For example, for the matrix
1010111
1111000
0001010
having the transposed
110
010
110
011
100
101
100
the algorithm outputs:
(010,0100000)
(011,0001000)
(100,0000101)
(101,0000010)
(110,1010000)
and one reads the pair (100,0000101) as meaning "the value 100 is put in the 5th and the 7th line of the transposed matrix".
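To pin the format down, here is a small Haskell sketch of my own (decompress is a made-up helper, not part of the proposal) that expands the (V, L) pairs back into the m lines of the transposed matrix:
import Data.Bits (shiftR)

-- Line j of the transposed matrix gets the value v of the pair whose
-- mask l has bit (m-1-j) set (masks read left to right, as in the
-- example; at most one pair covers each line), and 0 if none does.
decompress :: Int -> [(Integer, Integer)] -> [Integer]
decompress m pairs =
  [ sum [ v | (v, l) <- pairs, odd (l `shiftR` (m - 1 - j)) ]
  | j <- [0 .. m - 1] ]
Writing the example's pairs in decimal, decompress 7 [(2,32),(3,8),(4,5),(5,2),(6,80)] gives [6,2,6,3,4,5,4], i.e. the transposed lines 110, 010, 110, 011, 100, 101, 100.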
This is the algorithm, written in a pseudo-OCaml/C.
We will run according to triples (index_of_current_line, V, L) of type (int, int_n, int_m), where int_n is the type of n-bit-wide integers and int is just a machine integer wide enough to hold n.
The function takes a list of these triples, the matrix, the number of lines, and an accumulator for the output (a list of pairs (int_n, int_m)); at some point it returns that accumulator.
list of (int_n, int_m) transpose(list of triple t,
                                 int_m[n] mat,
                                 int n,
                                 list of (int_n, int_m) acc)
The first call of the transpose function is
transpose([(0, 0, 2^m-1)], mat, n, []).
take "&", "|" "xor" to be the usual bit-wise operations
transpose(t, mat, n, acc) =
  match t with
  | [] -> (* the list is empty, we're done *)
      return acc
  | (i, v, l)::tt ->
      let colIn = mat[i] & l in
      (* colIn contains the positions that were set in the parent's mask "l"
         and that are also set in the line "i" *)
      match colIn with
      | 0 -> (* none of the positions are set in both, do not branch *)
          if (i < n) then (* not done with the matrix, simply move to the next line *)
            transpose((i+1, v, l)::tt, mat, n, acc)
          else (* we reached the end of the matrix, we're at a leaf *)
            if (v > 0) then
              transpose(tt, mat, n, (v, l)::acc)
            else (* we ignore the null values and continue *)
              transpose(tt, mat, n, acc)
      | _ -> (* colIn is non-null, i.e. some of the positions set in the parent
                mask "l" are also set in this line. If ALL the positions are, we
                do not branch either. If only some of them are and some of them
                are not, we branch *)
          (* First, update v *)
          let vv = v | (2^(n-i-1)) in
          (* Then get the mask for the other branch *)
          let colOut = colIn xor l in
          match colOut with
          | 0 -> (* all are in, none are out, no need to branch *)
              if (i < n) then
                transpose((i+1, vv, colIn)::tt, mat, n, acc)
              else (* we reached the end of the matrix, we're at a leaf *)
                transpose(tt, mat, n, (vv, colIn)::acc)
          | _ -> (* some in, some out: now we branch *)
              if (i < n) then
                transpose((i+1, vv, colIn)::(i+1, v, colOut)::tt, mat, n, acc)
              else
                if (v > 0) then
                  transpose(tt, mat, n, (vv, colIn)::(v, colOut)::acc)
                else
                  transpose(tt, mat, n, (vv, colIn)::acc)
Notice that if the matrix is wider than it is high, it is even faster (if n = 3 and m = 64, for instance).
My questions are:
Is this interesting and/or useful?
Am I reinventing the wheel?
Are "almost zero" matrices or "little differentiated-lines" matrices common enough for this to be interesting ?
PS: If anything doesn't seem clear, please do tell, I will rewrite whatever needs to be!

Efficiently find lowest sum paths

This is a big ask but I'm a bit stuck!
I am wondering if there is a name for this problem, or a similar one.
I am probably overcomplicating the search for a solution, but I can't think of a way to do it without a full brute-force exhaustive search (my current implementation). This is not acceptable for the application involved.
I am wondering if there are any ways of simplifying this problem, or implementation strategies I could employ (language/tool choice is open).
Here is a quick description of the problem:
Given n sequences of length k:
a = [0, 1, 1] == [a1, a2, a3]
b = [1, 0, 2] == [b1, b2, b3]
c = [0, 0, 2] == [c1, c2, c3]
find paths of length k through the sequences, like so (I'll give examples starting at a1, but hopefully you get the idea; the same paths need to be derived from b1 and c1):
a1 -> a2 -> a3
a1 -> b1 -> b2
a1 -> b1 -> a2
a1 -> b1 -> c1
a1 -> c1 -> c2
a1 -> c1 -> a2
a1 -> c1 -> b1
I want to know which path(s) are going to have the lowest sum:
a1 -> a2 -> a3 == 2
a1 -> b1 -> b2 == 1
a1 -> b1 -> a2 == 2
a1 -> b1 -> c1 == 1
a1 -> c1 -> c2 == 0
a1 -> c1 -> a2 == 1
a1 -> c1 -> b1 == 1
So in this case, out of the sample, a1 -> c1 -> c2 is the lowest.
EDIT:
Sorry, just to clear up the rules for deriving the path.
For example you can move from node a1 to b2 if you haven't already exhausted b2, and have exhausted the previous node in that sequence (b1).
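A minimal Haskell sketch of the search space as I read these rules (the names paths and lowestPath are mine; this is the same exhaustive brute force as my current implementation, so still exponential):
import Data.List (minimumBy)
import Data.Ord (comparing)

-- Every path of k steps: each step consumes the front unconsumed
-- element of any sequence.
paths :: Int -> [[Int]] -> [[Int]]
paths 0 _    = [[]]
paths k seqs =
  [ head s : rest
  | (i, s) <- zip [0 ..] seqs, not (null s)
  , let seqs' = [ if j == i then tail t else t | (j, t) <- zip [0 ..] seqs ]
  , rest <- paths (k - 1) seqs' ]

lowestPath :: Int -> [[Int]] -> ([Int], Int)
lowestPath k seqs = minimumBy (comparing snd) [ (p, sum p) | p <- paths k seqs ]
Here lowestPath 3 [[0,1,1],[1,0,2],[0,0,2]] returns ([0,0,0],0), matching a1 -> c1 -> c2 above.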
An alternative solution using Dynamic Programming
Let's assume the arrays are given as a matrix A such that each row is identical to one of the original arrays. Your matrix will be of size (n+1)x(k+1), and make sure that A[_][0] = 0
Now, use DP to solve it:
f(x, y, z) = min( { f(i, y, z-1) | x < i <= n } ∪ { f(i+1, 0, z) } ) + A[x][y]
f(_, _, 0) = 0
f(n, k, z) = infinity for each z > 0
Idea: at each step you can choose to go to any of the following lines (same column), or go to the next column, decreasing the number of nodes still needed.
Moving to the next column is done via the dummy index A[_][0], without decreasing the number of nodes still needed and without cost, since A[_][0] = 0.
Complexity:
This solution is basically brute force, but by using memoization of each already-explored value of f(_,_,_) you only need to fill a matrix of size O(n*k^2), where each cell takes O(n) time to compute at first glance - but in practice it can be computed iteratively in O(1) per step, because you only need to minimize with the new element in the row (1). This gives you O(n*k^2) - better than brute force.
(1) This is done via min{x_1, x_2, ..., x_k} = min{x_k, min{x_1, ..., x_{k-1}}}, where min{x_1, ..., x_{k-1}} is already known.
You can implement a modified version of the A* algorithm:
1. Copy the matrix and fill it with 0s.
2. For each secondary diagonal m, from the last to the first:
3. For each cell n in m:
4. Set the new matrix's cell n to the old matrix's cell n minus min(the cell below n in the new matrix, the cell to the right of n in the new matrix).
5. Cell (0,0) in the new matrix is the shortest path.
** Implement the A* algorithm over the pseudocode above.

Evaluate all possible interpretations in OCaml

I need to evaluate whether two formulas are equivalent or not. Here, I use a simple definition of formula: a formula written in prefix notation.
For example, And(Atom("b"), True) means b and true, while And(Atom("b"), Or(Atom("c"), Not(Atom("c")))) means (b and (c or not c))
My idea is simple: get all the atoms, then apply every combination of truth values (in my case there will be 4 combinations: true-true, true-false, false-true, and false-false). The thing is, I don't know how to create these combinations.
For now, I know how to get all the atoms involved, so if there are 5 atoms, I should create 32 combinations. How can I do this in OCaml?
Ok, so what you need is a function combinations n that will produce all the boolean combinations of length n; let's represent them as lists of lists of booleans (i.e. a single assignment of variables will be a list of booleans). Then this function would do the job:
let rec combinations = function
  | 0 -> [[]]
  | n ->
    let rest = combinations (n - 1) in
    let comb_f = List.map (fun l -> false::l) rest in
    let comb_t = List.map (fun l -> true::l) rest in
    comb_t @ comb_f
There is only one combination of length 0 (the empty one), and for n > 0 we produce the combinations of length n-1 and prefix them with false and with true to produce all possible combinations of length n.
You could write a function to print such combinations, let's say:
let rec combinations_to_string = function
  | [] -> ""
  | x::xs ->
    let rec bools_to_str = function
      | [] -> ""
      | b::bs -> Printf.sprintf "%s%s" (if b then "T" else "F") (bools_to_str bs)
    in
    Printf.sprintf "[%s]%s" (bools_to_str x) (combinations_to_string xs)
and then test it all with:
let _ =
  let n = int_of_string Sys.argv.(1) in
  let combs = combinations n in
  Printf.eprintf "combinations(%d) = %s\n" n (combinations_to_string combs)
to get:
> ./combinations 3
combinations(3) = [TTT][TTF][TFT][TFF][FTT][FTF][FFT][FFF]
If you think of a list of booleans as a list of bits of fixed length, there is a very simple solution: Count!
If you want all the combinations of 4 booleans, count from 0 to 15 (2^4 - 1) -- then interpret each bit as one of the booleans. For simplicity I'll use a for-loop, but you can also do it with recursion:
let size = 4 in
(* '1 lsl size' computes 2^size *)
for i = 0 to (1 lsl size) - 1 do
  (* from: is the least significant bit '1'? *)
  let b0 = 1 = ((i / 1) mod 2) in
  let b1 = 1 = ((i / 2) mod 2) in
  let b2 = 1 = ((i / 4) mod 2) in
  (* to: is the most significant bit '1'? *)
  let b3 = 1 = ((i / 8) mod 2) in
  (* do your thing *)
  compute b0 b1 b2 b3
done
Of course you can make the body of the loop more general so that it e.g. creates a list/array of booleans depending on the size given above, etc.
The point is that you can solve this problem by enumerating all values you are searching for. If this is the case, compute all integers up to your problem size. Write a function that generates a value of your original problem from an integer. Put it all together.
This method has the advantage that you do not need to create all combinations up front before starting your computation. For large problems this might well save you. For the rather small size = 16 you would already need 65536 * sizeof(type) memory - and this grows exponentially with the size! The above solution requires only a constant amount of memory, sizeof(type).
And for science's sake: Your problem is NP-complete, so if you want the exact solution, it will take exponential time.
