Modular run-length encoding - algorithm

Question
How to implement a run length encoding modulus n>=1? For n=4, considering the inputAAABBBBABCCCCBBBDAAA, we want an output of [('D', 1), ('A', 3)]. Note the long-distance merging due to the modulus operation. See Explanation.
Explanation
The first occurance of BBBB encodes to (B, 4) whose modulus 4 is (B, 0), thus canceling itself out. See the diagram (ignore spaces; they are simply for illustrative purposes):
AAABBBBABCCCCBBBDAAA
A3 B4 ABCCCCBBBDAAA
A3 B0 ABCCCCBBBDAAA
A3 ABCCCCBBBDAAA
A4 BCCCCBBBDAAA
A0 BCCCCBBBDAAA
BCCCCBBBDAAA
...
DA3
A simpler example when no merging happens since none gets canceled by modulus 4: input AAABABBBC produces output [('A',3),('B',1),('A',1),('B',3),('C',1)].
Requirements
Haskell implementations are preferred but others are welcome too!
Prefer standard/common library functions over 3rd party libraries.
Prefer readable and succint programs utilizing higher-order functions.
Prefer efficiency (do not loop over the whole list whenever unnecessary)
My program
I implemented this in Haskell, but it looks too verbose and awful to read. The key idea is to check three tuples at a time, and only advance one tuple forward if we can neither cancel out 0 tuples nor merge a pair of tuples among the three tuples at hand.
import Data.List (group)
test = [('A', 1), ('A', 2), ('B', 2), ('B', 2), ('A', 1), ('B', 1), ('C', 1), ('C', 3), ('B', 3), ('D', 1), ('A', 3)] :: [(Char, Int)]
expected = [('D', 1), ('A', 3)] :: [(Char, Int)]
reduce' :: [(Char, Int)] -> [(Char, Int)]
reduce' [ ] = [] -- exit
reduce' ( (_,0):xs) = reduce' xs
reduce' (x1:(_,0):xs) = reduce' (x1:xs)
reduce' ( (x,n):[]) = (x,n):[] -- exit
reduce' ( (x1,n1):(x2,n2):[]) -- [previous,current,NONE]
| x1 == x2 = reduce' ((x1, d4 (n1+n2)):[])
| otherwise = (x1,n1):( -- advance
reduce' ((x2, d4 n2 ):[]))
reduce' ((x1,n1):(x2,n2):(x3,n3):xs) -- [previous,current,next]
| n3 == 0 = reduce' ((x1, d4 n1 ):(x2, d4 n2 ):xs)
| n2 == 0 = reduce' ((x1, d4 n1 ):(x3, d4 n3 ):xs)
| x2 == x3 = reduce' ((x1, d4 n1 ):(x2, d4 (n2+n3)):xs)
| x1 == x2 = reduce' ((x2, d4 (n1+n2)):(x3, d4 n3 ):xs)
| otherwise = (x1,n1):( -- advance
reduce' ((x2, d4 n2 ):(x3, d4 n3 ):xs)
)
-- Helpers
flatten :: [(Char, Int)] -> String
flatten nested = concat $ (\(x, n) -> replicate n x) <$> nested
nest :: String -> [(Char, Int)]
nest flat = zip (head <$> xg) (d4 .length <$> xg)
where xg = group flat
reduce = reduce' . nest
d4 = (`rem` 4)
Thoughts
My inputs are like the test variable in the snipped above. We could keep doing flatten then nest until its result doesn't change, and would definitely look simpler. But it feels it is scanning the whole list many times, while my 3-pointer implementation scans the whole list only once. Maybe we can pop an element from left and add it to a new stack while merging identical consecutive items? Or maybe use Applicative Functors? E.g. this works but not sure about its efficiency/performance: reduce = (until =<< ((==) =<<)) (nest . flatten).

Algorithm
I think you are making this problem much harder by thinking of it in terms of character strings at all. Instead, do a preliminary pass that just does the boring RLE part. This way, a second pass is comparatively easy, because you can work in "tokens" that represent a run of a certain length, instead of having to work one character at a time.
The only data structure we need to maintain as we do the second pass through the list is a stack, and we only ever need to look at its top element. We compare each token that we're examining with the top of the stack. If they're the same, we blend them into a single token representing their concatenation; otherwise, we simply push the next token onto the stack. In either case, we reduce token sizes mod N and drop tokens with size 0.
Performance
This algorithm runs in linear time: it processes each input token exactly once, and does a constant amount of work for each token.
It cannot produce output lazily. There is no prefix of the input that is sufficient to confidently produce a prefix of the output, so we have to wait till we have consumed the entire input to produce any output. Even something that "looks bad" like ABCABCABCABCABC can eventually be cancelled out if the rest of the string is CCCBBBAAA....
The reverse at the end is a bummer, but amortized over all the tokens it is quite cheap, and in any case does not worsen our linear-time guarantee. It likewise does not change our space requirements, since we already require O(N) space to buffer the output (since as the previous note says, it's never possible to emit a partial result).
Correctness
Writing down my remark about laziness made me think of your reduce solution, which appears to produce output lazily, which I thought was impossible. The explanation, it turns out, is that your implementation is not just inelegant, as you say, but also incorrect. It produces output too soon, missing chances to cancel with later elements. The simplest test case I can find that you fail is reduce "ABABBBBAAABBBAAA" == [('A',1),('A',3)]. We can confirm that this is due to yielding results too early, by noting that take 1 $ reduce ("ABAB" ++ undefined) yields [(1, 'A')] even though elements might come later that cancel with that first A.
Minutiae
Finally note that I use a custom data type Run just to give a name to the concept; of course you can convert this to a tuple cheaply, or rewrite the function to use tuples internally if you prefer.
Implementation
import Data.List (group)
data Run a = Run Int a deriving Show
modularRLE :: Eq a => Int -> [a] -> [Run a]
modularRLE groupSize = go [] . tokenize
where go stack [] = reverse stack
go stack (Run n x : remainder) = case stack of
[] -> go (blend n []) remainder
(Run m y : prev) | x == y -> go (blend (n + m) prev) remainder
| otherwise -> go (blend n stack) remainder
where blend i s = case i `mod` groupSize of
0 -> s
j -> Run j x : s
tokenize xs = [Run (length run) x | run#(x:_) <- group xs]
λ> modularRLE 4 "AAABBBBABCCCCBBBDAAA"
[Run 1 'D',Run 3 'A']
λ> modularRLE 4 "ABABBBBAAABBBAAA"
[]

My first observation will be that you only need to code one step of the resolution, since you can pass that step to a function that will feed it its own output until it stabilizes. This function was discussed in this SO question and was given a clever answer by #galva:
--from https://stackoverflow.com/a/23924238/7096763
converge :: Eq a => (a -> a) -> a -> a
converge = until =<< ((==) =<<)
This is the entrypoint of the algorithm:
-- |-------------step----------------------| |---------------process------|
solve = converge (filter (not . isFullTuple) . collapseOne) . fmap (liftA2 (,) head length) . group
Starting from the end of the line and moving backwards (following the order of execution), we first process a String into a [(Char, Int)] using fmap (liftA2 (,) head length) . group. Then we get to a bracketed block that contains our step function. The collapseOne takes a list of tuples and collapses at most one pair of tuples, deleting the resulting tuple if necessary (if mod 4 == 0) ([('A', 1), ('A', 2), ('B', 2)] ==> [('A', 3), ('B', 2)]):
collapseOne [x] = [x]
collapseOne [] = []
collapseOne (l:r:xs)
| fst l == fst r = (fst l, (snd l + snd r) `mod` 4):xs
| otherwise = l:(collapseOne (r:xs))
You also need to know if tuples are "ripe" and need to be filtered out:
isFullTuple = (==0) . (`mod` 4) . snd
I would argue that these 8 lines of code are significantly easier to read.

Related

Generate all unique directed graphs with 2 inputs to each node

I'm trying to generate all unique digraphs that fit a spec:
each node must have exactly 2 inputs
and are allowed arbitrarily many outputs to other nodes in the graph
My current solution is slow. Eg for 6 nodes, the algo has taken 1.5 days to get where I think it's complete, but it'll probably be checking for a few more days still.
My algorithm for a graph with n nodes:
generate all n-length strings of 0, where one symbol is a 1, eg, for n=3, [[0,0,1], [0,1,0], [1,0,0]]. These can be thought of as rows from an identity matrix.
generate all possible n * n matrixes where each row is all possible combinations of step 1. + step 1.
This is the connectivity matrix where each cell represents a connection from column-index to row-index
So, for n=3, these are possible:
[0,1,0] + [1,0,0] = [1,1,0]
[1,0,0] + [1,0,0] = [2,0,0]
These represent the inputs to a node, and by adding step 1 to itself, the result will always represent 2 inputs.
For ex:
A B C
A' [[0,1,1],
B' [0,2,0],
C' [1,1,0]]
So B and C connect to A once each: B -> A', C -> A',
And B connects to itself twice: B => B'
I only want unique ones, so for each connectivity matrix generated, I can only keep it if it is not isomorphic to an already-seen graph.
This step is expensive. I need to convert the graph to a "canonical form" by running through each permutation of isomorphic graphs, sorting them, and considering the first one as the "canonical form".
If anyone dives into testing any of this out, here are the count of unique graphs for n nodes:
2 - 6
3 - 44
4 - 475
5 - 6874
6 - 109,934 (I think, it's not done running yet but I haven't found a new graph in >24 hrs.)
7 - I really wanna know!
Possible optimizations:
since I get to generate the graphs to test, is there a way of ruling them out, without testing, as being isomorphic to already-seen ones?
is there a faster graph-isomorphism algorithm? I think this one is related to "Nauty", and there are others I've read of in papers, but I haven't had the expertise (or bandwidth) to implement them yet.
Here's a demonstrable connectivity matrix that can be plotted at graphonline.ru for fun, showing self connections, and 2 connections to t he same node:
1, 0, 0, 0, 0, 1,
1, 0, 0, 0, 1, 0,
0, 1, 0, 1, 0, 0,
0, 1, 2, 0, 0, 0,
0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 1, 0,
here's the code in haskell if you want to play with it, but I'm more concerned about getting the algorithm right (eg pruning down the search space), than the implementation:
-- | generate all permutations of length n given symbols from xs
npermutations :: [a] -> Int -> [[a]]
npermutations xs size = mapM (const xs) [1..size]
identity :: Int -> [[Int]]
identity size = scanl
(\xs _ -> take size $ 0 : xs) -- keep shifting right
(1 : (take (size - 1) (repeat 0))) -- initial, [1,0,0,...]
[1 .. size-1] -- correct size
-- | return all possible pairings of [Column]
columnPairs :: [[a]] -> [([a], [a])]
columnPairs xs = (map (\x y -> (x,y)) xs)
<*> xs
-- | remove duplicates
rmdups :: Ord a => [a] -> [a]
rmdups = rmdups' Set.empty where
rmdups' _ [] = []
rmdups' a (b : c) = if Set.member b a
then rmdups' a c
else b : rmdups' (Set.insert b a) c
-- | all possible patterns for inputting 2 things into one node.
-- eg [0,1,1] means cells B, and C project into some node
-- [0,2,0] means cell B projects twice into one node
binaryInputs :: Int -> [[Int]]
binaryInputs size = rmdups $ map -- rmdups because [1,0]+[0,1] is same as flipped
(\(x,y) -> zipWith (+) x y)
(columnPairs $ identity size)
transposeAdjMat :: [[Int]] -> [[Int]]
transposeAdjMat ([]:_) = []
transposeAdjMat m = (map head m) : transposeAdjMat (map tail m)
-- | AdjMap [(name, inbounds)]
data AdjMap a = AdjMap [(a, [a])] deriving (Show, Eq)
addAdjColToMap :: Int -- index
-> [Int] -- inbound
-> AdjMap Int
-> AdjMap Int
addAdjColToMap ix col (AdjMap xs) =
let conns = foldl (\c (cnt, i) -> case cnt of
1 -> i:c
2 -> i:i:c
_ -> c
)
[]
(zip col [0..]) in
AdjMap ((ix, conns) : xs)
adjMatToMap :: [[Int]] -> AdjMap Int
adjMatToMap cols = foldl
(\adjMap#(AdjMap nodes) col -> addAdjColToMap (length nodes) col adjMap)
(AdjMap [])
cols
-- | a graph's canonical form : http://mfukar.github.io/2015/09/30/haskellxiii.html
-- very expensive algo, of course
canon :: (Ord a, Enum a, Show a) => AdjMap a -> String
canon (AdjMap g) = minimum $ map f $ Data.List.permutations [1..(length g)]
where
-- Graph vertices:
vs = map fst g
-- Find, via brute force on all possible orderings (permutations) of vs,
-- a mapping of vs to [1..(length g)] which is minimal.
-- For example, map [1, 5, 6, 7] to [1, 2, 3, 4].
-- Minimal is defined lexicographically, since `f` returns strings:
f p = let n = zip vs p
in (show [(snd x, sort id $ map (\x -> snd $ head $ snd $ break ((==) x . fst) n)
$ snd $ take_edge g x)
| x <- sort snd n])
-- Sort elements of N in ascending order of (map f N):
sort f n = foldr (\x xs -> let (lt, gt) = break ((<) (f x) . f) xs
in lt ++ [x] ++ gt) [] n
-- Get the first entry from the adjacency list G that starts from the given node X
-- (actually, the vertex is the first entry of the pair, hence `(fst x)`):
take_edge g x = head $ dropWhile ((/=) (fst x) . fst) g
-- | all possible matrixes where each node has 2 inputs and arbitrary outs
binaryMatrixes :: Int -> [[[Int]]]
binaryMatrixes size = let columns = binaryInputs size
unfiltered = mapM (const columns) [1..size] in
fst $ foldl'
(\(keep, seen) x -> let can = canon . adjMatToMap $ x in
(if Set.member can seen
then keep
else id $! x : keep
, Set.insert can seen))
([], Set.fromList [])
unfiltered
There are a number of approaches you could try. One thing that I do note is that having loops with multi-edges (colored loops?) is a little unusual, but is probably just needs a refinement of existing techniques.
Filter the output of another program
The obvious candidate here is of course nAUTy/traces (http://pallini.di.uniroma1.it/) or similar (saucy, bliss, etc). Depending on how you want to do this, it could be as simple as run nauty (for example) and output to file, then read in the list filtering as you go.
For larger values of n this could start to be a problem if you are generating huge files. I'm not sure whether you start to run out of space before you run out of time, but still. What might be better is to generate and test them as you go, throwing away candidates. For your purposes, there may be an existing library for generation - I found this one but I have no idea how good it is.
Use graph invariants
A very easy first step to more efficient listing of graphs is to filter using graph invariants. An obvious one would be degree sequence (the ordered list of degrees of the graph). Others include the number of cycles, the girth, and so on. For your purposes, there might be some indegree/outdegree sequence you could use.
The basic idea is to use the invariant as a filter to avoid expensive checks for isomorphism. You can store the (list of ) invariants for already generated graphs, and check the new one against the list first. The canonical form of a structure is a kind of invariant.
Implement an algorithm
There are lost of GI algorithms, including the ones used by nauty and friends. However, they do tend to be quite hard! The description given in this answer is an excellent overview, but the devil is in the details of course.
Also note that the description is for general graphs, while you have a specific subclass of graph that might be easier to generate. There may be papers out there for digraph listing (generating) but I have not checked.

Haskell Recursion - finding largest difference between numbers in list

Here's the problem at hand: I need to find the largest difference between adjacent numbers in a list using recursion. Take the following list for example: [1,2,5,6,7,9]. The largest difference between two adjacent numbers is 3 (between 2 and 5).
I know that recursion may not be the best solution, but I'm trying to improve my ability to use recursion in Haskell.
Here's the current code I currently have:
largestDiff (x:y:xs) = if (length (y:xs) > 1) then max((x-y), largestDiff (y:xs)) else 0
Basically - the list will keep getting shorter until it reaches 1 (i.e. no more numbers can be compared, then it returns 0). As 0 passes up the call stack, the max function is then used to implement a 'King of the Hill' type algorithm. Finally - at the end of the call stack, the largest number should be returned.
Trouble is, I'm getting an error in my code that I can't work around:
Occurs check: cannot construct the infinite type:
t1 = (t0, t1) -> (t0, t1)
In the return type of a call of `largestDiff'
Probable cause: `largestDiff' is applied to too few arguments
In the expression: largestDiff (y : xs)
In the first argument of `max', namely
`((x - y), largestDiff (y : xs))'
Anyone have some words of wisdom to share?
Thanks for your time!
EDIT: Thanks everyone for your time - I ended up independently discovering a much simpler way after much trial and error.
largestDiff [] = error "List too small"
largestDiff [x] = error "List too small"
largestDiff [x,y] = abs(x-y)
largestDiff (x:y:xs) = max(abs(x-y)) (largestDiff (y:xs))
Thanks again, all!
So the reason why your code is throwing an error is because
max((x-y), largestDiff (y:xs))
In Haskell, you do not use parentheses around parameters and separate them by commas, the correct syntax is
max (x - y) (largestDiff (y:xs))
The syntax you used is getting parsed as
max ((x - y), largestDiff (y:xs))
Which looks like you're passing a tuple to max!
However, this does not solve the problem. I always got 0 back. Instead, I would recommend breaking up the problem into two functions. You want to calculate the maximum of the difference, so first write a function to calculate the differences and then a function to calculate the maximum of those:
diffs :: Num a => [a] -> [a]
diffs [] = [] -- No elements case
diffs [x] = [] -- One element case
diffs (x:y:xs) = y - x : diffs (y:xs) -- Two or more elements case
largestDiff :: (Ord a, Num a) => [a] -> a
largestDiff xs = maximum $ map abs $ diffs xs
Notice how I've pulled the recursion out into the simplest possible case. We didn't need to calculate the maximum as we traversed the list; it's possible, just more complex. Since Haskell has a handy built-in function for calculating the maximum of a list for us, we can also leverage that. Our recursive function is clean and simple, and it is then combined with maximum to implement the desired largestDiff. As an FYI, diffs is really just a function to compute the derivative of a list of numbers, it can be a very useful function for data processing.
EDIT: Needed Ord constraint on largestDiff and added in map abs before calculating maximum.
Here's my take at it.
First some helpers:
diff a b = abs(a-b)
pick a b = if a > b then a else b
Then the solution:
mdiff :: [Int] -> Int
mdiff [] = 0
mdiff [_] = 0
mdiff (a:b:xs) = pick (diff a b) (mdiff (b:xs))
You have to provide two closing clauses, because the sequence might have either even or odd number of elements.
Another solution to this problem, which circumvents your error, can be obtained
by just transforming lists and folding/reducing them.
import Data.List (foldl')
diffs :: (Num a) => [a] -> [a]
diffs x = zipWith (-) x (drop 1 x)
absMax :: (Ord a, Num a) => [a] -> a
absMax x = foldl' max (fromInteger 0) (map abs x)
Now I admit this is a bit dense for a beginner, so I will explain the above.
The function zipWith transforms two given lists by using a binary function,
which is (-) in this case.
The second list we pass to zipWith is drop 1 x, which is just another way of
describing the tail of a list, but where tail [] results in an error,
drop 1 [] just yields the empty list. So drop 1 is the "safer" choice.
So the first function calculates the adjacent differences.
The name of the second function suggests that it calculates the maximum absolute
value of a given list, which is only partly true, it results in "0" if passed an
empty list.
But how does this happen, reading from right to left, we see that map abs
transforms every list element to its absolute value, which is asserted by
the Num a constraint. Then the foldl'-function traverses the list and
accumulates the maximum of the previous accumulator and the current element of
the list traversal. Moreover I'd like to mention that foldl' is the "strict"
sister/brother of the foldl-function, where the latter is rarely of use,
because it tends to build up a bunch of unevaluated expressions called thunks.
So let's quit all this blah blah and see it in action ;-)
> let a = diffs [1..3] :: [Int]
>>> zipWith (-) [1,2,3] (drop 1 [1,2,3])
<=> zipWith (-) [1,2,3] [2,3]
<=> [1-2,2-3] -- zipWith stops at the end of the SHORTER list
<=> [-1,-1]
> b = absMax a
>>> foldl' max (fromInteger 0) (map abs [-1,-1])
-- fromInteger 0 is in this case is just 0 - interesting stuff only happens
-- for other numerical types
<=> foldl' max 0 (map abs [-1,-1])
<=> foldl' max 0 [1,1]
<=> foldl' max (max 0 1) [1]
<=> foldl' max 1 [1]
<=> foldl' max (max 1 1) []
<=> foldl' max 1 [] -- foldl' _ acc [] returns just the accumulator
<=> 1

Recursion confusion in Haskell again - subsets with an inclusion test

I'm testing a simple program to generate subsets with an inclusion test. For example, given
*Main Data.List> factorsets 7
[([2],2),([2,3],1),([3],1),([5],1),([7],1)]
calling chooseP 3 (factorsets 7), I would like to get (read from right to left, a la cons)
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([2],2)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
But my program is returning an extra [([7],1),([5],1),([3],1)] (and missing a [([7],1),([5],1),([2],2)]):
[[([5],1),([3],1),([2],2)]
,[([7],1),([3],1),([2],2)]
,[([7],1),([5],1),([3],1)]
,[([7],1),([5],1),([2,3],1)]
,[([7],1),([5],1),([3],1)]]
The inclusion test is: members' first part of the tuple must have a null intersection.
Once tested as working, the plan is to sum the internal products of each subset's snds, rather than accumulate them.
Since I've asked a similar question before, I imagine that an extra branch is generated since when the recursion splits at [2,3], the second branch runs over the same possibilities once it passes the skipped section. Any pointers on how to resolve that would be appreciated; and if you'd like to share ideas about how to enumerate and sum such product combinations more efficiently, that would be great, too.
Haskell code:
chooseP k xs = chooseP' xs [] 0 where
chooseP' [] product count = if count == k then [product] else []
chooseP' yys product count
| count == k = [product]
| null yys = []
| otherwise = f ++ g
where (y:ys) = yys
(factorsY,numY) = y
f = let zzs = dropWhile (\(fs,ns) -> not . and . map (null . intersect fs . fst) $ product) yys
in if null zzs
then chooseP' [] product count
else let (z:zs) = zzs in chooseP' zs (z:product) (count + 1)
g = if and . map (null . intersect factorsY . fst) $ product
then chooseP' ys product count
else chooseP' ys [] 0
Your code is complicated enough that I might recommend starting over. Here's how I would proceed.
Write a specification. Let it be as stupidly inefficient as necessary -- for example, the spec I choose below will build all combinations of k elements from the list, then filter out the bad ones. Even the filter will be stupidly slow.
sorted xs = sort xs == xs
unique xs = nub xs == xs
disjoint xs = and $ liftM2 go xs xs where
go x1 x2 = x1 == x2 || null (intersect x1 x2)
-- check that x is valid according to all the validation functions in fs
-- (there are other fun ways to spell this, but this is particularly
-- readable and clearly correct -- just what we want from a spec)
allFuns fs x = all ($x) fs
choosePSpec k = filter good . replicateM k where
good pairs = allFuns [unique, disjoint, sorted] (map fst pairs)
Just to make sure it's right, we can test it at the prompt:
*Main> mapM_ print $ choosePSpec 3 [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
[([2],2),([3],1),([5],1)]
[([2],2),([3],1),([7],1)]
[([2],2),([5],1),([7],1)]
[([2,3],1),([5],1),([7],1)]
[([3],1),([5],1),([7],1)]
Looks good.
Now that we have a spec, we can try to improve the speed one refactoring at a time, always checking that it matches the spec. The first thing I'd want to do is notice that we can ensure uniqueness and sortedness just by sorting the input and picking things "in an increasing way". To do this, we can define a function which chooses subsequences of a given length. It piggy-backs on the tails function, which you can think of as nondeterministically choosing a place to split its input list.
subseq 0 xs = [[]]
subseq n xs = do
x':xt <- tails xs
xs' <- subseq (n-1) xt
return (x':xs')
Here's an example of this function in action:
*Main> subseq 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]
Now we can write a slightly faster chooseP by replacing replicateM with subseq. Recall that we're assuming the inputs are already sorted and unique, though.
choosePSlow k = filter good . subseq k where
good pairs = disjoint $ map fst pairs
We can sanity-check that it's working by running it on the particular input we have from above:
*Main> let i = [([2],2),([2,3],1),([3],1),([5],1),([7],1)]
*Main> choosePSlow 3 i == choosePSpec 3 i
True
Or, better yet, we can stress-test it with QuickCheck. We'll need a tiny bit more code. The condition k < 5 is just because the spec is so hopelessly slow that bigger values of k take forever.
propSlowMatchesSpec :: NonNegative Int -> OrderedList ([Int], Int) -> Property
propSlowMatchesSpec (NonNegative k) (Ordered xs)
= k < 5 && unique (map fst xs)
==> choosePSlow k xs == choosePSpec k xs
*Main> quickCheck propSlowMatchesSpec
+++ OK, passed 100 tests.
There are several more opportunities to make things faster. For instance, the disjoint test could be sped up using choose 2 instead of liftM2; or we might be able to ensure disjointness during element selection and prune the search even earlier; etc. How you want to improve it from here I leave to you -- but the basic technique (start with stupid and slow, then make it smarter, testing as you go) should be helpful to you.

Writing infinite list to skip every factor of p?

How can I efficiently represent the list [0..] \\ [t+0*p, t+1*p ..]?
I have defined:
Prelude> let factors p t = [t+0*p, t+1*p ..]
I want to efficiently represent an infinite list that is the difference of [0..] and factors p t, but using \\ from Data.List requires too much memory for even medium-sized lists:
Prelude Data.List> [0..10000] \\ (factors 5 0)
<interactive>: out of memory
I know that I can represent the values between t+0*p and t+1*p with:
Prelude> let innerList p1 p2 t = [t+p1+1, t+p1+2 .. t+p2-1]
Prelude> innerList 0 5 0
[1,2,3,4]
However, repeatedly calculating and concatenating innerList for increasing intervals seems clumsy.
Can I efficiently represent [0..] \\ (factors p t) without calculating rem or mod for each element?
For the infinite list [0..] \\ [t,t+p..],
yourlist t p = [0..t-1] ++ [i | m <- [0,p..], i <- [t+m+1..t+m+p-1]]
Of course this approach doesn't scale, at all, if you'd want to remove some other factors, like
[0..] \\ [t,t+p..] \\ [s,s+q..] \\ ...
in which case you'll have to remove them in sequence with minus, mentioned in Daniel Fischer's answer. There is no magic bullet here.
But there's also a union, with which the above becomes
[0..] \\ ( [t,t+p..] `union` [s,s+q..] `union` ... )
the advantage is, we can arrange the unions in a tree, and get algorithmic improvement.
You can't use (\\) for that, because
(\\) :: (Eq a) => [a] -> [a] -> [a]
(\\) = foldl (flip delete)
the list of elements you want to remove is infinite, and a left fold never terminates when the list it folds over is infinite.
If you rather want to use something already written than write it yourself, you can use minus from the data-ordlist package.
The performance should be adequate.
Otherwise,
minus :: Ord a => [a] -> [a] -> [a]
minus xxs#(x:xs) yys#(y:ys)
| x < y = x : minus xs yys
| x == y = minus xs ys
| otherwise = minus xss ys
minus xs _ = xs
You can use a list comprehesion with a predicate, using rem:
>>> let t = 0
>>> let p = 5
>>> take 40 $ [ x | x <- [1..], x `rem` p /= t ]
[1,2,3,4,6,7,8,9,11,12,13,14,16,17,18,19,21,22,23,24,26,27,28,29,31,32,33,34,36,37,38,39,41,42,43,44,46,47,48,49]
If you want efficiency, why does your solution have to use list comprehension syntax?
Why not something like this?
gen' n i p | i == p = gen' (n + p) 1 p
gen' n i p = (n+i) : gen' n (i+1) p
gen = gen' 0 1
and then do
gen 5
Because you have ascending lists, you can simply lazily merge them:
nums = [1..]
nogos = factors p t
result = merge nums (dropWhile (<head nums) nogos) where
merge (a:as) (b:bs)
| a < b = a : merge as (b:bs)
| a == b = merge as bs
| otherwise = error "should not happen"
Writing this in a general way so that we have a function that builds the difference of two infinite lists, provided only that they are in ascending order, is left as exercise. In the end, the following should be possible
[1..] `infiniteDifference` primes `infiniteDifference` squares
For this, make it a left associative operator.

variant of pascal's triangle in haskell - problem with lazy evaluation

To solve some problem I need to compute a variant of the pascal's triangle which is defined like this:
f(1,1) = 1,
f(n,k) = f(n-1,k-1) + f(n-1,k) + 1 for 1 <= k < n,
f(n,0) = 0,
f(n,n) = 2*f(n-1,n-1) + 1.
For n given I want to efficiently get the n-th line (f(n,1) .. f(n,n)). One further restriction: f(n,k) should be -1 if it would be >= 2^32.
My implementation:
next :: [Int64] -> [Int64]
next list#(x:_) = x+1 : takeWhile (/= -1) (nextRec list)
nextRec (a:rest#(b:_)) = boundAdd a b : nextRec rest
nextRec [a] = [boundAdd a a]
boundAdd x y
| x < 0 || y < 0 = -1
| x + y + 1 >= limit = -1
| otherwise = (x+y+1)
-- start shoud be [1]
fLine d start = until ((== d) . head) next start
The problem: for very large numbers I get a stack overflow. Is there a way to force haskell to evaluate the whole list? It's clear that each line can't contain more elements than an upper bound, because they eventually become -1 and don't get stored and each line only depends on the previous one. Due to the lazy evaluation only the head of each line is computed until the last line needs it's second element and all the trunks along the way are stored...
I have a very efficient implementation in c++ but I am really wondering if there is a way to get it done in haskell, too.
Works for me: What Haskell implementation are you using? A naive program to calculate this triangle works fine for me in GHC 6.10.4. I can print the 1000th row just fine:
nextRow :: [Integer] -> [Integer]
nextRow row = 0 : [a + b + 1 | (a, b) <- zip row (tail row ++ [last row])]
tri = iterate nextRow [0]
main = putStrLn $ show $ tri !! 1000 -- print 1000th row
I can even print the first 10 numbers in row 100000 without overflowing the stack. I'm not sure what's going wrong for you. The global name tri might be keeping the whole triangle of results alive, but even if it is, that seems relatively harmless.
How to force order of evaluation: You can force thunks to be evaluated in a certain order using the Prelude function seq (which is a magic function that can't be implemented in terms of Haskell's other basic features). If you tell Haskell to print a `seq` b, it first evaluates the thunk for a, then evaluates and prints b.
Note that seq is shallow: it only does enough evaluation to force a to no longer be a thunk. If a is of a tuple type, the result might still be a tuple of thunks. If it's a list, the result might be a cons cell having thunks for both the head and the tail.
It seems like you shouldn't need to do this for such a simple problem; a few thousand thunks shouldn't be too much for any reasonable implementation. But it would go like this:
-- Evaluate a whole list of thunks before calculating `result`.
-- This returns `result`.
seqList :: [b] -> a -> a
seqList lst result = foldr seq result lst
-- Exactly the same as `nextRow`, but compute every element of `row`
-- before calculating any element of the next row.
nextRow' :: [Integer] -> [Integer]
nextRow' row = row `seqList` nextRow row
tri = iterate nextRow' [0]
The fold in seqList basically expands to lst!!0 `seq` lst!!1 `seq` lst!!2 `seq` ... `seq` result.
This is much slower for me when printing just the first 10 elements of row 100,000. I think that's because it requires computing 99,999 complete rows of the triangle.

Resources