I am currently trying to refresh my Haskell knowledge by solving some Hackerrank problems.
For example:
https://www.hackerrank.com/challenges/maximum-palindromes/problem
I've already implemented an imperative solution in C++ which got accepted for all test cases. Now I am trying to come up with a pure functional solution in (reasonably idiomatic) Haskell.
My current code is
module Main where
import Control.Monad
import qualified Data.ByteString.Char8 as C
import Data.Bits
import Data.List
import qualified Data.Map.Strict as Map
import qualified Data.IntMap.Strict as IntMap
import Debug.Trace
-- precompute factorials
compFactorials :: Int -> Int -> IntMap.IntMap Int
compFactorials n m = go 0 1 IntMap.empty
where
go a acc map
| a < 0 = map
| a < n = go a' acc' map'
| otherwise = map'
where
map' = IntMap.insert a acc map
a' = a + 1
acc' = (acc * a') `mod` m
-- precompute invs
compInvs :: Int -> Int -> IntMap.IntMap Int -> IntMap.IntMap Int
compInvs n m facts = go 0 IntMap.empty
where
go a map
| a < 0 = map
| a < n = go a' map'
| otherwise = map'
where
map' = IntMap.insert a v map
a' = a + 1
v = (modExp b (m-2) m) `mod` m
b = (IntMap.!) facts a
modExp :: Int -> Int -> Int -> Int
modExp b e m = go b e 1
where
go b e r
| (.&.) e 1 == 1 = go b' e' r'
| e > 0 = go b' e' r
| otherwise = r
where
r' = (r * b) `mod` m
b' = (b * b) `mod` m
e' = shift e (-1)
-- precompute frequency table
initFreqMap :: C.ByteString -> Map.Map Char (IntMap.IntMap Int)
initFreqMap inp = go 1 map1 map2 inp
where
map1 = Map.fromList $ zip ['a'..'z'] $ repeat 0
map2 = Map.fromList $ zip ['a'..'z'] $ repeat IntMap.empty
go idx m1 m2 inp
| C.null inp = m2
| otherwise = go (idx+1) m1' m2' $ C.tail inp
where
m1' = Map.update (\v -> Just $ v+1) (C.head inp) m1
m2' = foldl' (\m w -> Map.update (\v -> liftM (\c -> IntMap.insert idx c v) $ Map.lookup w m1') w m)
m2 ['a'..'z']
query :: Int -> Int -> Int -> Map.Map Char (IntMap.IntMap Int)
-> IntMap.IntMap Int -> IntMap.IntMap Int -> Int
query l r m freqMap facts invs
| x > 1 = (x * y) `mod` m
| otherwise = y
where
calcCnt cs = cr - cl
where
cl = IntMap.findWithDefault 0 (l-1) cs
cr = IntMap.findWithDefault 0 r cs
f1 acc cs
| even cnt = acc
| otherwise = acc + 1
where
cnt = calcCnt cs
f2 (acc1,acc2) cs
| cnt < 2 = (acc1 ,acc2)
| otherwise = (acc1',acc2')
where
cnt = calcCnt cs
n = cnt `div` 2
acc1' = acc1 + n
r = choose acc1' n
acc2' = (acc2 * r) `mod` m
-- calc binomial coefficient using Fermat's little theorem
choose n k
| n < k = 0
| otherwise = (f1 * t) `mod` m
where
f1 = (IntMap.!) facts n
i1 = (IntMap.!) invs k
i2 = (IntMap.!) invs (n-k)
t = (i1 * i2) `mod` m
x = Map.foldl' f1 0 freqMap
y = snd $ Map.foldl' f2 (0,1) freqMap
main :: IO()
main = do
inp <- C.getLine
q <- readLn :: IO Int
let modulo = 1000000007
let facts = compFactorials (C.length inp) modulo
let invs = compInvs (C.length inp) modulo facts
let freqMap = initFreqMap inp
forM_ [1..q] $ \_ -> do
line <- getLine
let [s1, s2] = words line
let l = (read s1) :: Int
let r = (read s2) :: Int
let result = query l r modulo freqMap facts invs
putStrLn $ show result
It passes all small and medium test cases but I am getting timeout with large test cases.
The key to solve this problem is to precompute some stuff once at the beginning and use them to answer the individual queries efficiently.
Now, my main problem where I need help is:
The initital profiling shows that the lookup operation of the IntMap seems to be the main bottleneck. Is there better alternative to IntMap for memoization? Or should I look at Vector or Array, which I believe will lead to more "ugly" code.
Even in current state, the code doesn't look nice (by functional standards) and as verbose as my C++ solution. Any tips to make it more idiomatic? Other than IntMap usage for memoization, do you spot any other obvious problems which can lead to performance problems?
And is there any good sources, where I can learn how to use Haskell more effectively for competitive programming?
A sample large testcase, where the current code gets timeout:
input.txt
output.txt
For comparison my C++ solution:
#include <vector>
#include <iostream>
#define MOD 1000000007L
long mod_exp(long b, long e) {
long r = 1;
while (e > 0) {
if ((e & 1) == 1) {
r = (r * b) % MOD;
}
b = (b * b) % MOD;
e >>= 1;
}
return r;
}
long n_choose_k(int n, int k, const std::vector<long> &fact_map, const std::vector<long> &inv_map) {
if (n < k) {
return 0;
}
long l1 = fact_map[n];
long l2 = (inv_map[k] * inv_map[n-k]) % MOD;
return (l1 * l2) % MOD;
}
int main() {
std::string s;
int q;
std::cin >> s >> q;
std::vector<std::vector<long>> freq_map;
std::vector<long> fact_map(s.size()+1);
std::vector<long> inv_map(s.size()+1);
for (int i = 0; i < 26; i++) {
freq_map.emplace_back(std::vector<long>(s.size(), 0));
}
std::vector<long> acc_map(26, 0);
for (int i = 0; i < s.size(); i++) {
acc_map[s[i]-'a']++;
for (int j = 0; j < 26; j++) {
freq_map[j][i] = acc_map[j];
}
}
fact_map[0] = 1;
inv_map[0] = 1;
for (int i = 1; i <= s.size(); i++) {
fact_map[i] = (i * fact_map[i-1]) % MOD;
inv_map[i] = mod_exp(fact_map[i], MOD-2) % MOD;
}
while (q--) {
int l, r;
std::cin >> l >> r;
std::vector<long> x(26, 0);
long t = 0;
long acc = 0;
long result = 1;
for (int i = 0; i < 26; i++) {
auto cnt = freq_map[i][r-1] - (l > 1 ? freq_map[i][l-2] : 0);
if (cnt % 2 != 0) {
t++;
}
long n = cnt / 2;
if (n > 0) {
acc += n;
result *= n_choose_k(acc, n, fact_map, inv_map);
result = result % MOD;
}
}
if (t > 0) {
result *= t;
result = result % MOD;
}
std::cout << result << std::endl;
}
}
UPDATE:
DanielWagner's answer has confirmed my suspicion that the main problem in my code was the usage of IntMap for memoization. Replacing IntMap with Array made my code perform similar to DanielWagner's solution.
module Main where
import Control.Monad
import Data.Array (Array)
import qualified Data.Array as A
import qualified Data.ByteString.Char8 as C
import Data.Bits
import Data.List
import Debug.Trace
-- precompute factorials
compFactorials :: Int -> Int -> Array Int Int
compFactorials n m = A.listArray (0,n) $ scanl' f 1 [1..n]
where
f acc a = (acc * a) `mod` m
-- precompute invs
compInvs :: Int -> Int -> Array Int Int -> Array Int Int
compInvs n m facts = A.listArray (0,n) $ map f [0..n]
where
f a = (modExp ((A.!) facts a) (m-2) m) `mod` m
modExp :: Int -> Int -> Int -> Int
modExp b e m = go b e 1
where
go b e r
| (.&.) e 1 == 1 = go b' e' r'
| e > 0 = go b' e' r
| otherwise = r
where
r' = (r * b) `mod` m
b' = (b * b) `mod` m
e' = shift e (-1)
-- precompute frequency table
initFreqMap :: C.ByteString -> Map.Map Char (Array Int Int)
initFreqMap inp = Map.fromList $ map f ['a'..'z']
where
n = C.length inp
f c = (c, A.listArray (0,n) $ scanl' g 0 [0..n-1])
where
g x j
| C.index inp j == c = x+1
| otherwise = x
query :: Int -> Int -> Int -> Map.Map Char (Array Int Int)
-> Array Int Int -> Array Int Int -> Int
query l r m freqMap facts invs
| x > 1 = (x * y) `mod` m
| otherwise = y
where
calcCnt freqMap = cr - cl
where
cl = (A.!) freqMap (l-1)
cr = (A.!) freqMap r
f1 acc cs
| even cnt = acc
| otherwise = acc + 1
where
cnt = calcCnt cs
f2 (acc1,acc2) cs
| cnt < 2 = (acc1 ,acc2)
| otherwise = (acc1',acc2')
where
cnt = calcCnt cs
n = cnt `div` 2
acc1' = acc1 + n
r = choose acc1' n
acc2' = (acc2 * r) `mod` m
-- calc binomial coefficient using Fermat's little theorem
choose n k
| n < k = 0
| otherwise = (f1 * t) `mod` m
where
f1 = (A.!) facts n
i1 = (A.!) invs k
i2 = (A.!) invs (n-k)
t = (i1 * i2) `mod` m
x = Map.foldl' f1 0 freqMap
y = snd $ Map.foldl' f2 (0,1) freqMap
main :: IO()
main = do
inp <- C.getLine
q <- readLn :: IO Int
let modulo = 1000000007
let facts = compFactorials (C.length inp) modulo
let invs = compInvs (C.length inp) modulo facts
let freqMap = initFreqMap inp
replicateM_ q $ do
line <- getLine
let [s1, s2] = words line
let l = (read s1) :: Int
let r = (read s2) :: Int
let result = query l r modulo freqMap facts invs
putStrLn $ show result
I think you've shot yourself in the foot by trying to be too clever. Below I'll show a straightforward implementation of a slightly different algorithm that is about 5x faster than your Haskell code.
Here's the core combinatoric computation. Given a character frequency count for a substring, we can compute the number of maximum-length palindromes this way:
Divide all the frequencies by two, rounding down; call this the div2-frequencies. We'll also want the mod2-frequencies, which is the set of letters for which we had to round down.
Sum the div2-frequencies to get the total length of the palindrome prefix; its factorial gives an overcount of the number of possible prefixes for the palindrome.
Take the product of the factorials of the div2-frequencies. This tells the factor by which we overcounted above.
Take the size of the mod2-frequencies, or choose 1 if there are none. We can extend any of the palindrome prefixes by one of the values in this set, if there are any, so we have to multiply by this size.
For the overcounting step, it's not super obvious to me whether it would be faster to store precomputed inverses for factorials, and take their product, or whether it's faster to just take the product of all the factorials and do one inverse operation at the very end. I'll do the latter, because it just intuitively seems faster to do one inversion per query than one lookup per repeated letter, but what do I know? Should be easy to test if you want to try to adapt the code yourself.
There's only one other quick insight I had vs. your code, which is that we can cache the frequency counts for prefixes of the input; then computing the frequency count for a substring is just pointwise subtraction of two cached counts. Your precomputation on the input I find to be a bit excessive in comparison.
Without further ado, let's see some code. As usual there's some preamble.
module Main where
import Control.Monad
import Data.Array (Array)
import qualified Data.Array as A
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import Data.Monoid
Like you, I want to do all my computations on cheap Ints and bake in the modular operations where possible. I'll make a newtype to make sure this happens for me.
newtype Mod1000000007 = Mod Int deriving (Eq, Ord)
instance Num Mod1000000007 where
fromInteger = Mod . (`mod` 1000000007) . fromInteger
Mod l + Mod r = Mod ((l+r) `rem` 1000000007)
Mod l * Mod r = Mod ((l*r) `rem` 1000000007)
negate (Mod v) = Mod ((1000000007 - v) `rem` 1000000007)
abs = id
signum = id
instance Integral Mod1000000007 where
toInteger (Mod n) = toInteger n
quotRem a b = (a * b^1000000005, 0)
I baked in the base of 1000000007 in several places, but it's easy to generalize by giving Mod a phantom parameter and making a HasBase class to pick the base. Ask a fresh question if you're not sure how and are interested; I'll be happy to do a more thorough writeup. There's a few more instances for Mod that are basically uninteresting and primarily needed because of Haskell's wacko numeric class hierarchy:
instance Show Mod1000000007 where show (Mod n) = show n
instance Real Mod1000000007 where toRational (Mod n) = toRational n
instance Enum Mod1000000007 where
toEnum = Mod . (`mod` 1000000007)
fromEnum (Mod n) = n
Here's the precomputation we want to do for factorials...
type FactMap = Array Int Mod1000000007
factMap :: Int -> FactMap
factMap n = A.listArray (0,n) (scanl (*) 1 [1..])
...and for precomputing frequency maps for each prefix, plus getting a frequency map given a start and end point.
type FreqMap = Map Char Int
freqMaps :: String -> Array Int FreqMap
freqMaps s = go where
go = A.listArray (0, length s)
(M.empty : [M.insertWith (+) c 1 (go A.! i) | (i, c) <- zip [0..] s])
substringFreqMap :: Array Int FreqMap -> Int -> Int -> FreqMap
substringFreqMap maps l r = M.unionWith (-) (maps A.! r) (maps A.! (l-1))
Implementing the core computation described above is just a few lines of code, now that we have suitable Num and Integral instances for Mod1000000007:
palindromeCount :: FactMap -> FreqMap -> Mod1000000007
palindromeCount facts freqs
= toEnum (max 1 mod2Freqs)
* (facts A.! sum div2Freqs)
`div` product (map (facts A.!) div2Freqs)
where
(div2Freqs, Sum mod2Freqs) = foldMap (\n -> ([n `quot` 2], Sum (n `rem` 2))) freqs
Now we just need a short driver to read stuff and pass it around to the appropriate functions.
main :: IO ()
main = do
inp <- getLine
q <- readLn
let freqs = freqMaps inp
facts = factMap (length inp)
replicateM_ q $ do
[l,r] <- map read . words <$> getLine
print . palindromeCount facts $ substringFreqMap freqs l r
That's it. Notably I made no attempt to be fancy about bitwise operations and didn't do anything fancy with accumulators; everything is in what I would consider idiomatic purely-functional style. The final count is about half as much code that runs about 5x faster.
P.S. Just for fun, I replaced the last line with print (l+r :: Int)... and discovered that about half the time is spent in read. Ouch! Seems there's still plenty of low-hanging fruit if this isn't fast enough yet.
I have implemented a FIR filter in Haskell. I don't know that much about FIR filters and my code is heavily based on an existing C# implementation. Therefore, I have a feeling that my implementation is has too much of a C# style and is not really Haskell-like. I would like to know if there is a more idiomatic Haskell way of implementing my code. Ideally, I'm lucky for some combination of higher-order functions (map, filter, fold, etc.) that implement the algorithm.
My Haskell code looks like this:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
help i = if i >= (U.length b - 1) then loop i (U.length b - 1) else 0
loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
This code is based on the following C# code:
public float[] RunFilter(double[] x)
{
int M = coeff.Length;
int n = x.Length;
//y[n]=b0x[n]+b1x[n-1]+....bmx[n-M]
var y = new float[n];
for (int yi = 0; yi < n; yi++)
{
double t = 0.0f;
for (int bi = M - 1; bi >= 0; bi--)
{
if (yi - bi < 0) continue;
t += coeff[bi] * x[yi - bi];
}
y[yi] = (float) t;
}
return y;
}
As you can see, it's almost a straight copy. How can I turn my implementation into a more Haskell-like one? Do you have any ideas? The only thing I could come up with was using Vector.generate.
I know that the DSP library has an implementation available. But it uses lists and is way too slow for my use case. This Vector implementation is a lot faster than the one in DSP.
I've also tried implementing the algorithm using Repa. It is faster than the Vector implementation. Here is the result:
applyFIR :: V.Vector Float -> Array U DIM1 Float -> Array D DIM1 Float
applyFIR b x = R.traverse x id (\_ (Z :. i) -> if i >= len then loop i (len - 1) else 0)
where
len = V.length b
loop :: Int -> Int -> Float
loop yi bi = if bi < 0 then 0 else (V.unsafeIndex b bi) * x !! (Z :. (yi-bi)) + loop yi (bi-1)
arr !! i = unsafeIndex arr i
First of all, I don't think that your initial vector code is a faithful translation - that is, I think it disagrees with the C# code. For example, suppose that both "x" and "b" ("b" is coeff in C#) have length 3, and have all values of 1.0. Then for y[0] the C# code would produce x[0] * coeff[0], or 1.0. (it would hit continue for all other values of bi)
With your Haskell code, however, help 0 produces 0. Your Repa version seems to suffer from the same problem.
So let's start with a more faithful translation:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
help i = loop i (min i $ U.length b - 1)
loop yi bi = if bi < 0 then 0 else b !! bi * x !! (yi-bi) + loop yi (bi-1)
vec !! i = unsafeIndex vec i -- Shorthand for unsafeIndex
Now, you're basically doing a calculation like this for computing, say, y[3]:
... b[3] | b[2] | b[1] | b[0]
x[0] | x[1] | x[2] | x[3] | x[4] | x[5] | ....
multiply
b[3]*x[0]|b[2]*x[1] |b[1]*x[2] |b[0]*x[3]
sum
y[3] = b[3]*x[0] + b[2]*x[1] + b[1]*x[2] + b[0]*x[3]
So one way to think of what you're doing is "take the b vector, reverse it, and to compute spot i of the result, line b[0] up with x[i], multiply all the corresponding x and b entries, and compute the sum".
So let's do that:
applyFIR :: Vector Double -> Vector Double -> Vector Double
applyFIR b x = generate (U.length x) help
where
revB = U.reverse b
bLen = U.length b
help i = let sliceLen = min (i+1) bLen
bSlice = U.slice (bLen - sliceLen) sliceLen revB
xSlice = U.slice (i + 1 - sliceLen) sliceLen x
in U.sum $ U.zipWith (*) bSlice xSlice
I have been tried to do the Morris Pratt table and the code is basically this one in C:
void preMp(char *x, int m, int mpNext[]) {
int i, j;
i = 0;
j = mpNext[0] = -1;
while (i < m) {
while (j > -1 && x[i] != x[j])
j = mpNext[j];
mpNext[++i] = ++j;
}
}
and here is where i get so far in Fortran
program MP_ALGORITHM
implicit none
integer, parameter :: m=4
character(LEN=m) :: x='abac'
integer, dimension(4) :: T
integer :: i, j
i=0
T(1)=-1
j=-1
do while(i < m)
do while((j > -1) .AND. (x(i+1:i+1) /= (x(j+i+1:j+i+1))))
j=T(j)
end do
i=i+1
j=j+1
T(i)=j
end do
print *, T(1:)
end program MP_ALGORITHM
and the problem is i think i am having the wrong output.
for x=abac it should be (?):
a b a c
-1 0 1 0
and my code is returning 0 1 1 1
so, what i've done wrong?
The problem here is that C indices start from zero, but Fortran indices start from one. You can try to adjust the index for every array acces by one, but this will get unwieldy.
The Morris-Pratt table itself is an array of indices, so it should look different in C and Fortran: The Fortran array should have one-based indices and it should use zero as invalid index.
Together with the error that chw21 pointed out, your function might look like this:
subroutine kmp_table(x, t)
implicit none
character(*), intent(in) :: x
integer, dimension(:), intent(out) :: t
integer m
integer :: i, j
m = len(x)
i = 1
t(1) = 0
j = 0
do while (i < m)
do while(j > 0 .and. x(i:i) /= x(j:j))
j = t(j)
end do
i = i + 1
j = j + 1
t(i) = j
end do
end subroutine
You can then use it in the Morris-Pratt algorithm as taken straight from the Wikipedia page with adjustment for Fortran indices:
function kmp_index(S, W) result(res)
implicit none
integer :: res
character(*), intent(in) :: S ! text to search
character(*), intent(in) :: W ! word to find
integer :: m ! zero-based offset in S
integer :: i ! one-based offset in W and T
integer, dimension(len(W)) :: T ! KMP table
call kmp_table(W, T)
i = 1
m = 0
do while (m + i <= len(S))
if (W(i:i) == S(m + i:m + i)) then
if (i == len(W)) then
res = m + 1
return
end if
i = i + 1
else
if (T(i) > 0) then
m = m + i - T(i)
i = T(i)
else
i = 1
m = m + 1
end if
end if
end do
res = 0
end function
(The index m is zero-based here, because t is only ever used in conjunction with i in S(m + i:m + i). Adding two one-based indices will yield an offset of one, whereas keeping m zero-based makes this a neutral addition. m is a local variable that isn't exposed to code from the outside.)
Alternatively, you could make your Fortran arrays zero-based by specifying a lower bound of zero for your string and array. That will clash with the useful character(*) notation, though, which always uses one-based indexing. In my opinion, it is better to think about the whole algorithm in the typical one-based indexing scheme of Fortran.
this site isn't really a debugging site. Normally I would suggest you have a look at how to debug code. It didn't take me very long to go through your code with a pen and paper and verify that that is indeed the table it produces.
Still, here are a few pointers:
The C code compares x[i] and x[j], but you compare x[i] and x[i+j] in your Fortran code, more or less.
Integer arrays usually also start at index 1 in Fortran. So just like adding one to the index in the x String, you also need to add 1 every time you access T anywhere.
I am getting a trouble finding an approach to solve this problem.
Input-output sequences are as follows :
**input1 :** aaagctgctagag
**output1 :** a3gct2ag2
**input2 :** aaaaaaagctaagctaag
**output2 :** a6agcta2ag
Input nsequence can be of 10^6 characters and largest continuous patterns will be considered.
For example for input2 "agctaagcta" output will not be "agcta2gcta" but it will be "agcta2".
Any help appreciated.
Explanation of the algorithm:
Having a sequence S with symbols s(1), s(2),…, s(N).
Let B(i) be the best compressed sequence with elements s(1), s(2),…,s(i).
So, for example, B(3) will be the best compressed sequence for s(1), s(2), s(3).
What we want to know is B(N).
To find it, we will proceed by induction. We want to calculate B(i+1), knowing B(i), B(i-1), B(i-2), …, B(1), B(0), where B(0) is empty sequence, and and B(1) = s(1). At the same time, this constitutes a proof that the solution is optimal. ;)
To calculate B(i+1), we will pick the best sequence among the candidates:
Candidate sequences where the last block has one element:
B(i )s(i+1)1
B(i-1)s(i+1)2 ; only if s(i) = s(i+1)
B(i-2)s(i+1)3 ; only if s(i-1) = s(i) and s(i) = s(i+1)
…
B(1)s(i+1)[i-1] ; only if s(2)=s(3) and s(3)=s(4) and … and s(i) = s(i+1)
B(0)s(i+1)i = s(i+1)i ; only if s(1)=s(2) and s(2)=s(3) and … and s(i) = s(i+1)
Candidate sequences where the last block has 2 elements:
B(i-1)s(i)s(i+1)1
B(i-3)s(i)s(i+1)2 ; only if s(i-2)s(i-1)=s(i)s(i+1)
B(i-5)s(i)s(i+1)3 ; only if s(i-4)s(i-3)=s(i-2)s(i-1) and s(i-2)s(i-1)=s(i)s(i+1)
…
Candidate sequences where the last block has 3 elements:
…
Candidate sequences where the last block has 4 elements:
…
…
Candidate sequences where last block has n+1 elements:
s(1)s(2)s(3)………s(i+1)
For each possibility, the algorithm stops when the sequence block is no longer repeated. And that’s it.
The algorithm will be some thing like this in psude-c code:
B(0) = “”
for (i=1; i<=N; i++) {
// Calculate all the candidates for B(i)
BestCandidate=null
for (j=1; j<=i; j++) {
Calculate all the candidates of length (i)
r=1;
do {
Candidadte = B([i-j]*r-1) s(i-j+1)…s(i-1)s(i) r
If ( (BestCandidate==null)
|| (Candidate is shorter that BestCandidate))
{
BestCandidate=Candidate.
}
r++;
} while ( ([i-j]*r <= i)
&&(s(i-j*r+1) s(i-j*r+2)…s(i-j*r+j) == s(i-j+1) s(i-j+2)…s(i-j+j))
}
B(i)=BestCandidate
}
Hope that this can help a little more.
The full C program performing the required task is given below. It runs in O(n^2). The central part is only 30 lines of code.
EDIT I have restructured a little bit the code, changed the names of the variables and added some comment in order to be more readable.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
// This struct represents a compressed segment like atg4, g3, agc1
struct Segment {
char *elements;
int nElements;
int count;
};
// As an example, for the segment agagt3 elements would be:
// {
// elements: "agagt",
// nElements: 5,
// count: 3
// }
struct Sequence {
struct Segment lastSegment;
struct Sequence *prev; // Points to a sequence without the last segment or NULL if it is the first segment
int totalLen; // Total length of the compressed sequence.
};
// as an example, for the sequence agt32ta5, the representation will be:
// {
// lastSegment:{"ta" , 2 , 5},
// prev: #A,
// totalLen: 8
// }
// and A will be
// {
// lastSegment{ "agt", 3, 32},
// prev: NULL,
// totalLen: 5
// }
// This function converts a sequence to a string.
// You have to free the string after using it.
// The strategy is to construct the string from right to left.
char *sequence2string(struct Sequence *S) {
char *Res=malloc(S->totalLen + 1);
char *digits="0123456789";
int p= S->totalLen;
Res[p]=0;
while (S!=NULL) {
// first we insert the count of the last element.
// We do digit by digit starting with the units.
int C = S->lastSegment.count;
while (C) {
p--;
Res[p] = digits[ C % 10 ];
C /= 10;
}
p -= S->lastSegment.nElements;
strncpy(Res + p , S->lastSegment.elements, S->lastSegment.nElements);
S = S ->prev;
}
return Res;
}
// Compresses a dna sequence.
// Returns a string with the in sequence compressed.
// The returned string must be freed after using it.
char *dnaCompress(char *in) {
int i,j;
int N = strlen(in);; // Number of elements of a in sequence.
// B is an array of N+1 sequences where B(i) is the best compressed sequence sequence of the first i characters.
// What we want to return is B[N];
struct Sequence *B;
B = malloc((N+1) * sizeof (struct Sequence));
// We first do an initialization for i=0
B[0].lastSegment.elements="";
B[0].lastSegment.nElements=0;
B[0].lastSegment.count=0;
B[0].prev = NULL;
B[0].totalLen=0;
// and set totalLen of all the sequences to a very HIGH VALUE in this case N*2 will be enougth, We will try different sequences and keep the minimum one.
for (i=1; i<=N; i++) B[i].totalLen = INT_MAX; // A very high value
for (i=1; i<=N; i++) {
// at this point we want to calculate B[i] and we know B[i-1], B[i-2], .... ,B[0]
for (j=1; j<=i; j++) {
// Here we will check all the candidates where the last segment has j elements
int r=1; // number of times the last segment is repeated
int rNDigits=1; // Number of digits of r
int rNDigitsBound=10; // We will increment r, so this value is when r will have an extra digit.
// when r = 0,1,...,9 => rNDigitsBound = 10
// when r = 10,11,...,99 => rNDigitsBound = 100
// when r = 100,101,.,999 => rNDigitsBound = 1000 and so on.
do {
// Here we analitze a candidate B(i).
// where the las segment has j elements repeated r times.
int CandidateLen = B[i-j*r].totalLen + j + rNDigits;
if (CandidateLen < B[i].totalLen) {
B[i].lastSegment.elements = in + i - j*r;
B[i].lastSegment.nElements = j;
B[i].lastSegment.count = r;
B[i].prev = &(B[i-j*r]);
B[i].totalLen = CandidateLen;
}
r++;
if (r == rNDigitsBound ) {
rNDigits++;
rNDigitsBound *= 10;
}
} while ( (i - j*r >= 0)
&& (strncmp(in + i -j, in + i - j*r, j)==0));
}
}
char *Res=sequence2string(&(B[N]));
free(B);
return Res;
}
int main(int argc, char** argv) {
char *compressedDNA=dnaCompress(argv[1]);
puts(compressedDNA);
free(compressedDNA);
return 0;
}
Forget Ukonnen. Dynamic programming it is. With 3-dimensional table:
sequence position
subsequence size
number of segments
TERMINOLOGY: For example, having a = "aaagctgctagag", sequence position coordinate would run from 1 to 13. At sequence position 3 (letter 'g'), having subsequence size 4, the subsequence would be "gctg". Understood? And as for the number of segments, then expressing a as "aaagctgctagag1" consists of 1 segment (the sequence itself). Expressing it as "a3gct2ag2" consists of 3 segments. "aaagctgct1ag2" consists of 2 segments. "a2a1ctg2ag2" would consist of 4 segments. Understood? Now, with this, you start filling a 3-dimensional array 13 x 13 x 13, so your time and memory complexity seems to be around n ** 3 for this. Are you sure you can handle it for million-bp sequences? I think that greedy approach would be better, because large DNA sequences are unlikely to repeat exactly. And, I would suggest that you widen your assignment to approximate matches, and you can publish it straight in a journal.
Anyway, you will start filling the table of compressing a subsequence starting at some position (dimension 1) with length equal to dimension 2 coordinate, having at most dimension 3 segments. So you first fill the first row, representing compressions of subsequences of length 1 consisting of at most 1 segment:
a a a g c t g c t a g a g
1(a1) 1(a1) 1(a1) 1(g1) 1(c1) 1(t1) 1(g1) 1(c1) 1(t1) 1(a1) 1(g1) 1(a1) 1(g1)
The number is the character cost (always 1 for these trivial 1-char sequences; number 1 does not count into the character cost), and in the parenthesis, you have the compression (also trivial for this simple case). The second row will be still simple:
2(a2) 2(a2) 2(ag1) 2(gc1) 2(ct1) 2(tg1) 2(gc1) 2(ct1) 2(ta1) 2(ag1) 2(ga1) 2(ag1)
There is only 1 way to decompose a 2-character sequence into 2 subsequences -- 1 character + 1 character. If they are identical, the result is like a + a = a2. If they are different, such as a + g, then, because only 1-segment sequences are admissible, the result cannot be a1g1, but must be ag1. The third row will be finally more interesting:
2(a3) 2(aag1) 3(agc1) 3(gct1) 3(ctg1) 3(tgc1) 3(gct1) 3(cta1) 3(tag1) 3(aga1) 3(gag1)
Here, you can always choose between 2 ways of composing the compressed string. For example, aag can be composed either as aa + g or a + ag. But again, we cannot have 2 segments, as in aa1g1 or a1ag1, so we must be satisfied with aag1, unless both components consist of the same character, as in aa + a => a3, with character cost 2. We can continue onto 4 th line:
4(aaag1) 4(aagc1) 4(agct1) 4(gctg1) 4(ctgc1) 4(tgct1) 4(gcta1) 4(ctag1) 4(taga1) 3(ag2)
Here, on the first position, we cannot use a3g1, because only 1 segment is allowed at this layer. But at the last position, compression to character cost 3 is agchieved by ag1 + ag1 = ag2. This way, one can fill the whole first-level table all the way up to the single subsequence of 13 characters, and each subsequence will have its optimal character cost and its compression under the first-level constraint of at most 1 segment associated with it.
Then you go to the 2nd level, where 2 segments are allowed... And again, from the bottom up, you identify the optimum cost and compression of each table coordinate under the given level's segment count constraint, by comparing all the possible ways to compose the subsequence using already computed positions, until you fill the table completely and thus compute the global optimum. There are some details to solve, but sorry, I'm not gonna code this for you.
After trying my own way for a while, my kudos to jbaylina for his beautiful algorithm and C implementation. Here's my attempted version of jbaylina's algorithm in Haskell, and below it further development of my attempt at a linear-time algorithm that attempts to compress segments that include repeated patterns in a one-by-one fashion:
import Data.Map (fromList, insert, size, (!))
compress s = (foldl f (fromList [(0,([],0)),(1,([s!!0],1))]) [1..n - 1]) ! n
where
n = length s
f b i = insert (size b) bestCandidate b where
add (sequence, sLength) (sequence', sLength') =
(sequence ++ sequence', sLength + sLength')
j' = [1..min 100 i]
bestCandidate = foldr combCandidates (b!i `add` ([s!!i,'1'],2)) j'
combCandidates j candidate' =
let nextCandidate' = comb 2 (b!(i - j + 1)
`add` ((take j . drop (i - j + 1) $ s) ++ "1", j + 1))
in if snd nextCandidate' <= snd candidate'
then nextCandidate'
else candidate' where
comb r candidate
| r > uBound = candidate
| not (strcmp r True) = candidate
| snd nextCandidate <= snd candidate = comb (r + 1) nextCandidate
| otherwise = comb (r + 1) candidate
where
uBound = div (i + 1) j
prev = b!(i - r * j + 1)
nextCandidate = prev `add`
((take j . drop (i - j + 1) $ s) ++ show r, j + length (show r))
strcmp 1 _ = True
strcmp num bool
| (take j . drop (i - num * j + 1) $ s)
== (take j . drop (i - (num - 1) * j + 1) $ s) =
strcmp (num - 1) True
| otherwise = False
Output:
*Main> compress "aaagctgctagag"
("a3gct2ag2",9)
*Main> compress "aaabbbaaabbbaaabbbaaabbb"
("aaabbb4",7)
Linear-time attempt:
import Data.List (sortBy)
group' xxs sAccum (chr, count)
| null xxs = if null chr
then singles
else if count <= 2
then reverse sAccum ++ multiples ++ "1"
else singles ++ if null chr then [] else chr ++ show count
| [x] == chr = group' xs sAccum (chr,count + 1)
| otherwise = if null chr
then group' xs (sAccum) ([x],1)
else if count <= 2
then group' xs (multiples ++ sAccum) ([x],1)
else singles
++ chr ++ show count ++ group' xs [] ([x],1)
where x:xs = xxs
singles = reverse sAccum ++ (if null sAccum then [] else "1")
multiples = concat (replicate count chr)
sequences ws strIndex maxSeqLen = repeated' where
half = if null . drop (2 * maxSeqLen - 1) $ ws
then div (length ws) 2 else maxSeqLen
repeated' = let (sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) = repeated
in (sequence,(sequenceStart, sequenceEnd'))
repeated = foldr divide ([],(strIndex,strIndex),False) [1..half]
equalChunksOf t a = takeWhile(==t) . map (take a) . iterate (drop a)
divide chunkSize b#(sequence,(sequenceStart, sequenceEnd'),notSinglesFlag) =
let t = take (2*chunkSize) ws
t' = take chunkSize t
in if t' == drop chunkSize t
then let ts = equalChunksOf t' chunkSize ws
lenTs = length ts
sequenceEnd = strIndex + lenTs * chunkSize
newEnd = if sequenceEnd > sequenceEnd'
then sequenceEnd else sequenceEnd'
in if chunkSize > 1
then if length (group' (concat (replicate lenTs t')) [] ([],0)) > length (t' ++ show lenTs)
then (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),True)
else b
else if notSinglesFlag
then b
else (((strIndex,sequenceEnd,chunkSize,lenTs),t'):sequence, (sequenceStart,newEnd),False)
else b
addOne a b
| null (fst b) = a
| null (fst a) = b
| otherwise =
let (((start,end,patLen,lenS),sequence):rest,(sStart,sEnd)) = a
(((start',end',patLen',lenS'),sequence'):rest',(sStart',sEnd')) = b
in if sStart' < sEnd && sEnd < sEnd'
then let c = ((start,end,patLen,lenS),sequence):rest
d = ((start',end',patLen',lenS'),sequence'):rest'
in (c ++ d, (sStart, sEnd'))
else a
segment xs baseIndex maxSeqLen = segment' xs baseIndex baseIndex where
segment' zzs#(z:zs) strIndex farthest
| null zs = initial
| strIndex >= farthest && strIndex > 0 = ([],(0,0))
| otherwise = addOne initial next
where
next#(s',(start',end')) = segment' zs (strIndex + 1) farthest'
farthest' | null s = farthest
| otherwise = if start /= end && end > farthest then end else farthest
initial#(s,(start,end)) = sequences zzs strIndex maxSeqLen
areExclusive ((a,b,_,_),_) ((a',b',_,_),_) = (a' >= b) || (b' <= a)
combs [] r = [r]
combs (x:xs) r
| null r = combs xs (x:r) ++ if null xs then [] else combs xs r
| otherwise = if areExclusive (head r) x
then combs xs (x:r) ++ combs xs r
else if l' > lowerBound
then combs xs (x: reduced : drop 1 r) ++ combs xs r
else combs xs r
where lowerBound = l + 2 * patLen
((l,u,patLen,lenS),s) = head r
((l',u',patLen',lenS'),s') = x
reduce = takeWhile (>=l') . iterate (\x -> x - patLen) $ u
lenReduced = length reduce
reduced = ((l,u - lenReduced * patLen,patLen,lenS - lenReduced),s)
buildString origStr sequences = buildString' origStr sequences 0 (0,"",0)
where
buildString' origStr sequences index accum#(lenC,cStr,lenOrig)
| null sequences = accum
| l /= index =
buildString' (drop l' origStr) sequences l (lenC + l' + 1, cStr ++ take l' origStr ++ "1", lenOrig + l')
| otherwise =
buildString' (drop u' origStr) rest u (lenC + length s', cStr ++ s', lenOrig + u')
where
l' = l - index
u' = u - l
s' = s ++ show lenS
(((l,u,patLen,lenS),s):rest) = sequences
compress [] _ accum = reverse accum ++ (if null accum then [] else "1")
compress zzs#(z:zs) maxSeqLen accum
| null (fst segment') = compress zs maxSeqLen (z:accum)
| (start,end) == (0,2) && not (null accum) = compress zs maxSeqLen (z:accum)
| otherwise =
reverse accum ++ (if null accum || takeWhile' compressedStr 0 /= 0 then [] else "1")
++ compressedStr
++ compress (drop lengthOriginal zzs) maxSeqLen []
where segment'#(s,(start,end)) = segment zzs 0 maxSeqLen
combinations = combs (fst $ segment') []
takeWhile' xxs count
| null xxs = 0
| x == '1' && null (reads (take 1 xs)::[(Int,String)]) = count
| not (null (reads [x]::[(Int,String)])) = 0
| otherwise = takeWhile' xs (count + 1)
where x:xs = xxs
f (lenC,cStr,lenOrig) (lenC',cStr',lenOrig') =
let g = compare ((fromIntegral lenC + if not (null accum) && takeWhile' cStr 0 == 0 then 1 else 0) / fromIntegral lenOrig)
((fromIntegral lenC' + if not (null accum) && takeWhile' cStr' 0 == 0 then 1 else 0) / fromIntegral lenOrig')
in if g == EQ
then compare (takeWhile' cStr' 0) (takeWhile' cStr 0)
else g
(lenCompressed,compressedStr,lengthOriginal) =
head $ sortBy f (map (buildString (take end zzs)) (map reverse combinations))
Output:
*Main> compress "aaaaaaaaabbbbbbbbbaaaaaaaaabbbbbbbbb" 100 []
"a9b9a9b9"
*Main> compress "aaabbbaaabbbaaabbbaaabbb" 100 []
"aaabbb4"