How is this memoized DP table too slow for SPOJ? - performance

SPOILERS: I'm working on http://www.spoj.pl/problems/KNAPSACK/ so don't peek if you don't want a possible solution spoiled for you.
The boilerplate:
import Data.Sequence (index, fromList)
import Data.MemoCombinators (memo2, integral)
main = interact knapsackStr
knapsackStr :: String -> String
knapsackStr str = show $ knapsack items capacity numItems
  where [capacity, numItems] = map read . words $ head ls
        ls = lines str
        items = map (makeItem . words) $ take numItems $ tail ls
Some types and helpers to set the stage:
type Item = (Weight, Value)
type Weight = Int
type Value = Int
weight :: Item -> Weight
weight = fst
value :: Item -> Value
value = snd
makeItem :: [String] -> Item
makeItem [w, v] = (read w, read v)
And the primary function:
knapsack :: [Item] -> Weight -> Int -> Value
knapsack itemsList = go
  where go = memo2 integral integral knapsack'
        items = fromList $ (0,0):itemsList
        knapsack' 0 _ = 0
        knapsack' _ 0 = 0
        knapsack' w i | wi > w = exclude
                      | otherwise = max exclude include
          where wi = weight item
                vi = value item
                item = items `index` i
                exclude = go w (i-1)
                include = go (w-wi) (i-1) + vi
And this code works; I've tried plugging in the SPOJ sample test case and it produces the correct result. But when I submit this solution to SPOJ (instead of importing Luke Palmer's MemoCombinators, I simply copy and paste the necessary parts into the submitted source), it exceeds the time limit. =/
I don't understand why; I asked earlier about an efficient way to perform 0-1 knapsack, and I'm fairly convinced that this is about as fast as it gets: a memoized function that will only recursively calculate the sub-entries that it absolutely needs in order to produce the correct result. Did I mess up the memoization somehow? Is there a slow point in this code that I am missing? Is SPOJ just biased against Haskell?
I even put {-# OPTIONS_GHC -O2 #-} at the top of the submission, but alas, it didn't help. I have tried a similar solution that uses a 2D array of Sequences, but it was also rejected as too slow.

There's one major problem that really slows this down: it's too polymorphic. Type-specialized versions of functions can be much faster than polymorphic ones, and for whatever reason GHC isn't inlining this code to the point where it can determine the exact types in use. When I change the definition of integral to:
integral :: Memo Int
integral = wrap id id bits
I get an approximately 5-fold speedup; I think it's fast enough to be accepted on SPOJ.
This is still significantly slower than gorlum0's solution, however. I suspect the reason is that he's using arrays while you use a custom trie type. Using a trie will take much more memory and also make lookups slower due to extra indirections, cache misses, etc. You might be able to make up a lot of the difference if you strictify and unbox the fields in IntMap, but I'm not sure that's possible. Trying to strictify fields in BitTrie creates runtime crashes for me.
Pure haskell memoizing code can be good, but I don't think it's as fast as doing unsafe things (at least under the hood). You might apply Lennart Augustsson's technique to see if it fares better at memoization.
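To make the array idea concrete, here is a rough sketch of memoizing the same recurrence with a lazily filled Data.Array instead of a trie (my own illustration, not gorlum0's actual submission; it reuses the Item/Weight/Value definitions from the question):
import Data.Array (array, listArray, (!))

-- Lazy-array memoization of the same recurrence (illustrative sketch only).
knapsackArr :: [Item] -> Weight -> Int -> Value
knapsackArr itemsList capacity numItems = table ! (capacity, numItems)
  where
    items = listArray (1, numItems) itemsList
    -- every cell is a thunk; demanding the answer forces only the cells it needs
    table = array ((0, 0), (capacity, numItems))
        [ ((w, i), cell w i) | w <- [0 .. capacity], i <- [0 .. numItems] ]
    cell 0 _ = 0
    cell _ 0 = 0
    cell w i
      | wi > w    = exclude
      | otherwise = max exclude include
      where
        (wi, vi) = items ! i
        exclude  = table ! (w, i - 1)
        include  = table ! (w - wi, i - 1) + vi
Because the whole table lives in one flat array, each lookup is a single index computation rather than a walk down a bit trie.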

The one thing that slows down Haskell here is IO. The String type in Haskell gives you full Unicode support, which we don't need for SPOJ, and it is slow to read and parse. ByteStrings are blazing fast, so you might want to consider using them instead.
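For instance, a minimal sketch of ByteString-based input handling for this problem (my own illustration, assuming the knapsack function above is in scope) could look like this:
import qualified Data.ByteString.Char8 as B

-- parse every whitespace-separated integer with the fast B.readInt
readInts :: B.ByteString -> [Int]
readInts = map (maybe 0 fst . B.readInt) . B.words

main :: IO ()
main = do
  (header : rest) <- fmap B.lines B.getContents
  let [capacity, numItems] = readInts header
      items = [(w, v) | [w, v] <- map readInts (take numItems rest)]
  print (knapsack items capacity numItems)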

Related

How to optimise Haskell code to pass HackerRank's timed-out test cases (not for any ongoing competition, just me practicing)

I've been learning Haskell for around 4 months now and I have to say, the learning curve is definitely steep (scary, too :p).
After solving about 15 easy questions, today I moved to my first medium-difficulty problem on HackerRank: https://www.hackerrank.com/challenges/climbing-the-leaderboard/problem.
It has 10 test cases and I am able to pass 6 of them, but the rest fail with a timeout. The interesting part is that I can already see a few places with potential for a performance increase; for example, I am using nub to remove duplicates from a [Int]. Still, I am not able to build a mental model for algorithmic performance, the main cause being that I'm unsure how the Haskell compiler will transform my code and how laziness plays a role here.
import Data.List (nub)
getInputs :: [String] -> [String]
getInputs (_:r:_:p:[]) = [r, p]
findRating :: Int -> Int -> [Int] -> Int
findRating step _ [] = step
findRating step point (x:xs) = if point >= x then step else findRating (step + 1) point xs
solution :: [[Int]] -> [Int]
solution [rankings, points] = map (\p -> findRating 1 p duplicateRemovedRatings) points
  where duplicateRemovedRatings = nub rankings
main :: IO ()
main = interact (unlines . map show . solution . map (map read . words) . getInputs . lines)
Test case in GHCi:
:l solution
let i = "7\n100 100 50 40 40 20 10\n4\n5 25 50 120"
(unlines . map show . solution . map (map read . words) . getInputs . lines) i -- "6\n4\n2\n1\n"
Specific questions I have:
Will the duplicateRemovedRatings variable be calculated once, or on each iteration of the map function call?
In imperative programming languages I could verify the above with some sort of printing mechanism; is there an equivalent way of doing that in Haskell?
According to my current understanding, the complexity of this algorithm would be:
nub is O(n^2)
findRating is O(n)
getInputs is O(1)
solution is O(n^2)
How can I reason about this and build a mental model for performance?
If this violates community guidelines, please comment and I will delete this. Thank you for the help :)
First, to answer your questions:
Yes, duplicateRemovedRatings is computed only once. No repeated computation.
To debug-trace, you can use trace and its friends (see the docs for examples and explanation). Yes, it can be used even in pure, non-IO code. But obviously, don't use it for "normal" output.
Yes, your understanding of complexity is correct.
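To illustrate the trace suggestion above: a hypothetical tweak to the solution from the question, wrapping the shared value in trace, makes the sharing visible:
import Debug.Trace (trace)

solution :: [[Int]] -> [Int]
solution [rankings, points] = map (\p -> findRating 1 p duplicateRemovedRatings) points
  where
    -- the message goes to stderr the first (and only) time this thunk is forced
    duplicateRemovedRatings = trace "running nub" (nub rankings)
No matter how many points are looked up, "running nub" is printed only once, because the where-bound value is a single shared thunk.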
Now, how to pass HackerRank's tricky tests.
First, yes, you're right that nub is O(N^2). However, in this particular case you don't have to settle for that. You can use the fact that the rankings come pre-sorted to get a linear version of nub. All you have to do is skip elements while they're equal to the next one:
betterNub (x:y:rest)
  | x == y = betterNub (y:rest)
  | otherwise = x : betterNub (y:rest)
betterNub xs = xs
This gives you O(N) for betterNub itself, but it's still not good enough for HackerRank, because the overall solution is still O(N*M) - for each game you are iterating over all rankings. No bueno.
But here you can get another improvement by observing that the rankings are sorted, and searching in a sorted list doesn't have to be linear. You can use a binary search instead!
To do this, you'll have to get yourself constant-time indexing, which can be achieved by using Array instead of list.
Here's my implementation (please don't judge harshly; I realize I probably got edge cases overengineered, but hey, it works!):
import Data.Array (listArray, bounds, (!))
findIndex arr p
  | arr!end' > p = end' + 1
  | otherwise = go start' end'
  where
    (start', end') = bounds arr
    go start end =
      let mid = (start + end) `div` 2
          midValue = arr ! mid
      in
        if midValue == p then mid
        else if mid == start then (if midValue < p then start else end)
        else if midValue < p then go start mid
        else go mid end

solution :: [[Int]] -> [Int]
solution [rankings, points] = map (\p -> findIndex duplicateRemovedRatings p + 1) points
  where duplicateRemovedRatings = toArr $ betterNub rankings
        toArr l = listArray (0, length l - 1) l
With this, you get O(log N) for the search itself, making the overall solution O(M * log N). And this seems to be good enough for HackerRank.
(note that I'm adding 1 to the result of findIndex - this is because the exercise requires 1-based index)
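For reference, a quick check in GHCi against the sample input from the question (with betterNub and the code above in scope):
ghci> solution [[100, 100, 50, 40, 40, 20, 10], [5, 25, 50, 120]]
[6,4,2,1]
which matches the expected 6, 4, 2, 1.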
I believe Fyodor's answer is excellent for your first two and a half questions. For the final half, "How can I build a mental model for performance?", I can say that SPJ is an absolute master of writing highly technical papers in a way accessible to the smart but ignorant reader. The implementation book Implementing lazy functional languages on stock hardware is excellent and can serve as the basis of a mental execution model. There is also Okasaki's thesis, Purely functional data structures, which discusses a complementary and significantly higher-level approach to doing asymptotic complexity analyses. (Actually, I read his book, which apparently includes some extra content, so bear that in mind when deciding for yourself about this recommendation.)
Please don't be daunted by their length. I personally found they were actually actively fun to read; and the topic they cover is a big one, not compressible to a short/quick answer.

Performance of Longest Substring Without Repeating Characters in Haskell

Upon reading this Python question and proposing a solution, I tried to solve the same challenge in Haskell.
I've come up with the code below, which seems to work. However, since I'm pretty new to this language, I'd like some help in understanding whether the code is good performance-wise.
import Data.List (foldl')
import Data.Function (on)

lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
  where
    step ("","") c = ([c],[c])
    step (maxSubstr,current) c
      | c `elem` current = step (maxSubstr,init current) c
      | otherwise =
          let candidate = (c:current)
              longerThan = (>) `on` length
              newMaxSubstr = if maxSubstr `longerThan` candidate
                               then maxSubstr
                               else candidate
          in (newMaxSubstr, candidate)
Some points I think could be better than they are
I carry on a pair of strings (the longest tracked, and the current candidate) but I only need the former; thinking procedurally, there's no way to escape this, but maybe FP allows another approach?
I construct (c:current) but I use it only in the else; I could make a more complicated longerThan that adds 1 to the length of its second argument, so that I can apply it to maxSubstr and current, and construct (c:current) only in the else branch, without even giving it a name.
I drop the last element of current when c is in the current string, because I'm piling up the strings with :; I could instead pattern match when checking c against the string (as in c `elem` current@(a:as)), but then when adding the new character I would have to do current ++ [c], which I know is not as performant as c:current.
I use foldl' (as I know foldl doesn't really make sense); foldr could be an alternative, but since I don't see how laziness enters this problem, I can't tell which one would be better.
Running elem on every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running length on, in the worst case, every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running init a lot makes your algorithm Ω(n*sqrt(n)) (for strings that are sqrt(n) repetitions of a sqrt(n)-long string, with every other one reversed, and assuming an O(1) elem replacement).
A better way is to pay one O(n) cost up front to copy into a data structure with constant-time indexing, and to keep a set (or similar data structure) of seen elements rather than a flat list. Like this:
import Data.Set (Set)
import Data.Vector (Vector)
import qualified Data.Set as S
import qualified Data.Vector as V
lswrc2 :: String -> String
lswrc2 "" = ""
lswrc2 s_ = go S.empty 0 0 0 0 where
  s = V.fromList s_
  n = V.length s
  at = V.unsafeIndex s
  go seen lo hi bestLo bestHi
    | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
    -- it is probably faster (possibly asymptotically so?) to use findIndex
    -- to immediately pick the correct next value of lo
    | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
    | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
        if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi
This should have O(n*log(n)) worst-case performance (achieving that worst case on strings with no repeats). There may be ways that are better still; I haven't thought super hard about it.
On my machine, lswrc2 consistently outperforms lswrc on random strings. On the string ['\0' .. '\100000'], lswrc takes about 40s and lswrc2 takes 0.03s. lswrc2 can handle [minBound .. maxBound] in about 0.4s; I gave up after more than 20 minutes of letting lswrc chew on that list.
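As a quick sanity check, the two classic examples from the problem statement behave as expected in GHCi:
ghci> lswrc2 "abcabcbb"
"abc"
ghci> lswrc2 "pwwkew"
"wke"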

solving sparse Ax=b in scipy

I need to solve Ax=b where A is the matrix that represents finite difference method for PDEs. Typical size of A for a 2D problem is around (256^2)x(256^2), and it consists of a few diagonals. The following sample code is how I construct A:
N = Nx*Ny # Nx is the number of columns (size in x-direction), Ny is the number of rows (size in y-direction)
# finite difference in x-direction
up1 = (0.5)*c
up1[Nx-1::Nx] = 0
down1 = (-0.5)*c
down1[::Nx] = 0
matX = diags([down1[1:], up1[:-1]], [-1,1], format='csc')
# finite difference in y-direction
up1 = (0.5)*c
down1 = (-0.5)*c
matY = diags([down1[Nx:], up1[:N-Nx]], [-Nx,Nx], format='csc')
Adding matX and matY together results in four diagonals. The above is for second-order discretization. For fourth-order discretization, then I have eight diagonals. If I have second derivative, then the main diagonal is nonzero as well.
I use the following code to do the actual solving:
# Initialize A_fixed, B_fixed
if const is True:  # the potential term V(x) is time-independent
    A = A_fixed + sp.sparse.diags(V_func(x))
    B = B_fixed + sp.sparse.diags(V_func(x))
    A_factored = sp.sparse.linalg.factorized(A)

for idx, t in enumerate(t_steps[1:], 1):
    # Solve Ax = b = B psi_old
    if const is False:  # time-dependent potential: rebuild A and B every step
        A = A_fixed + sp.sparse.diags(V_func(x, t))
        B = B_fixed + sp.sparse.diags(V_func(x, t))
    psi_n = B.dot(psi_old)
    if const is True:
        psi_new = A_factored(psi_n)
    else:
        psi_new = sp.sparse.linalg.spsolve(A, psi_n, use_umfpack=False)
    psi_old = psi_new.copy()
I have a couple of questions:
What's the best way to solve Ax=b in scipy? I use spsolve from the sp.sparse.linalg library, which uses LU decomposition. I tried using the standard sp.linalg.solve for dense matrices, but it's considerably slower. I also tried all the other iterative solvers in the sp.sparse.linalg library, but they are slower as well (for 1000x1000 they all take a couple of seconds, compared to less than half a second for spsolve, and my A is likely to be a lot bigger). Are there any alternative ways to do the computation?
Edit: The problem I'm trying to solve is actually the time-dependent Schrodinger equation. If the potential term is time-independent, then as suggested I can pre-factorize the matrix A first to speed up the code, but this doesn't work if the potential is time-varying, as I need to change the diagonal terms of both matrices A and B at each time step. For this specific problem, are there any ways to speed up the code, using something similar to pre-factorization or some other approach?
I have installed umfpack. I tried setting use_umfpack to True and False to test it, but surprisingly use_umfpack=True takes nearly twice as long as use_umfpack=False. I thought having this package was supposed to give a speed-up. Any idea why that's the case? (PS: I am using Anaconda Spyder to run the code, if that makes any difference.)
I have used cProfile to profile my code, and nearly all the time is spent on the line with spsolve. So I think the remaining part of the code (matrix/problem initialization) is pretty much optimized, and it's the matrix-solving part that needs to be improved.
Thanks.

Find keystrokes for an on-screen keyboard in Scala

I am trying to solve a recent interview question using Scala.
You have an on-screen keyboard which is a grid of 6 rows and 5 columns, with the letters A to Z and a blank space arranged in the grid row-first.
You can use this on-screen keyboard to type words with your TV remote, pressing the Left, Right, Up, Down, or OK keys to type each character.
Question: given an input string, find the sequence of keystrokes that need to be pressed on the remote to type the input.
The code implementation can be found at
https://github.com/mradityagoyal/scala/blob/master/OnScrKb/src/main/scala/OnScrKB.scala
I have tried to solve this using three different approaches.
Simple foldLeft:
def keystrokesByFL(input: String, startChar: Char = 'A'): String = {
  val zero = ("", startChar)
  // (acc, last) + next => (acc + aToB, next)
  def op(zero: (String, Char), next: Char): (String, Char) = zero match {
    case (acc, last) => (acc + path(last, next), next)
  }
  val result = input.foldLeft(zero)(op)
  result._1
}
Divide and conquer - uses a divide-and-conquer mechanism; the algorithm is similar to merge sort:
* We split the input word into two if the length is > 3.
* We recursively call the subroutine to get the paths of the left and right halves from the split.
* In the end we add the keystrokes for the first half + the keystrokes from the end of the first string to the start of the second string + the keystrokes for the second half.
* Essentially we divide the input string into two smaller halves until we get down to size 4; for anything smaller than 4 we use the foldLeft version above.
def keystrokesByDnQ(input: String, startChar: Char = 'A'): String = {
  def splitAndMerge(in: String, startChar: Char): String = {
    if (in.length() < 4) {
      // if length is < 4 then don't split, as you might end up with one side having only 1 char
      keystrokesByFL(in, startChar)
    } else {
      // split
      val (x, y) = in.splitAt(in.length() / 2)
      splitAndMerge(x, startChar) + splitAndMerge(y, x.last)
    }
  }
  splitAndMerge(input, startChar)
}
Fold - uses the property that the underlying operation is associative (but not commutative). For example:
keystrokes("ABCDEFGHI", startChar = 'A') == keystrokes("ABC", startChar = 'A') + keystrokes("DEF", 'C') + keystrokes("GHI", 'F')
def keystrokesByF(input: String, startChar: Char = 'A'): String = {
  val mapped = input.map { x => PathAcc(text = "" + x, path = "") } // map each character in input to PathAcc("CharAsString", "")
  val z = PathAcc(text = "" + startChar, path = "") // the starting char
  def op(left: PathAcc, right: PathAcc): PathAcc = {
    PathAcc(text = left.text + right.text, path = left.path + path(left.text.last, right.text.head) + right.path)
  }
  val foldresult = mapped.fold(z)(op)
  foldresult.path
}
My questions:
1. Is the divide and conquer approach better than Fold?
2. Are Fold and divide and conquer better than foldLeft (for this specific problem)?
3. Is there a way I can represent the divide and conquer approach or the Fold approach as a monad? I can see the associative law being satisfied, but I am not able to figure out if a monoid is present here, and if yes, what it achieves for me.
4. Is the divide and conquer approach the best one available for this particular problem?
5. Which approach is better suited for Spark?
Any suggestions are welcome..
Here's how I would do it:
def keystrokes(input: String, start: Char): String =
((start + input) zip input).par.map((path _).tupled).fold("")(_ ++ _)
The main point here is using the par method to parallelize the sequence of (Char, Char), so that it can parallelize the map, and take the optimal implementation for fold.
The algorithm simply takes the characters in the String two by two (representing the units of path to be walked), computes the path between them, and then concatenates the results. Note that fold("")(_ ++ _) is basically mkString (although mkString on a parallel collection is implemented by seq.mkString, so it is much less efficient).
What your implementations dearly miss is parallelization of tasks. Even in your divide-and-conquer approach, you never run code in parallel, so you will wait for the first half to be finished before starting the second half (even though they are totally independent).
Assuming you use parallelization, the classical implementation of fold on parallel sequences is precisely the divide-and-conquer algorithm you described, but it may be that it is better optimized (for instance, it may choose another value than 3 for the chunk size; I tend to trust the scala-collection implementers on these matters).
Note that fold on String is probably implemented with foldLeft, so there is no added value over what you did with foldLeft, unless you use .par first.
Back to your questions (I'll mostly repeat what I just said):
1) Yes, the divide and conquer is better than fold... on String (but not on parallelized String)
2) Fold can only be better than FoldLeft with some kind of parallelization, in which case it will be as good as (or better than, if there is a better implementation for a particular parallelized collection) divide-and-conquer.
3) I don't see what monads have to do with anything here. The operator and zero for fold must indeed form a monoid (otherwise, you'll have problems with operation ordering if the operator is not associative, and unwanted noise if zero is not a neutral element).
4) Yes, that I know of, once parallelized
5) Spark is inherently parallel, so the main issue would be to join all the pieces back together in the end. What I mean is that an RDD is not ordered, so you'll need to keep some information on which piece of input should be put where in your cluster. Once you've done that correctly (using partitions and such, this would probably be a whole question itself), using map and fold still works as a charm (Spark was designed to have an API as close as possible to scala-collection, so that's really nice here).

Speed up Python nested loops with conditional statements

I am converting code from MATLAB to Python in order to speed up simple operations. I have written a function which contains nested loops and a conditional statement; the purpose of the loop is to return a list of indices for the nearest elements in array x when compared to array y. I am comparing on the order of 1e5 items, which takes about 30 seconds to run. Any help to speed this process up would be greatly appreciated! I have had partial success with the numba-pro automatic just-in-time compiler:
@autojit()
def find_nearest(x, y, idx):
    idx_old = 0
    rng1 = range(y.shape[0])
    rng2 = range(x.shape[0])
    for i in rng1:
        prev = abs(x[idx_old] - y[i])
        for j in rng2:
            if abs(x[j] - y[i]) < prev:
                prev = abs(x[j] - y[i])
                idx_old = j
        idx[i] = idx_old
    return idx
Sorry for being such a noob, I am brand new to python!
Nothing wrong with your Numba code, except that the algorithm is not as efficient as it could be. Much better is to sort the x array and do a binary search, very similar to this answer and also this answer:
import numpy as np

def find_nearest(x, y):
    indices = np.argsort(x)
    loc = np.searchsorted(x[indices], y)
    right = indices.take(loc, mode='clip')
    left = indices.take(loc-1, mode='clip')
    return np.where(abs(y - x[left]) < abs(y - x[right]), left, right)
On my PC this is about 80x faster than even the KDTree approach for x and y having 10^6 and 10^5 elements respectively. About two-thirds of the time is spent argsort-ing the array, so I don't think you can gain much with Numba here.
I have found an interim solution to my problem. By using scipy.spatial's KDTree I was able to cut the run time down from 32 s to just under 10 s. This is still four times slower than the MATLAB knnsearch algorithm, and understanding how to speed up loops with conditional statements is still important. But for the moment this revised implementation is faster:
from scipy import spatial
from numpy import matrix
tree = spatial.KDTree(matrix(x).T)
(_, idxx) = tree.query(matrix(y).T)
The arrays x and y were in flat 1d formats; the tree required queries to be in column vector form.
Any suggestions to improve the run time of the original implementation would be greatly appreciated!

Resources