F# performance bug? - performance

let test1fun x = [for i in 1..x -> i]
let test2fun x y =
    [ for i in 1..x do
        for i in 1..y -> i ]
let singlesearcher i =
    let rec searcher j agg =
        if j > i
        then agg
        else searcher (j+1) (i::agg)
    searcher 1 []
let doublesearcher i j =
    let rec searcher k l agg =
        if k > i
        then searcher 1 (l+1) agg
        else if l > j
        then agg
        else searcher (k+1) l ((k,l)::agg)
    searcher 1 1 []
Executing the above with #time and 10000 for all inputs yields:
list comprehension / singlesearcher -> negligible
cross product -> 320
list comprehension cross product -> 630
Why does the nested list comprehension take roughly twice as long as the functional version?

Yes. A list comprehension is usually slower than building an F# list or array directly. (I see similar timings on my machine.)
Let's look at how they are implemented. The list comprehension version is actually quite complicated:
First, a sequence (IEnumerable<int>) is created using the comprehension syntax. This is just a lazy sequence, so little time is spent here.
Then this sequence is transformed into an F# list by something like Seq.toList. The actual time is spent here: a lot of HasNext/MoveNext and switch (state) style code is executed. With so many function calls, you cannot expect it to be fast.
The functional version, doublesearcher, on the other hand, is properly optimized into a tail-recursive loop. It is more direct than the list comprehension and involves few function calls.
Usually we don't care about these small performance differences between sequences, lists and arrays unless the operation is critical. In your example the generation happens only once, so a factor of two is not a big problem. In other cases, e.g. the dot product of two vectors, using arrays could save a lot of time because the operation is executed many times.


How to optimise Haskell code to pass HackerRank's timed-out test cases (not for any ongoing competition, just me practicing)

I have been learning Haskell for around 4 months now and I have to say, the learning curve is definitely hard (scary also :p).
After solving about 15 easy questions, today I moved to my first medium-difficulty problem on HackerRank: https://www.hackerrank.com/challenges/climbing-the-leaderboard/problem.
It has 10 test cases and I am able to pass 6 of them, but the rest fail with a timeout. The interesting part is that I can already see a few places with potential for a performance increase; for example, I am using nub to remove duplicates from a [Int]. Still, I am not able to build a mental model for algorithmic performance, mainly because I am unsure how the Haskell compiler will change my code and how laziness plays a role here.
import Data.List (nub)
getInputs :: [String] -> [String]
getInputs (_:r:_:p:[]) = [r, p]
findRating :: Int -> Int -> [Int] -> Int
findRating step _ [] = step
findRating step point (x:xs) = if point >= x then step else findRating (step + 1) point xs
solution :: [[Int]] -> [Int]
solution [rankings, points] = map (\p -> findRating 1 p duplicateRemovedRatings) points
  where duplicateRemovedRatings = nub rankings
main :: IO ()
main = interact (unlines . map show . solution . map (map read . words) . getInputs . lines)
Test Case in GHCI
:l "solution"
let i = "7\n100 100 50 40 40 20 10\n4\n5 25 50 120"
solution i // "6\n4\n2\n1\n"
Specific questions I have:
Will the duplicateRemovedRatings variable be calculated once, or on each iteration of the map function call?
In imperative programming languages I can verify the above using some sort of printing mechanism. Is there an equivalent way of doing the same in Haskell?
According to my current understanding, the complexity of this algorithm is as follows:
nub is O(n^2)
findRating is O(n)
getInputs is O(1)
solution is O(n^2)
How can I reason about this and build a mental model for performance?
If this violates community guidelines, please comment and I will delete this. Thank you for the help :)
First, to answer your questions:
Yes, duplicateRemovedRatings is computed only once. No repeated computation.
To debug-trace, you can use trace and its friends (see the docs for examples and explanation). Yes, it can be used even in pure, non-IO code. But obviously, don't use it for "normal" output.
Yes, your understanding of complexity is correct.
Now, how to pass HackerRank's tricky tests.
First, yes, you're right that nub is O(N^2). However, in this particular case you don't have to settle for that. You can use the fact that the rankings come pre-sorted to get a linear version of nub. All you have to do is skip elements while they're equal to the next one:
betterNub (x:y:rest)
  | x == y = betterNub (y:rest)
  | otherwise = x : betterNub (y:rest)
betterNub xs = xs
This gives you O(N) for betterNub itself, but it's still not good enough for HackerRank, because the overall solution is still O(N*M) - for each game you are iterating over all rankings. No bueno.
But here you can get another improvement by observing that the rankings are sorted, and searching in a sorted list doesn't have to be linear. You can use a binary search instead!
To do this, you'll have to get yourself constant-time indexing, which can be achieved by using Array instead of list.
Here's my implementation (please don't judge harshly; I realize I probably got edge cases overengineered, but hey, it works!):
import Data.Array (listArray, bounds, (!))
-- The array is sorted in descending order; returns the 0-based position of p.
findIndex arr p
  | arr!end' > p = end' + 1
  | otherwise = go start' end'
  where
    (start', end') = bounds arr
    go start end =
      let mid = (start + end) `div` 2
          midValue = arr ! mid
      in
        if midValue == p then mid
        else if mid == start then (if midValue < p then start else end)
        else if midValue < p then go start mid
        else go mid end
solution :: [[Int]] -> [Int]
solution [rankings, points] = map (\p -> findIndex duplicateRemovedRatings p + 1) points
  where duplicateRemovedRatings = toArr $ betterNub rankings
        toArr l = listArray (0, (length l - 1)) l
With this, you get O(log N) for the search itself, making the overall solution O(M * log N). And this seems to be good enough for HackerRank.
(note that I'm adding 1 to the result of findIndex - this is because the exercise requires 1-based index)
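For readers who want to check the two ideas above outside Haskell, here is a minimal Python sketch of the same approach: one linear pass that drops adjacent duplicates from the pre-sorted rankings, then a binary search per score. The names dedupe_sorted and climbing_ranks are made up for illustration, and the standard-library bisect module stands in for the hand-written search, so this is not a line-by-line translation of the code above.

```python
from bisect import bisect_right

def dedupe_sorted(xs):
    """Drop adjacent duplicates; linear time, assumes xs is already sorted."""
    out = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out

def climbing_ranks(rankings, points):
    """rankings is sorted in descending order; returns the 1-based rank of each point."""
    ascending = dedupe_sorted(rankings)[::-1]  # bisect needs ascending order
    n = len(ascending)
    # rank = 1 + number of distinct scores strictly greater than p
    return [n - bisect_right(ascending, p) + 1 for p in points]

print(climbing_ranks([100, 100, 50, 40, 40, 20, 10], [5, 25, 50, 120]))
# prints [6, 4, 2, 1], matching the sample in the question
```

Each lookup costs O(log N), so the whole run is O(N + M * log N), which is the bound described above.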
I believe Fyodor's answer is excellent for your first two and a half questions. For the final half, "How can I build a mental model for performance?", I can say that SPJ is an absolute master of writing highly technical papers in a way accessible to the smart but ignorant reader. The implementation book Implementing lazy functional languages on stock hardware is excellent and can serve as the basis of a mental execution model. There is also Okasaki's thesis, Purely functional data structures, which discusses a complementary and significantly higher-level approach to doing asymptotic complexity analyses. (Actually, I read his book, which apparently includes some extra content, so bear that in mind when deciding for yourself about this recommendation.)
Please don't be daunted by their length. I personally found they were actually actively fun to read; and the topic they cover is a big one, not compressible to a short/quick answer.

Performance of Longest Substring Without Repeating Characters in Haskell

Upon reading this Python question and proposing a solution, I tried to solve the same challenge in Haskell.
I've come up with the code below, which seems to work. However, since I'm pretty new to this language, I'd like some help in understanding whether the code is good performance-wise.
import Data.Function (on)
import Data.List (foldl')
lswrc :: String -> String
lswrc s = reverse $ fst $ foldl' step ("","") s
step ("","") c = ([c],[c])
step (maxSubstr,current) c
  | c `elem` current = step (maxSubstr,init current) c
  | otherwise = let candidate = (c:current)
                    longerThan = (>) `on` length
                    newMaxSubstr = if maxSubstr `longerThan` candidate
                                   then maxSubstr
                                   else candidate
                in (newMaxSubstr, candidate)
Some points I think could be better than they are:
I carry a pair of strings (the longest tracked, and the current candidate) but I only need the former; thinking procedurally, there's no way to escape this, but maybe FP allows another approach?
I construct (c:current) but I use it only in the else; I could make a more complicated longerThan that adds 1 to the length of its second argument, so that I can apply it to maxSubstr and current, and construct (c:current) in the else without even giving it a name.
I drop the last element of current when c is in the current string, because I'm piling up the strings with :; I could instead pattern match when checking c against the string (as in c `elem` current@(a:as)), but then when adding the new character I would have to do current ++ [c], which I know is not as performant as c:current.
I use foldl' (as I know foldl doesn't really make sense); foldr could be an alternative, but since I don't see how laziness enters this problem, I can't tell which one would be better.
Running elem on every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running length on, in the worst case, every iteration makes your algorithm Ω(n^2) (for strings with no repeats). Running init a lot makes your algorithm Ω(n*sqrt(n)) (for strings that are sqrt(n) repetitions of a sqrt(n)-long string, with every other one reversed, and assuming an O(1) elem replacement).
A better way is to pay one O(n) cost up front to copy into a data structure with constant-time indexing, and to keep a set (or similar data structure) of seen elements rather than a flat list. Like this:
import Data.Set (Set)
import Data.Vector (Vector)
import qualified Data.Set as S
import qualified Data.Vector as V
lswrc2 :: String -> String
lswrc2 "" = ""
lswrc2 s_ = go S.empty 0 0 0 0 where
    s = V.fromList s_
    n = V.length s
    at = V.unsafeIndex s
    go seen lo hi bestLo bestHi
        | hi == n = V.toList (V.slice bestLo (bestHi-bestLo+1) s)
        -- it is probably faster (possibly asymptotically so?) to use findIndex
        -- to immediately pick the correct next value of lo
        | at hi `S.member` seen = go (S.delete (at lo) seen) (lo+1) hi bestLo bestHi
        | otherwise = let rec = go (S.insert (at hi) seen) lo (hi+1) in
            if hi-lo > bestHi-bestLo then rec lo hi else rec bestLo bestHi
This should have O(n*log(n)) worst-case performance (achieving that worst case on strings with no repeats). There may be ways that are better still; I haven't thought super hard about it.
On my machine, lswrc2 consistently outperforms lswrc on random strings. On the string ['\0' .. '\100000'], lswrc takes about 40s and lswrc2 takes 0.03s. lswrc2 can handle [minBound .. maxBound] in about 0.4s; I gave up after more than 20 minutes of letting lswrc chew on that list.
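The two-pointer idea behind lswrc2 may be easier to follow in an imperative sketch. The Python below is an illustration under my own naming (longest_unique_substring is hypothetical), not a translation of the Haskell. It keeps a window s[lo..hi] with no repeated characters plus a set of the characters currently inside it. Python's built-in set is hash-based, so membership tests are expected O(1), whereas Data.Set is a balanced tree with O(log n) operations.

```python
def longest_unique_substring(s):
    """Return the first longest substring of s without repeated characters."""
    seen = set()           # characters currently inside the window s[lo..hi]
    lo = 0                 # left edge of the window
    best_lo, best_len = 0, 0
    for hi, c in enumerate(s):
        # Shrink the window from the left until c is no longer in it.
        while c in seen:
            seen.remove(s[lo])
            lo += 1
        seen.add(c)
        if hi - lo + 1 > best_len:
            best_lo, best_len = lo, hi - lo + 1
    return s[best_lo:best_lo + best_len]

print(longest_unique_substring("abcabcbb"))  # prints abc
```

Each character enters and leaves the set at most once, so the loop does a linear number of set operations overall.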

Speed up Python nested loops with conditional statements

I am converting code from MATLAB to Python in order to speed up simple operations. I have written a function which contains nested loops and a conditional statement; the purpose of the loop is to return a list of indices for the nearest elements in array x when compared to array y. I am comparing on the order of 1e5 items, which takes about 30 sec to run. Any help to speed this process up will be greatly appreciated! I have had partial success with the numba-pro automatic just-in-time compiler:
def find_nearest(x,y,idx):
    idx_old = 0
    rng1 = range(y.shape[0])
    rng2 = range(x.shape[0])
    for i in rng1:
        prev = abs(x[idx_old]-y[i])
        for j in rng2:
            if abs(x[j]-y[i]) < prev:
                prev = abs(x[j]-y[i])
                idx_old = j
        idx[i] = idx_old
    return idx
Sorry for being such a noob, I am brand new to python!
Nothing wrong with your Numba code, except that the algorithm is not as efficient as can be. Much better is to sort the x array and do a binary search, very similar to this answer and also this answer:
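import numpy as np
def find_nearest(x, y):
    # Sort x once, then binary-search every y in the sorted copy.
    indices = np.argsort(x)
    loc = np.searchsorted(x[indices], y)
    # Candidate neighbours on either side of the insertion point,
    # mapped back to positions in the original x.
    right = indices.take(loc, mode='clip')
    left = indices.take(loc-1, mode='clip')
    return np.where(abs(y-x[left]) < abs(y-x[right]), left, right)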
On my PC this is about 80x faster than even the KDTree approach for x and y having 10^6 and 10^5 elements respectively. About two-thirds of the time is spent argsort-ing the array, so I don't think you can gain much with Numba here.
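To make the sort-then-binary-search idea concrete without NumPy, here is a small standard-library sketch of the same steps. The name find_nearest_sorted is hypothetical, and sorted plus bisect_left play the roles of np.argsort and np.searchsorted. It is meant for checking the logic on small inputs, not for speed.

```python
from bisect import bisect_left

def find_nearest_sorted(x, y):
    """For each value in y, return the index in x of the nearest element."""
    order = sorted(range(len(x)), key=x.__getitem__)  # plays the role of np.argsort
    xs = [x[k] for k in order]                        # x in ascending order
    result = []
    for v in y:
        loc = bisect_left(xs, v)                      # plays the role of np.searchsorted
        right = min(loc, len(xs) - 1)                 # clip to valid positions
        left = max(loc - 1, 0)
        nearest = left if abs(v - xs[left]) < abs(v - xs[right]) else right
        result.append(order[nearest])                 # map back to the original index
    return result

print(find_nearest_sorted([10.0, 0.0, 5.0], [4.0, 9.0, -3.0, 20.0]))
# prints [2, 0, 1, 0]
```

Sorting costs O(N log N) once and each query costs O(log N), compared with O(N * M) for the nested loops in the question.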
I have found an interim solution to my problem. By implementing the scipy.spatial's kdtree I was able to cut down the run time from 32s to just under 10s. This is still four times slower than the MATLAB knnsearch algorithm; and understanding how to speed up loops with conditional statements is still important. But for the moment this revised implementation is faster:
from scipy import spatial
from numpy import matrix
tree = spatial.KDTree(matrix(x).T)
(_, idxx) = tree.query(matrix(y).T)
The arrays x and y were in flat 1d formats; the tree required queries to be in column vector form.
Any suggestions to improve the run time of the original implementation would be greatly appreciated!

How is this memoized DP table too slow for SPOJ?

SPOILERS: I'm working on http://www.spoj.pl/problems/KNAPSACK/ so don't peek if you don't want a possible solution spoiled for you.
The boilerplate:
import Data.Sequence (index, fromList)
import Data.MemoCombinators (memo2, integral)
main = interact knapsackStr
knapsackStr :: String -> String
knapsackStr str = show $ knapsack items capacity numItems
  where [capacity, numItems] = map read . words $ head ls
        ls = lines str
        items = map (makeItem . words) $ take numItems $ tail ls
Some types and helpers to set the stage:
type Item = (Weight, Value)
type Weight = Int
type Value = Int
weight :: Item -> Weight
weight = fst
value :: Item -> Value
value = snd
makeItem :: [String] -> Item
makeItem [w, v] = (read w, read v)
And the primary function:
knapsack :: [Item] -> Weight -> Int -> Value
knapsack itemsList = go
  where go = memo2 integral integral knapsack'
        items = fromList $ (0,0):itemsList
        knapsack' 0 _ = 0
        knapsack' _ 0 = 0
        knapsack' w i | wi > w = exclude
                      | otherwise = max exclude include
          where wi = weight item
                vi = value item
                item = items `index` i
                exclude = go w (i-1)
                include = go (w-wi) (i-1) + vi
And this code works; I've tried plugging in the SPOJ sample test case and it produces the correct result. But when I submit this solution to SPOJ (instead of importing Luke Palmer's MemoCombinators, I simply copy and paste the necessary parts into the submitted source), it exceeds the time limit. =/
I don't understand why; I asked earlier about an efficient way to perform 0-1 knapsack, and I'm fairly convinced that this is about as fast as it gets: a memoized function that will only recursively calculate the sub-entries that it absolutely needs in order to produce the correct result. Did I mess up the memoization somehow? Is there a slow point in this code that I am missing? Is SPOJ just biased against Haskell?
I even put {-# OPTIONS_GHC -O2 #-} at the top of the submission, but alas, it didn't help. I have tried a similar solution that uses a 2D array of Sequences, but it was also rejected as too slow.
There's one major problem which really slows this down. It's too polymorphic. Type-specialized versions of functions can be much faster than polymorphic varieties, and for whatever reason GHC isn't inlining this code to the point where it can determine the exact types in use. When I change the definition of integral to:
integral :: Memo Int
integral = wrap id id bits
I get an approximately 5-fold speedup; I think it's fast enough to be accepted on SPOJ.
This is still significantly slower than gorlum0's solution however. I suspect the reason is because he's using arrays and you use a custom trie type. Using a trie will take much more memory and also make lookups slower due to extra indirections, cache misses, etc. You might be able to make up a lot of the difference if you strictify and unbox fields in IntMap, but I'm not sure that's possible. Trying to strictify fields in BitTrie creates runtime crashes for me.
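gorlum0's solution is not reproduced here. As a rough illustration of why a flat array can beat a memo trie, the Python sketch below fills a single one-dimensional table bottom-up, so every lookup is a plain index with no extra indirection. The function name and the item list in the usage line are made up for this example.

```python
def knapsack(items, capacity):
    """0-1 knapsack; items is a list of (weight, value) pairs."""
    # best[w] = best value achievable with total weight at most w
    best = [0] * (capacity + 1)
    for weight, value in items:
        # Go from high to low capacity so each item is used at most once.
        for w in range(capacity, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[capacity]

print(knapsack([(1, 8), (2, 4), (3, 0), (2, 5), (2, 3)], 4))  # prints 13
```

The trade-off is that this visits every (item, capacity) cell, while the memoized version only evaluates the sub-problems it actually needs.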
Pure haskell memoizing code can be good, but I don't think it's as fast as doing unsafe things (at least under the hood). You might apply Lennart Augustsson's technique to see if it fares better at memoization.
One thing that slows Haskell down is IO. The String type in Haskell gives Unicode support, which we don't need for SPOJ. ByteStrings are blazing fast, so you might want to consider using them instead.

How do I optimize these ocaml functions for dynamic interval scheduling?

I have a program that solves the weighted interval scheduling problem using dynamic programming (and believe it or not, it isn't for homework). I've profiled it, and I seem to be spending most of my time filling M with p(...). Here are the functions:
let rec get_highest_nonconflicting prev count start =
  match prev with
    head :: tail ->
      if head < start then
        count
      else
        get_highest_nonconflicting tail (count - 1) start
  | [] -> 0;;
let m_array = Array.make (num_genes + 1) 0;;
let rec fill_m_array ?(count=1) ?(prev=[]) gene_spans =
  match gene_spans with
    head :: tail -> m_array.(count) <-
        get_highest_nonconflicting prev (count - 1) (get_start head);
      fill_m_array tail ~prev:(get_stop (head) :: prev) ~count:(count+1);
  | [] -> ();;
I can't really think of any ways to optimize this, and based on my knowledge of this algorithm, this seems to be the place that is likely to take up the most time. But this is also only my second OCaml program. So is there any way to optimize this?
There isn't anything obviously inefficient with your two functions. Did you expect your implementation to be faster, for instance with reference to an implementation in another language?
I was wondering about the lists passed to get_highest_nonconflicting. If you have reasons to expect that this function is often passed lists that are physically identical to previously passed lists (and this includes the sub-lists passed on recursive calls), a cache there could help.
If you expect lists that are equal but physically different, hash-consing (and then caching) could help.
