I implemented the Winograd algorithm on Haskell and tried to speed up the algorithm due to strict calculations. In this I succeeded, but I completely did not understand why, adding strictness, it starts to work faster. Since my code for this algorithm is large enough, I wrote two small functions that demonstrate this problem.
module Main where
import qualified Data.Vector as V
import qualified Data.Matrix as M
import Control.DeepSeq
import Control.Exception
import System.Clock
import Data.Time
matrixCtor x y size = M.matrix size size $ \(i,j) -> x*i+y*j
group v s = foldl (\acc i ->acc + V.unsafeIndex v i * V.unsafeIndex v (i+1)) 0 [0,2..s-1]
size = 3000 :: Int
testWithForce :: IO ()
testWithForce = do
let a = matrixCtor 2 1 size
evaluate $ force a
start <- getCurrentTime
let c = V.generate size $ \j -> M.getCol (j+1) a
evaluate $ force c
let d = foldl (\acc i ->acc + group (V.unsafeIndex c i) size) 0 [0,1..(size-1)]
evaluate $ force d
end <- getCurrentTime
print (diffUTCTime end start)
testWithoutForce :: IO ()
testWithoutForce = do
let a = matrixCtor (-2) 1 size
evaluate $ force a
start <- getCurrentTime
let c = V.generate size $ \j -> M.getCol (j+1) a
let d = foldl (\acc i ->acc + group (V.unsafeIndex c i) size) 0 [0,1..(size-1)]
evaluate $ force d
end <- getCurrentTime
print (diffUTCTime end start)
main :: IO ()
main = do
testWithForce
testWithoutForce
In the implementation of the algorithm, the matrices are computed before use, just like here. In the function testWithForce I calculate the value c before it is used. In this case, the function testWithForce works faster than the function testWithoutForce. I got the following results:
0.945078s --testWithForce
1.785158s --testWithoutForce
I just can not understand why strictness in this case speeds up the work so much.
Pardon the non-answer, but make sure to control for GC: it appears that the second function may be burdened with the GC from the previous one, thereby inflating the difference.
I can reproduce what you're seeing:
$ ghc -O3 --make foo.hs && ./foo
[1 of 1] Compiling Main ( foo.hs, foo.o )
Linking foo ...
1.471109207s
2.001165795s
However, when I flipped the order of the test, the result was different:
main = do
testWithoutForce
testWithForce
$ ghc -O3 --make foo.hs && ./foo
1.626452918s
1.609818958s
So I made main GC between each test:
import System.Mem
main = do
performMajorGC
testWithForce
performMajorGC
testWithoutForce
The forced one is still faster, but the difference was massively reduced:
1.460686986s
1.581715988s
Related
This problem involves an arbitrary number of dice with each an arbitrary number of sides. We then find the maximal number of dice that can be put in a straight, see Google's Code Jam explanation. The implementation should be reasonably efficient for around 10^5 dice with each one having up to 10^6 sides.
I've been trying to solve the problem in Haskell and this is my solution. However, it is not fast enough to earn full points on the problem, so can this be optimized?
import Data.List (sort)
import Data.Foldable (foldl')
getMaxStraight :: [Int] -> Int
getMaxStraight sides =
foldl'
(\maxStraight side -> if side > maxStraight then succ maxStraight else maxStraight)
0
(sort sides)
-- Doing IO in Haskell
-- Assuming the above works perfectly, something might not perform well below
main :: IO ()
main = do
line <- getLine :: IO String
let numberOfCases = read line :: Int
in mapM_ solveCase [1 .. numberOfCases]
solveCase :: Show a => a -> IO ()
solveCase i = do
line1 <- getLine :: IO String
line2 <- getLine :: IO String
let numberOfDice = read line1 :: Int
let diceSides = map read (words line2) :: [Int]
let maxStraight = getMaxStraight diceSides
putStrLn $ "Case #" ++ show i ++ ": " ++ show maxStraight
Aside from this, I've also written a Python solution which did run in time. I'd expect that Haskell would run faster than Python. What is going on?
def get_max_straight(sides):
max_straight = 0
for side in sorted(sides):
if side > max_straight:
max_straight += 1
return max_straight
# Doing IO in Python
# This works as needed
if __name__ == '__main__':
tests_len = int(input())
for case_num in range(1, 1 + tests_len):
input() # Discard unneeded input
sides = [int(s) for s in input().split(' ')]
print(f'Case #{case_num}: {get_max_straight(sides)}')
While writing a function using iterate in Haskell, I found that an equivalent version with explicit recursion seemed noticeably faster - even though I believed that explicit recursion ought to be frowned upon in Haskell.
Similarly, I expected GHC to be able to inline/optimise list combinators appropriately so that the resulting machine code is at least similarly performing to the explicit recursion.
Here's a (different) example, which also displays the slowdown I observed.
steps m n and its variant steps' compute the number of Collatz steps n takes to reach 1, giving up after m attempts.
steps uses explicit recursion while steps' uses list functions.
import Data.List (elemIndex)
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
collatz :: Int -> Int
collatz n
| even n = n `quot` 2
| otherwise = 3 * n + 1
steps :: Int -> Int -> Maybe Int
steps m = go 0
where go k n
| n == 1 = Just k
| k == m = Nothing
| otherwise = go (k+1) (collatz n)
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps 800) $ [1..10^7]
I tested these by evaluating for all values up to 10^7, each giving up after 800 steps. On my machine (compiled with ghc -O2), explicit recursion took just under 4 seconds (3.899s) but list combinators took about 5 times longer (19.922s).
Why is explicit recursion so much better in this case, and is there a way of writing this without explicit recursion while preserving performance?
Updated: I submitted Trac 15426 for this bug.
The problem disappears if you copy the definitions of elemIndex and findIndex into your module:
import Control.Exception (evaluate)
import Control.DeepSeq (rnf)
import Data.Maybe (listToMaybe)
import Data.List (findIndices)
elemIndex :: Eq a => a -> [a] -> Maybe Int
elemIndex x = findIndex (x==)
findIndex :: (a -> Bool) -> [a] -> Maybe Int
findIndex p = listToMaybe . findIndices p
collatz :: Int -> Int
collatz n
| even n = n `quot` 2
| otherwise = 3 * n + 1
steps' :: Int -> Int -> Maybe Int
steps' m = elemIndex 1 . take m . iterate collatz
main :: IO ()
main = evaluate $ rnf $ map (steps' 800) $ [1..10^7]
The problem seems to be that these must be inlinable for GHC to get the fusion right. Unfortunately, neither of them is marked inlinable in Data.OldList.
The change to allow findIndex to participate in fusion is relatively recent (see Trac 14387) where listToMaybe was reimplemented as a foldr. So, it probably hasn't seen a lot of testing yet.
I'm a Haskell beginner and have chosen it to solve a programming task for my class, however my solution is too slow and doesn't get accepted. I'm trying to profile it and was hoping that I could get some pointers from more advanced Haskellers here.
The only other solution in my class that got accepted so far was written in Rust. I'm sure that I should be able to achieve similar performance in Haskell and I wrote horrible imperative code in the hope of improving performance, alas to no avail.
My first suspicion relates to work, where I am using forever to go over the in-degree array until I get an out-of-bounds exception. I was hoping for this to be tail-recursive and to compile to a while (true) style loop.
My second suspicion is that I/O is perhaps slowing things down.
EDIT: The problem has likely to do with my algorithm because I am not keeping a queue of nodes with indegree 0. Thank you #luqui.
EDIT2: It seems that the real bottleneck was I/O, I fixed that thanks to #Davislor.
The task is based on this: http://www.spoj.com/UKCPLAD/problems/TOPOSORT/ and I am constrained to use only the libraries in the Haskell Platform.
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE LambdaCase #-}
{-# OPTIONS_GHC -O3 #-}
import Control.Monad
import Data.Array.IO
import Data.IORef
import Data.Int
import Control.Exception
type List = []
type Node = Int32
type Edge = (Node, Node)
type Indegree = Int32
main = do
(numNodes, _) <- readPair <$> getLine
edges <- map readPair . lines <$> getContents
topo numNodes edges
-- lower bound
{-# INLINE lb #-}
lb = 1
topo :: Node -> List Edge -> IO ()
topo numNodes edges = do
result <- newIORef []
count <- newIORef 0
indegrees <- newArray (lb,numNodes) 0 :: IO (IOUArray Node Indegree)
neighbours <- newArray (lb,numNodes) [] :: IO (IOArray Node (List Node))
forM_ edges $ \(from,to) -> do
update indegrees to (+1)
update neighbours from (to:)
let work = forever $ do
z <- getNext indegrees
modifyIORef' result (z:)
modifyIORef' count (+1)
ns <- readArray neighbours z
forM_ ns $ \n -> update indegrees n pred
work `catch`
\(_ :: SomeException) -> do
count <- readIORef count
if numNodes == count
then (mapM_ (\n -> putStr (show n ++ " ")) . reverse) =<< readIORef result
else putStrLn "Sandro fails."
{-# INLINE update #-}
update a i f = do
x <- readArray a i
writeArray a i (f x)
{-# INLINE getNext #-}
getNext indegrees = getNext' indegrees =<< getBounds indegrees
{-# INLINE getNext' #-}
getNext' indegrees (lb,ub) = readArray indegrees lb >>= \case
0 -> writeArray indegrees lb (-1) >> return lb
_ -> getNext' indegrees (lb+1,ub)
readPair :: String -> (Node,Node)
{-# INLINE readPair #-}
readPair = toPair . map read . words
where toPair [x,y] = (x,y)
toPair _ = error "Only two entries per line allowed"
Example output
$ ./topo
8 9
1 4
1 2
4 2
4 3
3 2
5 2
3 5
8 2
8 6
^D
1 4 3 5 7 8 2 6
If you haven’t already, profile your program by compiling with -prof -fprof-auto and then executing with the command-line options +RTS -p. This will generate a profile *.prof that will tell you which functions the program is spending all its time in. However, I can see immediately where the biggest time-waster is. Your instincts were right: it’s the I/O.
Having done that a lot, I can guarantee you that you’ll find that it’s spending the vast majority of its time doing I/O. The first thing you should always do to speed up your program is rewrite it to use fast I/O. Haskell is a fast language, when you use the right data structures. The default I/O library in the Prelude uses singly-linked lists with lazily-evaluated thunks where each node holds a single Unicode character. That would be slow in C, too!
I’ve gotten the best results with Data.ByteString.Lazy.Char8 when the input is ASCII, and Data.ByteString.Builder to generate the output. (An alternative is Data.Text.) That gets you a lazily-evaluated list of strict character buffers on input (so interactive input and output still works), and fills a single buffer on output.
After you’ve written the skeleton of the program with fast I/O, the next step is to look at your algorithm, and especially your data structures. Use profiling to see where all the time goes. But I’d recommend you use a functional algorithm rather than trying to write imperative programs in Haskell with do.
I almost always approach problems like this in Haskell with a more functional style: in particular, my main function is almost always something similar to:
import qualified Data.ByteString.Lazy.Char8 as B8
main :: IO()
main = B8.interact ( output . compute . input )
This makes everything except the call to interact a pure function, and isolates the parsing code and the formatting code so the compute part in the middle can be independent of that.
Since this is an assignment and you want to solve the problem yourself, I’ll refrain from refactoring the program for you, but here’s an example I wrote in response to a question on another forum to perform a counting sort. It should be suitable as a skeleton for other kinds of problems.
import Data.Array.IArray (accumArray, assocs)
import Data.Array.Unboxed (UArray)
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
main :: IO()
main = B8.interact ( output . compute . input ) where
input :: B8.ByteString -> [Int]
input = map perLine . tail . B8.lines where
perLine = decode . B8.readInt
decode (Just (x, _)) = x
decode Nothing = error "Invalid input: expected integer."
compute :: [Int] -> [Int]
compute = concatMap expand . assocs . countingSort . map encode where
encode i = (i, 1)
countingSort :: [(Int, Int)] -> UArray Int Int
countingSort = accumArray (+) 0 (lower, upper)
lower = 0
upper = 1000000
expand (i,c) = replicate c i
output :: [Int] -> B8.ByteString
output = toLazyByteString . foldMap perCase where
perCase :: Int -> Builder
perCase x = intDec x <> char7 '\n'
At present, this version ran in less than half the time of anyone else’s Haskell solution for the same problem, the same holds true for the actual contest problems I’ve used it for, and the approach generalizes.
So I suggest changing the I/O to be similar to that, first, then profiling, and coming back with the profiling output if that doesn’t make enough of a difference. This might also be a good Code Review question.
Thanks to #Davislor's suggestions I managed to get it to be much faster and I also refactored the code for the better and now I actually have an m log(n) algorithm. Surprisingly this doesn't make that much of a difference—the I/O far outweighed the suboptimal complexity of the algorithm.
EDIT: got rid of unsafePerformIO and it actually runs a teeny-weeny bit faster. Plus adding -XStrict shaves off even more time.
{-# LANGUAGE Strict #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -O2 #-}
import Control.Monad
import Data.Array.IO
import Data.Int
import Data.Set (Set)
import qualified Data.Set as Set
import Data.ByteString.Builder (Builder, char7, intDec, toLazyByteString)
import qualified Data.ByteString.Lazy.Char8 as B8
import Data.Monoid ((<>))
type List = []
type Node = Int
type Edge = (Node, Node)
type Indegree = Int
main = B8.putStrLn =<< topo . map readPair . B8.lines =<< B8.getContents
readPair :: B8.ByteString -> (Node,Node)
readPair str = (x,y)
where
(Just (x, str')) = B8.readInt str
(Just (y, _ )) = B8.readInt (B8.tail str')
topo :: List Edge -> IO B8.ByteString
topo inp = do
let (numNodes, _) = head inp
edges = tail inp
indegrees <- newArray (1,numNodes) 0 :: IO (IOUArray Node Indegree)
neighbours <- newArray (1,numNodes) [] :: IO (IOArray Node (List Node))
-- setup
forM_ edges $ \(from,to) -> do
update indegrees to (+1)
update neighbours from (to:)
zeroes <- collectIndegreeZero [] indegrees =<< getBounds indegrees
processQueue (Set.fromList zeroes) [] numNodes indegrees neighbours
where
collectIndegreeZero acc indegrees (lb,ub)
| lb > ub = return acc
| otherwise = do
indegr <- readArray indegrees lb
let acc' = if indegr == 0 then (lb:acc) else acc
collectIndegreeZero acc' indegrees (lb+1,ub)
processQueue queue result numNodes indegrees neighbours = do
if null queue
then if numNodes == 0
then return . toLazyByteString . foldMap whitespace . reverse $ result
else return "Sandro fails."
else do
(node,queue) <- return $ Set.deleteFindMin queue
ns <- readArray neighbours node
queue <- foldM decrIndegrees queue ns
processQueue queue (node:result) (numNodes-1) indegrees neighbours
where
decrIndegrees :: Set Node -> Node -> IO (Set Node)
decrIndegrees q n = do
i <- readArray indegrees n
writeArray indegrees n (i-1)
return $ if i == 1 then Set.insert n q else q
whitespace x = intDec x <> char7 ' '
{-# INLINE update #-}
update a i f = do
x <- readArray a i
writeArray a i (f x)
Haskell version(1.03s):
module Main where
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Control.Monad
import Control.Applicative ((<$>))
import Data.Vector.Unboxed (Vector,(!))
import qualified Data.Vector.Unboxed as V
solve :: Vector Int -> Int
solve ar =
V.foldl' go 0 ar' where
ar' = V.zip ar (V.postscanr' max 0 ar)
go sr (p,m) = sr + m - p
main = do
t <- fmap (read . T.unpack) TIO.getLine -- With Data.Text, the example finishes 15% faster.
T.unlines . map (T.pack . show . solve . V.fromList . map (read . T.unpack) . T.words)
<$> replicateM t (TIO.getLine >> TIO.getLine) >>= TIO.putStr
F# version(0.17s):
open System
let solve (ar : uint64[]) =
let ar' =
let t = Array.scanBack max ar 0UL |> fun x -> Array.take (x.Length-1) x
Array.zip ar t
let go sr (p,m) = sr + m - p
Array.fold go 0UL ar'
let getIntLine() =
Console.In.ReadLine().Split [|' '|]
|> Array.choose (fun x -> if x <> "" then uint64 x |> Some else None)
let getInt() = getIntLine().[0]
let t = getInt()
for i=1 to int t do
getInt() |> ignore
let ar = getIntLine()
printfn "%i" (solve ar)
The above two programs are the solutions for the Stock Maximize problem and times are for the first test case of the Run Code button.
For some reason the F# version is roughly 6x faster, but I am pretty sure that if I replaced the slow library functions with imperative loops that I could speed it up by at least 3 times and more likely 10x.
Could the Haskell version be similarly improved?
I am doing the above for learning purposes and in general I am finding it difficult to figure out how to write efficient Haskell code.
If you switch to ByteString and stick with plain Haskell lists (instead of vectors) you will get a more efficient solution. You may also rewrite the solve function with a single left fold and bypass zip and right scan (1).
Overall, on my machine, I get 20 times performance improvement compared to your Haskell solution (2).
Below Haskell code performs faster than the F# code:
import Data.List (unfoldr)
import Control.Applicative ((<$>))
import Control.Monad (replicateM_)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C
parse :: ByteString -> [Int]
parse = unfoldr $ C.readInt . C.dropWhile (== ' ')
solve :: [Int] -> Int
solve xs = foldl go (const 0) xs minBound
where go f x s = if s < x then f x else s - x + f s
main = do
[n] <- parse <$> B.getLine
replicateM_ n $ B.getLine >> B.getLine >>= print . solve . parse
1. See edits for an earlier version of this answer which implements solve using zip and scanr.
2. HackerRank website shows even a larger performance improvement.
If I wanted to do that quickly in F# I would avoid all of the higher-order functions inside solve and just write a C-style imperative loop:
let solve (ar : uint64[]) =
let mutable sr, m = 0UL, 0UL
for i in ar.Length-1 .. -1 .. 0 do
let p = ar.[i]
m <- max p m
sr <- sr + m - p
sr
According to my measurements, this is 11x faster than your F#.
Then the performance is limited by the IO layer (unicode parsing) and string splitting. This can be optimised by reading into a byte buffer and writing the lexer by hand:
let buf = Array.create 65536 0uy
let mutable idx = 0
let mutable length = 0
do
use stream = System.Console.OpenStandardInput()
let rec read m =
let c =
if idx < length then
idx <- idx + 1
else
length <- stream.Read(buf, 0, buf.Length)
idx <- 1
buf.[idx-1]
if length > 0 && '0'B <= c && c <= '9'B then
read (10UL * m + uint64(c - '0'B))
else
m
let read() = read 0UL
for _ in 1UL .. read() do
Array.init (read() |> int) (fun _ -> read())
|> solve
|> System.Console.WriteLine
Just for the record, the F# version is also not optimal. I don't think it really matters at this point, but if people wanted to compare the performance, then it is worth noting that it can be made faster.
I have not tried very hard (you can certainly make it even faster by using restricted mutation, which would not be against the nature of F#), but simple change to use Seq instead of Array in the right places (to avoid allocating temporary arrays) makes the code about 2x to 3x faster:
let solve (ar : uint64[]) =
let ar' = Seq.zip ar (Array.scanBack max ar 0UL)
let go sr (p,m) = sr + m - p
Seq.fold go 0UL ar'
If you use Seq.zip, you can also drop the take call (because Seq.zip truncates the sequence automatically). Measured using #time using the following snippet:
let rnd = Random()
let inp = Array.init 100000 (fun _ -> uint64 (rnd.Next()))
for a in 0 .. 10 do ignore (solve inp) // Measure this line
I get around 150ms for the original code and something between 50-75ms using the new version.
The Problem
I want to simulate in Haskell a multivalue outputting functions. The Haskell code is generated (not hand written) - this is important information, see below:
This can be of course easly done by returning a tuple from function, like
f x y = (x+y, x-y)
But when using such function I have to know what kind of tuple it returns:
...
(out_f_1, out_f_2) = f a b
(out_g_1, out_g_2, out_g_3) = g out_f_1
...
And so on ... But while generating code, I don't know what is the type of ouput of lets say f, so right now I'm using the Data.List.Select package and simulate the above with:
import Data.List.Select
...
out_f = f a b
out_g = g (sel1 outf)
...
The problem is the performance - on my testing program, the version, which uses Data.List.Select is twice slower than the version written by hand.
This is very obvious situation, because Data.List.Select is written using classes and instances, so it uses some kind of runtime dictionary (If I'm not wrong).
(http://hackage.haskell.org/packages/archive/tuple/0.2.0.1/doc/html/src/Data-Tuple-Select.html#sel1)
The Question
I want to ask you If is it possible to somehow compile the version (which uses Data.List.Select) to be as fast as the manually crafted one?
I think there should be a switch to compiler, which will tell him to "instantiate" the classes and interfaces for each use (something like templates from C++).
Benchmarks
Test1.hs:
import qualified Data.Vector as V
import System.Environment
b :: Int -> Int
b x = x + 5
c x = b x + 1
d x = b x - 1
a x = c x + d x
main = do
putStrLn "Starting..."
args <- getArgs
let iternum = read (head args) :: Int in do
putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> a (iternum-i))
$ V.enumFromTo 1 iternum
putStrLn "Done."
compile with ghc -O3 Test1.hs
Test2.hs:
import qualified Data.Vector as V
import Data.Tuple.Select
import Data.Tuple.OneTuple
import System.Environment
b x = OneTuple $ x + 5
c x = OneTuple $ (sel1 $ b x) + 1
d x = OneTuple $ (sel1 $ b x) - 1
a x = OneTuple $ (sel1 $ c x) + (sel1 $ d x)
main = do
putStrLn "Starting..."
args <- getArgs
let iternum = read (head args) :: Int in do
putStrLn $ show $ V.foldl' (+) 0 $ V.map (\i -> sel1 $ a (iternum-i))
$ V.enumFromTo 1 iternum
putStrLn "Done."
compile with ghc -O3 Test2.hs
Results
time ./Test1 10000000 = 5.54 s
time ./Test2 10000000 = 10.06 s
I am not sure, but it might be worthwhile to try
http://www.haskell.org/ghc/docs/7.0.3/html/users_guide/pragmas.html#specialize-pragma
Ok, the results I've posted are not accurate - as #sabauma told - the two codes perform in the same time If you compile them with optimizations enabled.
The #tohava's answer is very good if you want to explicity show which functions to specialize (see the #sabauma comment above).