How to achieve Asynchrony instead of Parallelism in F# - performance

(Sticking to a common example with async fetch of many web pages)
How would I spin off multiple (hundreds) of web page requests asynchronously, and then wait for all requests to complete before going to the next step? Async.AsParallel processes a few requests at a time, controlled by number of cores on the CPU. Grabbing a web page is not a CPU-bound operation. Not satisfied with the speedup of Async.AsParallel, I am looking for alternatives.
I tried to connect the dots between Async.StartAsTask and Task[].WaitAll. Instinctively, I wrote the following code, but it does not compile.
let processItemsConcurrently (items : int seq) =
let tasks = items |> Seq.map (fun item -> Async.StartAsTask(fetchAsync item))
Tasks.Task.WaitAll(tasks)
How would you approach this?

Async.Parallel is almost definitely right here. Not sure what you're not happy with; the strength of F# asyncs lies more in async computing than in task-parallel CPU-bound stuff (which is more tailored to Tasks and the .NET 4.0 TPL). Here's a full example:
open System.Diagnostics
open System.IO
open System.Net
open Microsoft.FSharp.Control.WebExtensions
let sites = [|
"http://bing.com"
"http://google.com"
"http://cnn.com"
"http://stackoverflow.com"
"http://yahoo.com"
"http://msdn.com"
"http://microsoft.com"
"http://apple.com"
"http://nfl.com"
"http://amazon.com"
"http://ebay.com"
"http://expedia.com"
"http://twitter.com"
"http://reddit.com"
"http://hulu.com"
"http://youtube.com"
"http://wikipedia.org"
"http://live.com"
"http://msn.com"
"http://wordpress.com"
|]
let print s =
// careful, don't create a synchronization bottleneck by printing
//printf "%s" s
()
let printSummary info fullTimeMs =
Array.sortInPlaceBy (fun (i,_,_) -> i) info
// for i, size, time in info do
// printfn "%2d %7d %5d" i size time
let longest = info |> Array.map (fun (_,_,time) -> time) |> Array.max
printfn "longest request took %dms" longest
let bytes = info |> Array.sumBy (fun (_,size,_) -> float size)
let seconds = float fullTimeMs / 1000.
printfn "sucked down %7.2f KB/s" (bytes / 1024.0 / seconds)
let FetchAllSync() =
let allsw = Stopwatch.StartNew()
let info = sites |> Array.mapi (fun i url ->
let sw = Stopwatch.StartNew()
print "S"
let req = WebRequest.Create(url)
use resp = req.GetResponse()
use stream = resp.GetResponseStream()
use reader = new StreamReader(stream,
System.Text.Encoding.UTF8, true, 4096)
print "-"
let contents = reader.ReadToEnd()
print "r"
i, contents.Length, sw.ElapsedMilliseconds)
let time = allsw.ElapsedMilliseconds
printSummary info time
time, info |> Array.sumBy (fun (_,size,_) -> size)
let FetchAllAsync() =
let allsw = Stopwatch.StartNew()
let info = sites |> Array.mapi (fun i url -> async {
let sw = Stopwatch.StartNew()
print "S"
let req = WebRequest.Create(url)
use! resp = req.AsyncGetResponse()
use stream = resp.GetResponseStream()
use reader = new AsyncStreamReader(stream, // F# PowerPack
System.Text.Encoding.UTF8, true, 4096)
print "-"
let! contents = reader.ReadToEnd() // in F# PowerPack
print "r"
return i, contents.Length, sw.ElapsedMilliseconds })
|> Async.Parallel
|> Async.RunSynchronously
let time = allsw.ElapsedMilliseconds
printSummary info time
time, info |> Array.sumBy (fun (_,size,_) -> size)
// By default, I think .NET limits you to 2 open connections at once
ServicePointManager.DefaultConnectionLimit <- sites.Length
for i in 1..3 do // to warmup and show variance
let time1,r1 = FetchAllSync()
printfn "Sync took %dms, result was %d" time1 r1
let time2,r2 = FetchAllAsync()
printfn "Async took %dms, result was %d (speedup=%2.2f)"
time2 r2 (float time1/ float time2)
printfn ""
On my 4-core box, this consistently gives a nearly 4x speedup.
EDIT
In reply to your comment, I've updated the code. You're right in that I've added more sites and am not seeing the expected speedup (still holding steady around 4x). I've started adding a little debugging output above, will continue investigating to see if something else is throttling the connections...
EDIT
Editted the code again. Well, I found what might be the bottleneck. Here's the implementation of AsyncReadToEnd in the PowerPack:
type System.IO.StreamReader with
member s.AsyncReadToEnd () =
FileExtensions.UnblockViaNewThread (fun () -> s.ReadToEnd())
In other words, it just blocks a threadpool thread and reads synchronously. Argh!!! Let me see if I can work around that.
EDIT
Ok, the AsyncStreamReader in the PowerPack does the right thing, and I'm using that now.
However, the key issue seems to be variance.
When you hit, say, cnn.com, a lot of the time the result will come back in like 500ms. But every once in a while you get that one request that takes 4s, and this of course potentially kills the apparent async perf, since the overall time is the time of the unluckiest request.
Running the program above, I see speedups from about 2.5x to 9x on my 2-core box at home. It is very highly variable, though. It's still possible there's some bottleneck in the program that I've missed, but I think the variance-of-the-web may account for all of what I'm seeing at this point.

Using the Reactive Extensions for .NET combined with F#, you can write a very elegant solution - check out the sample at http://blog.paulbetts.org/index.php/2010/11/16/making-async-io-work-for-you-reactive-style/ (this uses C#, but using F# is easy too; the key is using the Begin/End methods instead of the sync method, which even if you can make it compile, it will block up n ThreadPool threads unnecessarily, instead of the Threadpool just picking up completion routines as they come in)

My bet is that the speedup you're experiencing is not significant enough for your taste because you're either using a subtype of WebRequest or a class relying on it (such as WebClient).
If that's the case, you need to set the MaxConnection on the ConnectionManagementElement (and I suggest you only set it if needed otherwise it's gonna become a pretty time-consuming operation) to a high value, depending on the number of simultaneous connections you wanna initiate from your application.

I'm not an F# guy, but from a pure .NET perspective what you're looking for is TaskFactory::FromAsync where the asynchronous call you'd be wrapping in a Task would be something like HttpRequest::BeginGetResponse. You could also wrap up the EAP model that WebClient exposes using a TaskCompletionSource. More on both of these topics here on MSDN.
Hopefully with this knowledge you can find the nearest native F# approach to accomplish what you're trying to do.

Here's some code that avoids the unknowns, such as web access latency. I am getting under 5% CPU utilization, and about 60-80% efficiency for both sync and async code paths.
open System.Diagnostics
let numWorkers = 200
let asyncDelay = 50
let main =
let codeBlocks = [for i in 1..numWorkers ->
async { do! Async.Sleep asyncDelay } ]
while true do
printfn "Concurrent started..."
let sw = new Stopwatch()
sw.Start()
codeBlocks |> Async.Parallel |> Async.RunSynchronously |> ignore
sw.Stop()
printfn "Concurrent in %d millisec" sw.ElapsedMilliseconds
printfn "efficiency: %d%%" (int64 (asyncDelay * 100) / sw.ElapsedMilliseconds)
printfn "Synchronous started..."
let sw = new Stopwatch()
sw.Start()
for codeBlock in codeBlocks do codeBlock |> Async.RunSynchronously |> ignore
sw.Stop()
printfn "Synchronous in %d millisec" sw.ElapsedMilliseconds
printfn "efficiency: %d%%" (int64 (asyncDelay * numWorkers * 100) / sw.ElapsedMilliseconds)
main

Related

F# Array.Parallel.map does not provide parallel processing

I have to simulate a discrete environment in F#, to be called by Python, for a reinforcement learning problem. I had a function with primitive types (mainly float) to make the exchange of data smoother. Now I am in the position to run this function many times with different data, so to run it in parallel seems a good idea.
I have the following code:
type AscentStrategy = |Strategy of seq<float>
let simulateAscent env ascentLimiter initState (sequenceOfDepths:seq<float>) =
//let infinitSeqOfConstantValues = (fun _ -> constantDepth) |> Seq.initInfinite
sequenceOfDepths
|> Seq.scan ( fun ( nextState, rew, isTerminal, _ ) depth -> getNextEnvResponseAndBoundForNextAction(env, nextState , depth , ascentLimiter) ) ( initState, 0.0 , false, 0.0)
|> SeqExtension.takeWhileWithLast (fun (_ , _, isTerminalState, _) -> not isTerminalState)
|> Seq.toArray
and then
let simulateStrategy ({MaxPDCS = maxPDCS ; MaxSimTime = maximumSimulationTime ; PenaltyForExceedingRisk = penaltyForExceedingRisk ;
RewardForDelivering = rewardForDelivering ; PenaltyForExceedingTime = penaltyForExceedingTime ; IntegrationTime = integrationTime
ControlToIntegrationTimeRatio = controlToIntegrationTimeRatio; DescentRate = descentRate; MaximumDepth = maximumDepth ;
BottomTime = bottomTime ; LegDiscreteTime = legDiscreteTime } : SimulationParameters) (Strategy ascentStrategy : AscentStrategy) =
let env, initState , ascentLimiter , _ = getEnvInitStateAndAscentLimiter ( maxPDCS , maximumSimulationTime ,
penaltyForExceedingRisk , rewardForDelivering , penaltyForExceedingTime ,
integrationTime ,
controlToIntegrationTimeRatio,
descentRate ,
maximumDepth ,
bottomTime ,
legDiscreteTime )
ascentStrategy
|> simulateAscent env ascentLimiter initState
finally I call the function for testing:
let commonSimulationParameters = {MaxPDCS = 0.32 ; MaxSimTime = 2000.0 ; PenaltyForExceedingRisk = 1.0 ; RewardForDelivering = 10.0; PenaltyForExceedingTime = 0.5 ;
IntegrationTime = 0.1; ControlToIntegrationTimeRatio = 10; DescentRate = 60.0; MaximumDepth = 20.0 ; BottomTime = 10.0; LegDiscreteTime = 0.1}
printfn"insert number of elements"
let maxInputsString = Console.ReadLine()
let maxInputs = maxInputsString |> Double.Parse
let inputsStrategies = [|0.0 .. maxInputs|] |> Array.map (fun x -> Seq.initInfinite (fun _ -> x ) )
let testParallel = inputsStrategies
|> Array.Parallel.map (fun x -> (simulateStrategy commonSimulationParameters ( Strategy x )) )
I have compared this with Array.map and, while it is faster and uses 70% of the CPU on my laptop, still does not seem to use the whole processing power. I have run it on a machine with many more cores ( ~50) and it barely increases the CPU usage (it gets up to 3/4% of total usage with 50ish independent inputs). I think there must be a deadlock generated somewhere, but how can I detect and get rid of it?
Also, why does this happen? One of the advantages of functional programming, as I see it, is also to be able to parallelize easily.
PS: SeqExtension.takeWhileWithLast is a function I have found on SO, kindly provided by Tomas Petricek in one of his brilliant answers, if needed I can post it.
PPS: env is the environment, whose type is defined as:
type Environment<'S, 'A ,'I> = |Environment of (State<'S> -> Action<'A> -> EnvironmentOutput<'S ,'I>)
I have tried the same with Async.Parallel and ParallelSeq, reporting the same problem.
Would a message-based solution solve the problem>? I am looking into it, although I am not familiar at all, but would it be a good way of getting the code parallel, using MailboxProcessor?
Following my question,
I have tried also this great library for parallel code, based on streams of data. https://nessos.github.io/Streams/.
I have added the following code:
let nessosResult = inputsStrategies
|> ParStream.ofArray
|> ParStream.map simulateStrategy
|> ParStream.toArray
I have defined an ad hoc type for inputStrategy (basic the old tuple I had) so that simulateStrategy accepts only one input. Unfortunately the problem seems very well hidden somewhere. I attach a graph with CPU usage. Time spent on my machine for the different cases is: ~8.8 sec (sequential); ~6.2 sec (Array.Parallel.map); ~ 6.1 sec (Nessos.Streams)
I have found that server garbage collection is necessary to get the best parallel performance on .NET. Something like this in your app.config:
<configuration>
<runtime>
<gcServer enabled="true" />
</runtime>
</configuration>

F# Type Provider slows down intellisense in Visual Studio 2017

I have a very simple type provider; all types are erased, the provided type has 2000 int readonly properties Tag1..Tag2000
let ns = "MyNamespace"
let asm = Assembly.GetExecutingAssembly()
let private newProperty t name getter isStatic = ProvidedProperty(name, t, getter, isStatic = isStatic)
let private newStaticProperty t name getter = newProperty t name (fun _ -> getter) true
let private newInstanceProperty t name getter = newProperty t name (fun _ -> getter) false
let private addStaticProperty t name getter (``type``:ProvidedTypeDefinition) = ``type``.AddMember (newStaticProperty t name getter); ``type``
let private addInstanceProperty t name getter (``type``:ProvidedTypeDefinition) = ``type``.AddMember (newInstanceProperty t name getter); ``type``
[<TypeProvider>]
type TypeProvider(config : TypeProviderConfig) as this =
inherit TypeProviderForNamespaces(config)
let provider = ProvidedTypeDefinition(asm, ns, "Provider", Some typeof<obj>, hideObjectMethods = true)
let tags = ProvidedTypeDefinition(asm, ns, "Tags", Some typeof<obj>, hideObjectMethods = true)
do [1..2000] |> Seq.iter (fun i -> addInstanceProperty typeof<int> (sprintf "Tag%d" i) <## i ##> tags |> ignore)
do provider.DefineStaticParameters([ProvidedStaticParameter("Host", typeof<string>)], fun name args ->
let provided = ProvidedTypeDefinition(asm, ns, name, Some typeof<obj>, hideObjectMethods = true)
addStaticProperty tags "Tags" <## obj() ##> provided |> ignore
provided
)
do this.AddNamespace(ns, [provider; tags])
Then a test project with two modules in separate files:
module Common
open MyNamespace
type Provided = Provider<"">
let providedTags = Provided.Tags
type LocalTags() =
member this.Tag1 with get() : int = 1
member this.Tag2 with get() : int = 2
.
.
member this.Tag1999 with get() : int = 1999
member this.Tag2000 with get() : int = 2000
let localTags = LocalTags()
module Tests
open Common
open Xunit
[<Fact>]
let ProvidedTagsTest () =
Assert.Equal<int>(providedTags.Tag1001, 1001)
[<Fact>]
let LocalTagsTest () =
Assert.Equal<int>(localTags.Tag100, 100)
Everything works as expected (tests execution included). The problem I have is with the design time behavior inside Visual Studio, while I write code. I expect to have some overhead due to the type provider, but the slowness seems frankly excessive. The times reported below are in seconds and refer to the time measured from pushing the dot (.) key until the intellisense property list appears on the screen
providedTags. -> 15
localTags. -> 5
If I comment out or remove the first test code lines (so to eliminate any references to the provided stuff), then I get
localTags. -> immediate
If the number of properties is greater, the time seems to increase exponentially, not linearly, so that at 10000 it becomes minutes.
Questions are:
Is this normal or am I doing something wrong?
Are there guidelines to achieve a faster response?
If someone is curious about why I need so many properties, I am trying to supply an instrument to data analysts so that they can write F# scripts and get data out of an historian database with more than 10000 tags in its schema.
Issue has been fixed by Don Syme, see
https://github.com/fsprojects/FSharp.TypeProviders.SDK/issues/220
and
https://github.com/fsprojects/FSharp.TypeProviders.SDK/pull/229

Lists garbage collection

If I do the following:
List2 = [V || V <- List1, ...]
It seems that the List2 refers to the List1 and erlang:garbage_collect() doesn't clear memory. How is it possible to create a new list without references and discard the old?
In any language with garbage collection you simply need to 'lose' all references to a piece of data before it can be garbage collected. Simply returning from the function that generates the original list, while not storing it in any other 'persistent' location (e.g. the process dictionary), should allow the memory to be reclaimed.
The VM is supposed to manage the garbage collecting. If you use a gen_server, or if you use a "home made" server_loop(State), you should have always the same pattern:
server_loop(State) ->
A = somefunc(State),
B = receive
mesg1 -> func1(...);
...
after Timeout ->
func2(...)
end,
NewState = func3(...),
server_loop(NewState).
As long as a process is alive, executing this loop, the VM will allocate and manage memory areas to store all needed information (variables, message queue...+ some margin) As far as I know, there is some spare memory allocated to the process, and if the VM does not try to recover the memory very fast after it has been released, but if you force a garbage collecting, using erlang:garbage_collect(Pid) you can verify that the memory is free - see example bellow.
startloop() -> spawn(?MODULE,loop,[{lists:seq(1,1000),infinity}]).
loop(endloop) -> ok;
loop({S,T}) ->
NewState = receive
biglist -> {lists:seq(1,5000000),T};
{timeout,V} -> {S,V};
sizelist -> io:format("Size of the list = ~p~n",[length(S)]),
{S,T};
endloop -> endloop
after T ->
L = length(S) div 2,
{lists:seq(1,L),T}
end,
loop(NewState).
%% Here, NewState is a copy of State or a totally new data, depending on the
%% received message. In general, for performance consideration it can be
%% interesting to take care of the function used to avoid big copies,
%% and allow the compiler optimize the beam code
%% [H|Q] rather than Q ++ [H] to add a term to a list for example
and the results in the VM:
2> P = lattice:startloop().
<0.57.0>
...
6> application:start(sasl).
....
ok
7> application:start(os_mon).
...
ok
...
11> P ! biglist.
biglist
...
% get_memory_data() -> {Total,Allocated,Worst}.
14> memsup:get_memory_data().
{8109199360,5346488320,{<0.57.0>,80244336}}
...
23> P ! {timeout,1000}.
{timeout,1000}
24> memsup:get_memory_data().
{8109199360,5367361536,{<0.57.0>,80244336}}
the worst case is the loop process: {<0.57.0>,80244336}
...
28> P ! sizelist.
Size of the list = 0
sizelist
...
31> P ! {timeout,infinity}.
{timeout,infinity}
32> P ! biglist.
biglist
33> P ! sizelist.
Size of the list = 5000000
sizelist
...
36> P ! {timeout,1000}.
{timeout,1000}
37> memsup:get_memory_data().
{8109199360,5314289664,{<0.57.0>,10770968}}
%% note the garbage collecting in the previous line: {<0.57.0>,10770968}
38> P ! sizelist.
sizelist
Size of the list = 156250
39> memsup:get_memory_data().
{8109199360,5314289664,{<0.57.0>,10770968}}
...
46> P ! sizelist.
Size of the list = 0
sizelist
47> memsup:get_memory_data().
{8109199360,5281882112,{<0.57.0>,10770968}}
...
50> erlang:garbage_collect(P).
true
51> memsup:get_memory_data().
{8109199360,5298778112,{<0.51.0>,688728}}
%% after GC, the process <0.57.0> is no more the worst case
If you create new list like this, the new list will have elements from the first one, some elements will be shared between both the lists. And if you throw the first list away, shared elements will still be reachable from the new list and won't count as garbage.
How do you check if the first list is garbage collected? Do you test this in erlang console? The console stores results of evaluation each expression that may be the cause you don't see the list garbage collected.

Simple debugging in Haskell

I am new to Haskell. Previously I have programmed in Python and Java. When I am debugging some code I have a habit of littering it with print statements in the middle of code. However doing so in Haskell will change semantics, and I will have to change my function signatures to those with IO stuff. How do Haskellers deal with this? I might be missing something obvious. Please enlighten.
Other answers link the official doco and the Haskell wiki but if you've made it to this answer let's assume you bounced off those for whatever reason. The wikibook also has an example using Fibonacci which I found more accessible. This is a deliberately basic example which might hopefully help.
Let's say we start with this very simple function, which for important business reasons, adds "bob" to a string, then reverses it.
bobreverse x = reverse ("bob" ++ x)
Output in GHCI:
> bobreverse "jill"
"llijbob"
We don't see how this could possibly be going wrong, but something near it is, so we add debug.
import Debug.Trace
bobreverse x = trace ("DEBUG: bobreverse" ++ show x) (reverse ("bob" ++ x))
Output:
> bobreverse "jill"
"DEBUG: bobreverse "jill"
llijbob"
We are using show just to ensure x is converted to a string correctly before output. We also added some parenthesis to make sure the arguments were grouped correctly.
In summary, the trace function is a decorator which prints the first argument and returns the second. It looks like a pure function, so you don't need to bring IO or other signatures into the functions to use it. It does this by cheating, which is explained further in the linked documentation above, if you are curious.
Read this. You can use Debug.Trace.trace in place of print statements.
I was able to create a dual personality IO / ST monad typeclass, which will print debug statements when a monadic computation is typed as IO, them when it's typed as ST. Demonstration and code here: Haskell -- dual personality IO / ST monad? .
Of course Debug.Trace is more of a swiss army knife, especially when wrapped with a useful special case,
trace2 :: Show a => [Char] -> a -> a
trace2 name x = trace (name ++ ": " ++ show x) x
which can be used like (trace2 "first arg" 3) + 4
edit
You can make this even fancier if you want source locations
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH
import Language.Haskell.TH.Syntax as TH
import Debug.Trace
withLocation :: Q Exp -> Q Exp
withLocation f = do
let error = locationString =<< location
appE f error
where
locationString :: Loc -> Q Exp
locationString loc = do
litE $ stringL $ formatLoc loc
formatLoc :: Loc -> String
formatLoc loc = let file = loc_filename loc
(line, col) = loc_start loc
in concat [file, ":", show line, ":", show col]
trace3' (loc :: String) msg x =
trace2 ('[' : loc ++ "] " ++ msg) x
trace3 = withLocation [| trace3' |]
then, in a separate file [from the definition above], you can write
{-# LANGUAGE TemplateHaskell #-}
tr3 x = $trace3 "hello" x
and test it out
> tr3 4
[MyFile.hs:2:9] hello: 4
You can use Debug.Trace for that.
I really liked Dons short blog about it:
https://donsbot.wordpress.com/2007/11/14/no-more-exceptions-debugging-haskell-code-with-ghci/
In short: use ghci, example with a program with code called HsColour.hs
$ ghci HsColour.hs
*Main> :set -fbreak-on-exception
*Main> :set args "source.hs"
Now run your program with tracing on, and GHCi will stop your program at the call to error:
*Main> :trace main
Stopped at (exception thrown)
Ok, good. We had an exception… Let’s just back up a bit and see where we are. Watch now as we travel backwards in time through our program, using the (bizarre, I know) “:back” command:
[(exception thrown)] *Main> :back
Logged breakpoint at Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)
_result :: [String]
This tells us that immediately before hitting error, we were in the file Language/Haskell/HsColour/Classify.hs, at line 19. We’re in pretty good shape now. Let’s see where exactly:
[-1: Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)] *Main> :list
18 chunk :: String -> [String]
vv
19 chunk [] = head []
20 chunk ('\r':s) = chunk s -- get rid of DOS newline stuff
21 chunk ('\n':s) = "\n": chunk s
^^

F# 2010 Seq.generate_using

Is there an alternative/workaround for Seq.generate_using in Visual Studio 2010? The FSharp.PowerPack.dll is not available for 2010 AFAIK
(Sorry about the PowerPack still not being available for 2010 yet.)
I don't recall if this already is true of the CTP update, but in internal bits I get the warning:
This construct is deprecated. This
function will be removed in a future
release. If necessary, take a copy of
its implementation from the F#
PowerPack and copy it into your
application
so here is the code from the PowerPack:
#nowarn "9"
namespace Microsoft.FSharp.Compatibility
open System.Collections.Generic
module Seq =
let combine ie1 ie2 = Seq.zip ie1 ie2
let nonempty (ie : seq<'T>) = use e = ie.GetEnumerator() in e.MoveNext() //'
let generate openf compute closef =
seq { let r = openf()
try
let x = ref None
while (x := compute r; (!x).IsSome) do
yield (!x).Value
finally
closef r }
let generate_using (openf : unit -> ('b :> System.IDisposable)) compute = //'
generate openf compute (fun (s:'b) -> s.Dispose()) //'
let cons (x:'T) (s: seq<'T>) =
seq { yield x
yield! s }
FYI, the PowerPack binaries for .Net 4.0 Beta1 came online today:
http://www.microsoft.com/downloads/details.aspx?FamilyID=e475a670-9596-4958-bfa2-dc0ac29b4631&displaylang=en

Resources