F# Array.Parallel.map does not provide parallel processing

F# Array.Parallel.map does not provide parallel processing - parallel-processing

I have to simulate a discrete environment in F#, to be called by Python, for a reinforcement learning problem. I had a function with primitive types (mainly float) to make the exchange of data smoother. Now I am in the position to run this function many times with different data, so to run it in parallel seems a good idea.
I have the following code:
type AscentStrategy = |Strategy of seq<float>
let simulateAscent env ascentLimiter initState (sequenceOfDepths:seq<float>) =
//let infinitSeqOfConstantValues = (fun _ -> constantDepth) |> Seq.initInfinite
sequenceOfDepths
|> Seq.scan ( fun ( nextState, rew, isTerminal, _ ) depth -> getNextEnvResponseAndBoundForNextAction(env, nextState , depth , ascentLimiter) ) ( initState, 0.0 , false, 0.0)
|> SeqExtension.takeWhileWithLast (fun (_ , _, isTerminalState, _) -> not isTerminalState)
|> Seq.toArray
and then
let simulateStrategy ({MaxPDCS = maxPDCS ; MaxSimTime = maximumSimulationTime ; PenaltyForExceedingRisk = penaltyForExceedingRisk ;
RewardForDelivering = rewardForDelivering ; PenaltyForExceedingTime = penaltyForExceedingTime ; IntegrationTime = integrationTime
ControlToIntegrationTimeRatio = controlToIntegrationTimeRatio; DescentRate = descentRate; MaximumDepth = maximumDepth ;
BottomTime = bottomTime ; LegDiscreteTime = legDiscreteTime } : SimulationParameters) (Strategy ascentStrategy : AscentStrategy) =
let env, initState , ascentLimiter , _ = getEnvInitStateAndAscentLimiter ( maxPDCS , maximumSimulationTime ,
penaltyForExceedingRisk , rewardForDelivering , penaltyForExceedingTime ,
integrationTime ,
controlToIntegrationTimeRatio,
descentRate ,
maximumDepth ,
bottomTime ,
legDiscreteTime )
ascentStrategy
|> simulateAscent env ascentLimiter initState
finally I call the function for testing:
let commonSimulationParameters = {MaxPDCS = 0.32 ; MaxSimTime = 2000.0 ; PenaltyForExceedingRisk = 1.0 ; RewardForDelivering = 10.0; PenaltyForExceedingTime = 0.5 ;
IntegrationTime = 0.1; ControlToIntegrationTimeRatio = 10; DescentRate = 60.0; MaximumDepth = 20.0 ; BottomTime = 10.0; LegDiscreteTime = 0.1}
printfn"insert number of elements"
let maxInputsString = Console.ReadLine()
let maxInputs = maxInputsString |> Double.Parse
let inputsStrategies = [|0.0 .. maxInputs|] |> Array.map (fun x -> Seq.initInfinite (fun _ -> x ) )
let testParallel = inputsStrategies
|> Array.Parallel.map (fun x -> (simulateStrategy commonSimulationParameters ( Strategy x )) )
I have compared this with Array.map and, while it is faster and uses 70% of the CPU on my laptop, still does not seem to use the whole processing power. I have run it on a machine with many more cores ( ~50) and it barely increases the CPU usage (it gets up to 3/4% of total usage with 50ish independent inputs). I think there must be a deadlock generated somewhere, but how can I detect and get rid of it?
Also, why does this happen? One of the advantages of functional programming, as I see it, is also to be able to parallelize easily.
PS: SeqExtension.takeWhileWithLast is a function I have found on SO, kindly provided by Tomas Petricek in one of his brilliant answers, if needed I can post it.
PPS: env is the environment, whose type is defined as:
type Environment<'S, 'A ,'I> = |Environment of (State<'S> -> Action<'A> -> EnvironmentOutput<'S ,'I>)
I have tried the same with Async.Parallel and ParallelSeq, reporting the same problem.
Would a message-based solution solve the problem>? I am looking into it, although I am not familiar at all, but would it be a good way of getting the code parallel, using MailboxProcessor?
Following my question,
I have tried also this great library for parallel code, based on streams of data. https://nessos.github.io/Streams/.
I have added the following code:
let nessosResult = inputsStrategies
|> ParStream.ofArray
|> ParStream.map simulateStrategy
|> ParStream.toArray
I have defined an ad hoc type for inputStrategy (basic the old tuple I had) so that simulateStrategy accepts only one input. Unfortunately the problem seems very well hidden somewhere. I attach a graph with CPU usage. Time spent on my machine for the different cases is: ~8.8 sec (sequential); ~6.2 sec (Array.Parallel.map); ~ 6.1 sec (Nessos.Streams)

I have found that server garbage collection is necessary to get the best parallel performance on .NET. Something like this in your app.config:
<configuration>
<runtime>
<gcServer enabled="true" />
</runtime>
</configuration>

Related

Macro average (iterate over classes) in custom training loop - tensorflow

I am using a custom training loop. The task is a multi-label multi-class classification, i.e. I have multiple classes I want to predict and each class admits multiple labels. loss has dimensions batch_size, no_classes, as said before each col in no_classes is a multi-label classification task. The following code works when #tf.function is commented out, however once graph mode is on, this is not working since iterating over tensor is not allowed in graph mode. Would anyone be able to suggest how I can rewrite the code below so that it works in graph mode?
items_loss_list = []
for item in range(loss.shape[1]):
values, _ = tf.unique(y[:, item])
item_macro_average = tf.reduce_mean(
[
tf.reduce_mean(
tf.gather_nd(
loss[:, item],
indices=tf.cast(tf.where(y[:, item] == v), tf.int32),
)
)
for v in values
]
)
items_loss_list.append(item_macro_average)
I also tried:
i = tf.constant(0)
while_condition = lambda i: tf.less(i, len(values))
item_score_avg = []
def body(i):
item_score_avg.append(
tf.reduce_mean(
tf.gather_nd(
loss[:, item],
indices=tf.cast(tf.where(y[:, item] == values[i]), tf.int32),
)
)
)
return [tf.add(i, 1)]
tf.while_loop(while_condition, body, [i])
items_loss_list.append(tf.reduce_mean(item_score_avg))
But this is not working either in graph mode. Thank you for your help!

Apparently map_fn solves the problem. items_macro_average is a list collecting the macro average loss per task.
items_macro_average = []
for item in range(loss.shape[1]):
values, _ = tf.unique(y[:, item])
item_macro_average = tf.reduce_mean(
tf.map_fn(
fn=lambda v: tf.reduce_mean(
tf.gather_nd(
loss[:, item],
indices=tf.cast(tf.where(y[:, item] == v), tf.int32),
)
),
elems=values,
)
)
items_macro_average.append(item_macro_average)

F# Type Provider slows down intellisense in Visual Studio 2017

I have a very simple type provider; all types are erased, the provided type has 2000 int readonly properties Tag1..Tag2000
let ns = "MyNamespace"
let asm = Assembly.GetExecutingAssembly()
let private newProperty t name getter isStatic = ProvidedProperty(name, t, getter, isStatic = isStatic)
let private newStaticProperty t name getter = newProperty t name (fun _ -> getter) true
let private newInstanceProperty t name getter = newProperty t name (fun _ -> getter) false
let private addStaticProperty t name getter (``type``:ProvidedTypeDefinition) = ``type``.AddMember (newStaticProperty t name getter); ``type``
let private addInstanceProperty t name getter (``type``:ProvidedTypeDefinition) = ``type``.AddMember (newInstanceProperty t name getter); ``type``
[<TypeProvider>]
type TypeProvider(config : TypeProviderConfig) as this =
inherit TypeProviderForNamespaces(config)
let provider = ProvidedTypeDefinition(asm, ns, "Provider", Some typeof<obj>, hideObjectMethods = true)
let tags = ProvidedTypeDefinition(asm, ns, "Tags", Some typeof<obj>, hideObjectMethods = true)
do [1..2000] |> Seq.iter (fun i -> addInstanceProperty typeof<int> (sprintf "Tag%d" i) <## i ##> tags |> ignore)
do provider.DefineStaticParameters([ProvidedStaticParameter("Host", typeof<string>)], fun name args ->
let provided = ProvidedTypeDefinition(asm, ns, name, Some typeof<obj>, hideObjectMethods = true)
addStaticProperty tags "Tags" <## obj() ##> provided |> ignore
provided
)
do this.AddNamespace(ns, [provider; tags])
Then a test project with two modules in separate files:
module Common
open MyNamespace
type Provided = Provider<"">
let providedTags = Provided.Tags
type LocalTags() =
member this.Tag1 with get() : int = 1
member this.Tag2 with get() : int = 2
.
.
member this.Tag1999 with get() : int = 1999
member this.Tag2000 with get() : int = 2000
let localTags = LocalTags()
module Tests
open Common
open Xunit
[<Fact>]
let ProvidedTagsTest () =
Assert.Equal<int>(providedTags.Tag1001, 1001)
[<Fact>]
let LocalTagsTest () =
Assert.Equal<int>(localTags.Tag100, 100)
Everything works as expected (tests execution included). The problem I have is with the design time behavior inside Visual Studio, while I write code. I expect to have some overhead due to the type provider, but the slowness seems frankly excessive. The times reported below are in seconds and refer to the time measured from pushing the dot (.) key until the intellisense property list appears on the screen
providedTags. -> 15
localTags. -> 5
If I comment out or remove the first test code lines (so to eliminate any references to the provided stuff), then I get
localTags. -> immediate
If the number of properties is greater, the time seems to increase exponentially, not linearly, so that at 10000 it becomes minutes.
Questions are:
Is this normal or am I doing something wrong?
Are there guidelines to achieve a faster response?
If someone is curious about why I need so many properties, I am trying to supply an instrument to data analysts so that they can write F# scripts and get data out of an historian database with more than 10000 tags in its schema.

Issue has been fixed by Don Syme, see
https://github.com/fsprojects/FSharp.TypeProviders.SDK/issues/220
and
https://github.com/fsprojects/FSharp.TypeProviders.SDK/pull/229

How to speed up MATLAB integration?

I have the following code:
function [] = Solver( t )
%pre-declaration
foo=[1,1,1];
fooCell = num2cell(foo);
[q, val(q), star]=fooCell{:};
%functions used in prosomoiwsh
syms q val(q) star;
qd1=symfun(90*pi/180+30*pi/180*cos(q),q);
qd2=symfun(90*pi/180+30*pi/180*sin(q),q);
p1=symfun(79*pi/180*exp(-1.25*q)+pi/180,q);
p2=symfun(79*pi/180*exp(-1.25*q)+pi/180,q);
e1=symfun(val-qd1,q);
e2=symfun(val-qd2,q);
T1=symfun(log(-(1+star)/star),star);
T2=symfun(log(star/(1-star)),star);
%anonymous function handles
lambda=[0.75;10.494441313222076];
calcEVR_handles={#(t,x)[double(subs(diff(subs(T1,star,e1/p1),q)+subs(lambda(1)*T1,star,e1/p1),{diff(val,q);val;q},{x(2);x(1);t})),double(subs(diff(subs(T1,star,e1/p1),q)+subs(lambda(1)*T1,star,e1/p1),{diff(val,q);val;q},{0;x(1);t})),double(subs(double(subs(subs(diff(T1,star),star,e1/p1),{val;q},{x(1);t}))/p1,q,t))];#(t,x)[double(subs(diff(subs(T2,star,e2/p2),q)+subs(lambda(2)*T2,star,e2/p2),{diff(val,q);val;q},{x(4);x(3);t})),double(subs(diff(subs(T2,star,e2/p2),q)+subs(lambda(2)*T2,star,e2/p2),{diff(val,q);val;q},{0;x(3);t})),double(subs(double(subs(subs(diff(T2,star),star,e2/p2),{val;q},{x(3);t}))/p2,q,t))]};
options = odeset('AbsTol',1e-1,'RelTol',1e-1);
[T,x_r] = ode23(#prosomoiwsh,[0 t],[80*pi/180;0;130*pi/180;0;2.4943180186983711;11.216948999754299],options);
save newresult T x_r
function dx_th = prosomoiwsh(t,x_th)
%declarations
k=0.80773938740480955;
nf=6.2860930902603602;
hGa=0.16727117784664769;
hGb=0.010886618389781832;
dD=0.14062935253218495;
s=0.64963817519705203;
IwF={[4.5453398382686956 5.2541234145178066 -6.5853972592002235 7.695225990702979];[-4.4358339284697337 -8.1138542053372298 -8.2698210582548395 3.9739729629084071]};
IwG={[5.7098975358444752 4.2470526600975802 -0.83412489434697168 0.53829395964565041] [1.8689492167233894 -0.0015017513794517434 8.8666804106266461 -1.0775021663921467];[6.9513235639494155 -0.8133752392893685 7.4032432556804162 3.1496138243338709] [5.8037182454981568 2.0933267947187457 4.852362963697928 -0.10745559204132382]};
IbF={-1.2165533594615545;7.9215291787744917};
IbG={2.8425752327892844 2.5931576770598168;9.4789237295474873 7.9378928037841252};
p=2;
m=2;
signG=1;
n_vals=[2;2];
nFixedStates=4;
gamma_nn=[0.31559428834175318;9.2037894041383641];
th_star_guess=[2.4943180186983711;11.216948999754299];
%solution
x = x_th(1:nFixedStates);
th = x_th(nFixedStates+1:nFixedStates+p);
f = zeros(m,1);
G = zeros(m,m);
ZF = zeros(p,m);
ZG = zeros(p,m,m);
for i=1:m
[f(i), ZF(:,i)] = calculate_neural_output(x, IwF{i}, IbF{i}, th);
for j=1:m
[G(i,j), ZG(:,i,j)] = calculate_neural_output(x, IwG{i,j}, IbG{i,j}, th);
end
end
detG = det(G);
if m == 1
adjG = 1;
else
adjG = detG*G^-1;
end
E = zeros(m,1);
V = zeros(m,1);
R = zeros(m,m);
for i=1:m
EVR=calcEVR_handles{i}(t,x);
E(i)=EVR(1);
V(i)=EVR(2);
R(i,i)=EVR(3);
end
Rinv = R^-1;
prod_R_E = R*E;
ub = f + Rinv * (V + k*E) + nf*prod_R_E;
ua = - detG / (detG^2+dD) * (adjG * ub) ;
u = ua - signG * (hGa*(ua'*ua) + hGb*(ub'*ub)) * prod_R_E;
dx_th = zeros(nFixedStates+p, 1); %preallocation
%System in form (1) of the IEEE paper
[vec_sys_f, vec_sys_G] = sys_f_G(x);
dx_nm = vec_sys_f + vec_sys_G*u;
%Calculation of dx
index_start = 1;
index_end = -1;
for i=1:m
index_end = index_end + n_vals(i);
for j=index_start:index_end
dx_th(j) = x(j+1);
end
dx_th(index_end+1) = dx_nm(i);
index_start = index_end + 2;
end
%Calculation of dth
AFvalueT = zeros(p,m);
for i=1:m
AFvalueT(:,i) = 0;
for j=1:m
AFvalueT(:,i) = AFvalueT(:,i)+ZG(:,i,j)*ua(j);
end
end
dx_th(nFixedStates+1:nFixedStates+p) = diag(gamma_nn)*( (ZF+AFvalueT)*prod_R_E -s*(th-th_star_guess) );
display(t)
end
function [y, Z] = calculate_neural_output(input, Iw, Ib, state)
Z = [tanh(Iw*input+Ib);1];
y = state' * Z;
end
function [ f,g ] = sys_f_G( x )
Iz1=0.96;
Iz2=0.81;
m1=3.2;
m2=2.0;
l1=0.5;
l2=0.4;
g=9.81;
q1=x(1);
q2=x(3);
q1dot=x(2);
q2dot=x(4);
M=[Iz1+Iz2+m1*l1^2/4+m2*(l1^2+l2^2/4+l1*l2*cos(q2)),Iz2+m2*(l2^2/4+l1*l2*cos(q2)/2);Iz2+m2*(l2^2/4+l1*l2*cos(q2)/2),Iz2+m2*l2^2/4];
c=0.5*m2*l1*l2*sin(q2);
C=[-c*q2dot,-c*(q1dot+q2dot);c*q1dot,0];
G=[0.5*m1*g*l1*cos(q1)+m2*g*(l1*cos(q1)+0.5*l2*cos(q1+q2));0.5*m2*g*l2*cos(q1+q2)];
f=-M\(C*[q1dot;q2dot]+G);
g=inv(M);
end
end
Its target is to simulate the control of a 2-DOF robotic arm using a certain control law. The results I get after running the simulation are correct(I have a graph of the output I should expect), but it takes ages to finish!
Is there anything I could do to speed up the process?

In order to improve the computational speed of any integration in Matlab, a few options are available to you:
Reduce the required accuracy (which you already have done)
Use an adapted integrator. As mentioned by #sanchises, sometimes ode23 can be longer than another ode solver in Matlab (if your equation is stiff for instance). You could try to determine which solver is most adapted from the documentation... Or simply try them all!
The best solution, but by far the most time consuming, would be to use a compiled language, such as C or Fortran. If the integration is but a part of your Matlab program, you could use Mex files, and translate only the integration to a compiled language. You could also create dynamic libraries in your compiled language and load them in Matlab using loadlibrary. I use loadlibrary and an integration routine written in Fortran for the integration of orbits and trajectories, and I get over 100 times speedup with Fortran vs. Matlab! Of course, technically, the integration is not in Matlab anymore... But the library or Mex files trick allows you to only convert the integration part of your program to a different language! A number of open source integrators are available, such as ODEPACK or RKSUITE in Fortran. Then, you only need to create a wrapper and your dynamics function in the correct language.
So to put it in a nutshell, if you're going to use this integration a lot, I would advise using a compiled language. If not, you should make do with Matlab, and be patient!

Simple debugging in Haskell

I am new to Haskell. Previously I have programmed in Python and Java. When I am debugging some code I have a habit of littering it with print statements in the middle of code. However doing so in Haskell will change semantics, and I will have to change my function signatures to those with IO stuff. How do Haskellers deal with this? I might be missing something obvious. Please enlighten.

Other answers link the official doco and the Haskell wiki but if you've made it to this answer let's assume you bounced off those for whatever reason. The wikibook also has an example using Fibonacci which I found more accessible. This is a deliberately basic example which might hopefully help.
Let's say we start with this very simple function, which for important business reasons, adds "bob" to a string, then reverses it.
bobreverse x = reverse ("bob" ++ x)
Output in GHCI:
> bobreverse "jill"
"llijbob"
We don't see how this could possibly be going wrong, but something near it is, so we add debug.
import Debug.Trace
bobreverse x = trace ("DEBUG: bobreverse" ++ show x) (reverse ("bob" ++ x))
Output:
> bobreverse "jill"
"DEBUG: bobreverse "jill"
llijbob"
We are using show just to ensure x is converted to a string correctly before output. We also added some parenthesis to make sure the arguments were grouped correctly.
In summary, the trace function is a decorator which prints the first argument and returns the second. It looks like a pure function, so you don't need to bring IO or other signatures into the functions to use it. It does this by cheating, which is explained further in the linked documentation above, if you are curious.

Read this. You can use Debug.Trace.trace in place of print statements.

I was able to create a dual personality IO / ST monad typeclass, which will print debug statements when a monadic computation is typed as IO, them when it's typed as ST. Demonstration and code here: Haskell -- dual personality IO / ST monad? .
Of course Debug.Trace is more of a swiss army knife, especially when wrapped with a useful special case,
trace2 :: Show a => [Char] -> a -> a
trace2 name x = trace (name ++ ": " ++ show x) x
which can be used like (trace2 "first arg" 3) + 4
edit
You can make this even fancier if you want source locations
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH
import Language.Haskell.TH.Syntax as TH
import Debug.Trace
withLocation :: Q Exp -> Q Exp
withLocation f = do
let error = locationString =<< location
appE f error
where
locationString :: Loc -> Q Exp
locationString loc = do
litE $ stringL $ formatLoc loc
formatLoc :: Loc -> String
formatLoc loc = let file = loc_filename loc
(line, col) = loc_start loc
in concat [file, ":", show line, ":", show col]
trace3' (loc :: String) msg x =
trace2 ('[' : loc ++ "] " ++ msg) x
trace3 = withLocation [| trace3' |]
then, in a separate file [from the definition above], you can write
{-# LANGUAGE TemplateHaskell #-}
tr3 x = $trace3 "hello" x
and test it out
> tr3 4
[MyFile.hs:2:9] hello: 4

You can use Debug.Trace for that.

I really liked Dons short blog about it:
https://donsbot.wordpress.com/2007/11/14/no-more-exceptions-debugging-haskell-code-with-ghci/
In short: use ghci, example with a program with code called HsColour.hs
$ ghci HsColour.hs
*Main> :set -fbreak-on-exception
*Main> :set args "source.hs"
Now run your program with tracing on, and GHCi will stop your program at the call to error:
*Main> :trace main
Stopped at (exception thrown)
Ok, good. We had an exception… Let’s just back up a bit and see where we are. Watch now as we travel backwards in time through our program, using the (bizarre, I know) “:back” command:
[(exception thrown)] *Main> :back
Logged breakpoint at Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)
_result :: [String]
This tells us that immediately before hitting error, we were in the file Language/Haskell/HsColour/Classify.hs, at line 19. We’re in pretty good shape now. Let’s see where exactly:
[-1: Language/Haskell/HsColour/Classify.hs:(19,0)-(31,46)] *Main> :list
18 chunk :: String -> [String]
vv
19 chunk [] = head []
20 chunk ('\r':s) = chunk s -- get rid of DOS newline stuff
21 chunk ('\n':s) = "\n": chunk s
^^

How to achieve Asynchrony instead of Parallelism in F#

(Sticking to a common example with async fetch of many web pages)
How would I spin off multiple (hundreds) of web page requests asynchronously, and then wait for all requests to complete before going to the next step? Async.AsParallel processes a few requests at a time, controlled by number of cores on the CPU. Grabbing a web page is not a CPU-bound operation. Not satisfied with the speedup of Async.AsParallel, I am looking for alternatives.
I tried to connect the dots between Async.StartAsTask and Task[].WaitAll. Instinctively, I wrote the following code, but it does not compile.
let processItemsConcurrently (items : int seq) =
let tasks = items |> Seq.map (fun item -> Async.StartAsTask(fetchAsync item))
Tasks.Task.WaitAll(tasks)
How would you approach this?

Async.Parallel is almost definitely right here. Not sure what you're not happy with; the strength of F# asyncs lies more in async computing than in task-parallel CPU-bound stuff (which is more tailored to Tasks and the .NET 4.0 TPL). Here's a full example:
open System.Diagnostics
open System.IO
open System.Net
open Microsoft.FSharp.Control.WebExtensions
let sites = [|
"http://bing.com"
"http://google.com"
"http://cnn.com"
"http://stackoverflow.com"
"http://yahoo.com"
"http://msdn.com"
"http://microsoft.com"
"http://apple.com"
"http://nfl.com"
"http://amazon.com"
"http://ebay.com"
"http://expedia.com"
"http://twitter.com"
"http://reddit.com"
"http://hulu.com"
"http://youtube.com"
"http://wikipedia.org"
"http://live.com"
"http://msn.com"
"http://wordpress.com"
|]
let print s =
// careful, don't create a synchronization bottleneck by printing
//printf "%s" s
()
let printSummary info fullTimeMs =
Array.sortInPlaceBy (fun (i,_,_) -> i) info
// for i, size, time in info do
// printfn "%2d %7d %5d" i size time
let longest = info |> Array.map (fun (_,_,time) -> time) |> Array.max
printfn "longest request took %dms" longest
let bytes = info |> Array.sumBy (fun (_,size,_) -> float size)
let seconds = float fullTimeMs / 1000.
printfn "sucked down %7.2f KB/s" (bytes / 1024.0 / seconds)
let FetchAllSync() =
let allsw = Stopwatch.StartNew()
let info = sites |> Array.mapi (fun i url ->
let sw = Stopwatch.StartNew()
print "S"
let req = WebRequest.Create(url)
use resp = req.GetResponse()
use stream = resp.GetResponseStream()
use reader = new StreamReader(stream,
System.Text.Encoding.UTF8, true, 4096)
print "-"
let contents = reader.ReadToEnd()
print "r"
i, contents.Length, sw.ElapsedMilliseconds)
let time = allsw.ElapsedMilliseconds
printSummary info time
time, info |> Array.sumBy (fun (_,size,_) -> size)
let FetchAllAsync() =
let allsw = Stopwatch.StartNew()
let info = sites |> Array.mapi (fun i url -> async {
let sw = Stopwatch.StartNew()
print "S"
let req = WebRequest.Create(url)
use! resp = req.AsyncGetResponse()
use stream = resp.GetResponseStream()
use reader = new AsyncStreamReader(stream, // F# PowerPack
System.Text.Encoding.UTF8, true, 4096)
print "-"
let! contents = reader.ReadToEnd() // in F# PowerPack
print "r"
return i, contents.Length, sw.ElapsedMilliseconds })
|> Async.Parallel
|> Async.RunSynchronously
let time = allsw.ElapsedMilliseconds
printSummary info time
time, info |> Array.sumBy (fun (_,size,_) -> size)
// By default, I think .NET limits you to 2 open connections at once
ServicePointManager.DefaultConnectionLimit <- sites.Length
for i in 1..3 do // to warmup and show variance
let time1,r1 = FetchAllSync()
printfn "Sync took %dms, result was %d" time1 r1
let time2,r2 = FetchAllAsync()
printfn "Async took %dms, result was %d (speedup=%2.2f)"
time2 r2 (float time1/ float time2)
printfn ""
On my 4-core box, this consistently gives a nearly 4x speedup.
EDIT
In reply to your comment, I've updated the code. You're right in that I've added more sites and am not seeing the expected speedup (still holding steady around 4x). I've started adding a little debugging output above, will continue investigating to see if something else is throttling the connections...
EDIT
Editted the code again. Well, I found what might be the bottleneck. Here's the implementation of AsyncReadToEnd in the PowerPack:
type System.IO.StreamReader with
member s.AsyncReadToEnd () =
FileExtensions.UnblockViaNewThread (fun () -> s.ReadToEnd())
In other words, it just blocks a threadpool thread and reads synchronously. Argh!!! Let me see if I can work around that.
EDIT
Ok, the AsyncStreamReader in the PowerPack does the right thing, and I'm using that now.
However, the key issue seems to be variance.
When you hit, say, cnn.com, a lot of the time the result will come back in like 500ms. But every once in a while you get that one request that takes 4s, and this of course potentially kills the apparent async perf, since the overall time is the time of the unluckiest request.
Running the program above, I see speedups from about 2.5x to 9x on my 2-core box at home. It is very highly variable, though. It's still possible there's some bottleneck in the program that I've missed, but I think the variance-of-the-web may account for all of what I'm seeing at this point.

Using the Reactive Extensions for .NET combined with F#, you can write a very elegant solution - check out the sample at http://blog.paulbetts.org/index.php/2010/11/16/making-async-io-work-for-you-reactive-style/ (this uses C#, but using F# is easy too; the key is using the Begin/End methods instead of the sync method, which even if you can make it compile, it will block up n ThreadPool threads unnecessarily, instead of the Threadpool just picking up completion routines as they come in)

My bet is that the speedup you're experiencing is not significant enough for your taste because you're either using a subtype of WebRequest or a class relying on it (such as WebClient).
If that's the case, you need to set the MaxConnection on the ConnectionManagementElement (and I suggest you only set it if needed otherwise it's gonna become a pretty time-consuming operation) to a high value, depending on the number of simultaneous connections you wanna initiate from your application.

I'm not an F# guy, but from a pure .NET perspective what you're looking for is TaskFactory::FromAsync where the asynchronous call you'd be wrapping in a Task would be something like HttpRequest::BeginGetResponse. You could also wrap up the EAP model that WebClient exposes using a TaskCompletionSource. More on both of these topics here on MSDN.
Hopefully with this knowledge you can find the nearest native F# approach to accomplish what you're trying to do.

Here's some code that avoids the unknowns, such as web access latency. I am getting under 5% CPU utilization, and about 60-80% efficiency for both sync and async code paths.
open System.Diagnostics
let numWorkers = 200
let asyncDelay = 50
let main =
let codeBlocks = [for i in 1..numWorkers ->
async { do! Async.Sleep asyncDelay } ]
while true do
printfn "Concurrent started..."
let sw = new Stopwatch()
sw.Start()
codeBlocks |> Async.Parallel |> Async.RunSynchronously |> ignore
sw.Stop()
printfn "Concurrent in %d millisec" sw.ElapsedMilliseconds
printfn "efficiency: %d%%" (int64 (asyncDelay * 100) / sw.ElapsedMilliseconds)
printfn "Synchronous started..."
let sw = new Stopwatch()
sw.Start()
for codeBlock in codeBlocks do codeBlock |> Async.RunSynchronously |> ignore
sw.Stop()
printfn "Synchronous in %d millisec" sw.ElapsedMilliseconds
printfn "efficiency: %d%%" (int64 (asyncDelay * numWorkers * 100) / sw.ElapsedMilliseconds)
main

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio