How to realize nested parallelism in R on the Windows platform

I am trying to use parApply() from the parallel package in R.
cl <- makeCluster(16)
cl.boot <- makeCluster(8)
In my programme, I first call t(parApply(cl, rv, 1, sim.one.test)). Inside sim.one.test() I call a function boot(), and inside boot() I use
bs.resample <- t(parApply(cl.boot, rv.boot, 1, function(x) bs.mle(n1,n2,x,s,t1,t2,m,theta)))
Simply put, the outer function is sim.one.test() and the inner one is bs.mle().
The error message is "invalid connection". I guess this is because nested parallelism is not supported. Another question on Stack Overflow suggests using mclapply(), but that is only available on Linux, and I am running the programme on Windows. Is there any solution for nested parallel computing on the Windows platform? Thanks.

Why do you think you need nested parallelization? That will just increase your parallelization overhead (if it works at all, which I doubt). Conceptually, it is far better to parallelize only the outer loop (provided it contains enough iterations and is more or less load-balanced).
However, you could use nested foreach loops with a parallel backend, as sketched below. That would collapse your nested loops into a single stream of tasks before sending it to the workers.
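For example, with doParallel the %:% operator merges the two loop levels into one set of tasks, so only a single cluster is needed. A minimal sketch; sim.one.rep() is a hypothetical stand-in for the work done per (replication, bootstrap) pair, not a function from the question:
library(foreach)
library(doParallel)

cl <- makeCluster(8)
registerDoParallel(cl)

# %:% fuses the outer and inner foreach into a single stream of tasks,
# so only one cluster and one level of parallelism are needed.
res <- foreach(i = 1:100, .combine = rbind) %:%
  foreach(b = 1:200, .combine = c) %dopar% {
    sim.one.rep(i, b)  # hypothetical per-iteration worker
  }

stopCluster(cl)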

OpenMDAO External Code Component with MPI

I am trying to optimize an airfoil using OpenMDAO and SU2. I have multiple design points that I want to run in parallel. I managed to do that with a ParallelGroup and XFoil, but I now want to use SU2 instead of XFoil.
The big problem is that SU2 itself is started via MPI (mpirun -np 4 SU2_CFD config.cfg). I want OpenMDAO to divide all the available processes evenly among the design points, and then run one SU2 instance per design point. Every SU2 instance should then use all the processes that OpenMDAO allocated to that design point.
How could I do that?
Probably the wrong approach:
I played around with the ExternalCode component. But if this component gets 2 processes, it is run twice. I don't want to run SU2 twice; I want to run it once, using both available processes.
Best regards,
David
I don't think your approach to wrapping SU2 is going to work if you want to run it in parallel as part of a larger model. ExternalCodeComp is designed for file wrapping and spawns sub-processes, which doesn't give you any way to share MPI communicators with the parent process (that I know of, anyway).
I'm not an expert in SU2, so I can't speak to its Python interface. But I'm quite confident that ExternalCodeComp isn't going to give you what you want here. I suggest you talk to the SU2 developers about their in-memory interface.
I couldn't figure out a simple way. But I discovered ADflow: https://github.com/mdolab/adflow
It is a CFD solver that ships with an OpenMDAO wrapper, so I am going to use that instead.

BigQuery JavaScript UDF process - per row or per processing node?

I'm thinking of using BigQuery's JavaScript UDF as a critical component in a new data architecture. It would be used to logically process each row loaded into the main table, and also to process each row during periodical and ad-hoc aggregation queries.
Using an SQL UDF for the same purpose seems to be unfeasible because each row represents a complex object, and implementing the business logic in SQL, including things such as parsing complex text fields, gets ugly very fast.
I just read the following in the Optimizing query computation documentation page:
Best practice: Avoid using JavaScript user-defined functions. Use native UDFs instead.
Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.
I understand why a new process is needed for each processing node, and I know that JS tends to be deployed in a single-thread-per-process manner (even though V8 does support multithreading these days). But it's not clear to me whether, once a JS runtime process is up, it can be expected to be reused between calls to the same function (e.g. for processing different rows on the same processing node). The amount of reuse will probably significantly affect the cost. My table is not that large (tens to hundreds of millions of rows), but I still need a better understanding here.
I could not find any authoritative source on this. Has anybody done any analysis of the actual impact of using a JavaScript UDF on each processed row, in terms of execution time and cost?
If it's not documented, then that's an implementation detail that could change. But let's test it:
CREATE TEMP FUNCTION randomThis(views INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  if (typeof variable === 'undefined') {
    variable = Math.random();
  }
  return variable;
""";

SELECT randomThis(views), COUNT(*) c
FROM (
  SELECT views
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC
I was expecting ten million different numbers, or at least a handful, but I got only one: the same process was reused ten million times, and variables were kept around between calls.
This even happened when I went up to 100 million rows, signaling that the parallelism is bounded by a single JS VM.
Again, these are implementation details that could change. But while it stays that way, you can make the best use of it.
I was expecting ten million different numbers, or a handful, but I only got one
That's because you didn't allow Math.random to be called more than once.
and variables were kept around in between calls
That's due to variable being defined at the global scope.
In other words, your code explicitly permits Math.random to be executed only once (by implicitly defining variable at the global scope).
If you try this:
CREATE TEMP FUNCTION randomThis(seed INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  let ret = undefined;
  if (ret === undefined) {
    ret = Math.random();
  }
  return ret;
""";

SELECT randomThis(size), COUNT(*) c
FROM (
  SELECT repository_size AS size
  FROM `my-internal-dataset.sample-github-table`
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC
then you get many rows, and the query now takes much longer to execute, probably because the single VM becomes a bottleneck.
(I used a different dataset here to reduce the query cost.)
Conclusion:
1. There is one VM (or maybe one container) per query to support the JS UDF. This is in line with the single subprocess ("Calling a JavaScript UDF requires the instantiation of a subprocess") mentioned in the documentation.
2. If you can apply an execute-once pattern (using some kind of cache, or a coding technique like memoisation) and write a UDF similar to the one in the previous answer, then the mere presence of a JS UDF has limited impact on your query; see the sketch below.
3. If you have to write a JS UDF like the one in this answer, the impact on your query becomes very significant, with execution time skyrocketing even for simple JS code. In that case it's certainly better to stay away.
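To illustrate point 2, a hypothetical UDF that exploits the VM reuse by caching expensive work in the global scope could look like this (parseField and the toy parsing logic inside it are purely illustrative, not from the question):
CREATE TEMP FUNCTION parseField(s STRING)
RETURNS STRING
LANGUAGE js AS """
  // The cache lives at global scope, so it survives between calls
  // for as long as the same JS VM is reused within the query.
  if (typeof cache === 'undefined') {
    cache = {};
  }
  if (!(s in cache)) {
    cache[s] = s.trim().toLowerCase();  // stand-in for expensive parsing
  }
  return cache[s];
""";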

How do I use the CAS (compare-and-swap) operation?

How can I read-modify-write the same variable from multiple GPU threads? In C++ AMP I used the standard library's compare-and-swap function, but I haven't found an example for AleaGPU.
I know the goal is to avoid such things, but without going into much detail I'll say it's pretty necessary for my code.
There is an API for this in AleaGPU: http://www.aleagpu.com/release/3_0_3/api/html/64c9ca47-2e8e-265b-d968-15345e374320.htm
The usage mirrors CUDA's atomicCAS, which is described here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomiccas
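In CUDA C terms (which the AleaGPU API mirrors), the canonical read-modify-write pattern is a retry loop around atomicCAS: read the old value, compute the new one, and attempt the swap until no other thread got in between. A minimal sketch implementing an atomic max for floats (the helper name is illustrative):
__device__ float atomicMaxFloat(float *addr, float value)
{
    // atomicCAS operates on integers, so reinterpret the float bits.
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int;
    int assumed;
    do {
        assumed = old;
        if (__int_as_float(assumed) >= value)
            break;  // current value is already at least as large
        // Try to install the new value; atomicCAS returns the previous
        // contents, so the loop retries if another thread changed it.
        old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
    } while (assumed != old);
    return __int_as_float(old);
}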

OpenACC: How to parallelize function calls

I am working on a project and am trying to parallelize the application.
There are some functions I am trying to parallelize, but the problem is that these functions call other functions very frequently. The loops are only for computation, and there are many loops in one function body.
I know OpenACC does not support function calls within its directives (only inlined calls), so I have come up with two approaches:
a) just put OpenACC directives around the loops to get the required parallelism and leave the function calls as they are (doing this in each and every function body); or
b) move the called function bodies inside the calling function, so that the overhead of entering an acc region multiple times is minimized (by including a large number of loops in one block). But this seems like much more of a headache, because the function bodies are large (about 4000-5000 lines of code).
I can't figure out how to handle such a scenario.
In summary, I need to find an efficient way to parallelize the function calls in OpenACC.
As Mark Ebersole said, OpenACC 2.0 is the solution. The routine directive in 2.0 allows marking functions as device targets, so they can be called from within parallel regions; a minimal sketch follows.
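A sketch of the routine directive in C (the function and data are illustrative, not from the question):
#include <stdio.h>

/* Mark the function as compilable for the device; "seq" means each
   call runs sequentially within a single worker. */
#pragma acc routine seq
static double square(double x)
{
    return x * x;
}

int main(void)
{
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++)
        a[i] = (double)i;

    /* The call to square() inside the loop is now legal because the
       routine directive generated a device version of the function. */
    #pragma acc parallel loop reduction(+:sum) copyin(a)
    for (int i = 0; i < 1000; i++)
        sum += square(a[i]);

    printf("sum = %f\n", sum);
    return 0;
}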

NSThread or Python's threading module in PyObjC?

I need to make some network-bound calls (e.g., fetching a website), and I don't want them to block the UI. Should I be using NSThread or Python's threading module if I am working in PyObjC? I can't find any information on how to choose one over the other. Note that I don't really care about Python's GIL, since my tasks are not CPU-bound at all.
It will make no difference; you get the same behavior with slightly different interfaces. Use whichever fits best into your system.
Learn to love the run loop. Use Cocoa's URL-loading system (or, if you need plain sockets, NSFileHandle) and let it call you back when the response (or failure) comes in. Then you don't have to deal with threads at all (the URL-loading system will use a thread for you); a sketch follows below.
Pretty much the only time to create your own threads in Cocoa is when you have a large task (>0.1 sec) that you can't break up.
(Someone might say NSOperation, but NSOperationQueue is broken and RAOperationQueue doesn't support concurrent operations. That's fine if you already have a bunch of NSOperationQueue code or really want to prepare for a working NSOperationQueue, but if you need concurrency now, use the run loop or threads.)
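In PyObjC, the asynchronous URL-loading approach might look roughly like this. A minimal sketch, assuming the era-appropriate NSURLConnection delegate API; the class name is illustrative and error handling is omitted:
import objc
from Foundation import (NSObject, NSURL, NSURLRequest,
                        NSURLConnection, NSMutableData)

class Fetcher(NSObject):
    def init(self):
        self = objc.super(Fetcher, self).init()
        if self is None:
            return None
        self.received = NSMutableData.data()
        return self

    def fetch_(self, url_string):
        request = NSURLRequest.requestWithURL_(NSURL.URLWithString_(url_string))
        # Schedules the load on the current run loop; the delegate
        # callbacks below arrive on this same thread, so the UI is
        # never blocked and no locking is needed.
        NSURLConnection.connectionWithRequest_delegate_(request, self)

    def connection_didReceiveData_(self, connection, data):
        self.received.appendData_(data)

    def connectionDidFinishLoading_(self, connection):
        print("received %d bytes" % self.received.length())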
I'm more fond of the native Python threading solution, since I can join and keep references to threads. AFAIK, NSThread doesn't support joining or cancelling, and you can get a variety of things done with Python threads.
Also, it's a bummer that NSThread entry points can't take multiple arguments, and though there are workarounds (like passing an NSDictionary or NSArray, as sketched below), it's still not as elegant or as simple as invoking a thread with its arguments laid out in order as the corresponding parameters.
But yeah, if the situation demands NSThread, there shouldn't be any problem at all. Otherwise, it's fine to stick with native Python threads.
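The workaround mentioned above might look like this in PyObjC (a sketch; the Worker class, its method, and the dictionary keys are illustrative):
from Foundation import NSObject, NSThread

class Worker(NSObject):
    def run_(self, params):
        # An NSThread entry point takes a single object, so the usual
        # workaround is to bundle the arguments into one dictionary.
        host = params["host"]
        port = params["port"]
        # ... do the network work here ...

worker = Worker.alloc().init()
NSThread.detachNewThreadSelector_toTarget_withObject_(
    "run:", worker, {"host": "example.com", "port": 80}
)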
I have a different suggestion, mainly because Python threading is just plain awful under the GIL (Global Interpreter Lock), especially when you have more than one CPU core. There is a video presentation by a Google employee that goes into this in excruciating detail, but I cannot find it right now.
Anyway, you may want to think about using the subprocess module instead of threading (have a helper program that you can execute, or use another binary on the system), as sketched below. Or use NSThread; it should give you more performance than CPython threads.
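A minimal sketch of the subprocess idea, with curl standing in for whatever helper actually does the network work (the helper choice is illustrative):
import subprocess

# Start the helper without blocking: the child process does the
# network I/O, so the parent's GIL never comes into play.
proc = subprocess.Popen(
    ["curl", "-s", "http://example.com/"],
    stdout=subprocess.PIPE,
)

# Later (e.g. from a periodic run-loop timer), poll instead of blocking:
if proc.poll() is not None:
    body = proc.stdout.read()
    print("fetched %d bytes" % len(body))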
