OpenACC: How to parallelize function calls

I am working on a project and trying to parallelize the application. There are some functions I am trying to parallelize, but these functions call other functions very frequently. The loops are only for computation, and there are many loops in each function body.
I know OpenACC does not support function calls (only inlined calls) within its directives, so I have come up with two approaches:
a) Just put OpenACC directives around the loops in each function body to get the required parallelism, and leave the function calls as they are.
b) Move the bodies of the called functions into the calling function, so the overhead of entering the acc directive multiple times is minimized (by including a large number of loops in one block). But this seems like a big headache, because the function bodies are large (about 4000-5000 lines of code).
I can't figure out how to handle such a scenario. In summary, I need to find an efficient way to parallelize the function calls in OpenACC.

As Mark Ebersole said, OpenACC 2.0 is the solution: the routine directive in 2.0 allows marking functions as device targets.


Best Practices for Multiple OnEdit Functions

Problem
I have 6 onEdit functions which work as intended individually, but together they don't: some simply don't trigger.
Properties of the Script
They have different names - function onEdit(e) {code}, function onEdit1(e1) {code}, function onEdit2(e2) {code}, function onEdit3(e3) {code}, function onEdit4(e4) {code}, function onEdit5(e5) {code}
They are all in the same .gs tab
Some of them have the same variable names. For example, onEdit has var range = e.range; and onEdit5 has var range = e5.range;
My Understanding
I believe that you can run multiple OnEdit functions within the same .gs tab. Is this correct? Or do I need to somehow create new .gs tabs?
I believe that my onEdit functions should be named differently, so they are called correctly. Is this correct, or should I be getting rid of the different functions and putting them into one massive function? (I imagine this would lead to slower execution and more cases of not being able to isolate incorrect code).
I believe that the variables that are created within each function are specific to that function. Is this true? Or are they impacting each other?
Why I'm asking this
Iterations of this question seem to have been asked before. But people generally give advice on integrating two functions into one big one, rather than preparing someone to integrate 10-20 different OnEdit functions. Nor do they give a clear indication of best coding practices.
I've spent hours reading through this subject and feel that people new to scripts, like me, would greatly benefit from knowing this.
Thank you in advance for any contributions!
Notes:
There can only be one function with a given name. If there are two, the latter overwrites the former; it's as if the former never existed.
A function named onEdit is triggered automatically on (you guessed it!) edit.
There's no simple trigger for other names like onEdit1 or onEdit2.
Simple triggers are limited to 30 seconds of execution.
So, in a single code.gs file, or even in a single project, only one function named onEdit can exist and trigger successfully.
If you create multiple projects, onEdit will trigger in each project asynchronously. But there are limits to the number of projects that can be created, and other quotas will apply.
Alternatively, you can use installable triggers, which don't have the 30-second limit. You can also use any name for your function.
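A common workaround, sketched below with hypothetical handler names, is to keep one real onEdit that dispatches to the others: each handler inspects the event and decides for itself whether it applies.

```javascript
// Apps Script only auto-triggers a function literally named onEdit,
// so it acts as a dispatcher and delegates to the real handlers.
function onEdit(e) {
  handleCheckboxClear(e);
  handleTimestamp(e);
  // ...call the remaining handlers here
}

// Hypothetical handler: only reacts to edits in column A
// (edit events expose columnStart on e.range).
function handleCheckboxClear(e) {
  if (e.range.columnStart !== 1) return 'skipped';
  // real code would clear a range here
  return 'handled';
}

// Hypothetical handler: only reacts to edits in column B.
function handleTimestamp(e) {
  if (e.range.columnStart !== 2) return 'skipped';
  return 'handled';
}
```

Because the handlers are plain functions with their own local variables, they stay as easy to test and debug in isolation as six separate onEdit functions would be.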
The best way to optimize these functions is to never touch the spreadsheet unless it is absolutely necessary. For example, sorting values inside the script is better than repeatedly calling .sort on multiple ranges. The less interaction between sheets and scripts, the better. A highly optimized script will only require two calls to the spreadsheet: one to get the data and one to set the data.
After minimizing the number of calls to the sheet, you can optimize the script itself: control the logic so that only the necessary operations are done for each edit. For example, suppose A1 and B1 are checkboxes that, when clicked, clear A2:A10 and B2:B10 respectively. If the edit is in A1, you should check whether A1 was clicked and, if so, clear the range and exit, without checking B1 as well. Script optimization requires at least a basic knowledge of JavaScript objects. Nevertheless, it isn't as effective as reducing the number of calls to the sheet, which is the slowest part of any Apps Script.
References:
Best practices

BigQuery JavaScript UDF process - per row or per processing node?

I'm thinking of using BigQuery's JavaScript UDF as a critical component in a new data architecture. It would be used to logically process each row loaded into the main table, and also to process each row during periodical and ad-hoc aggregation queries.
Using an SQL UDF for the same purpose seems to be unfeasible because each row represents a complex object, and implementing the business logic in SQL, including things such as parsing complex text fields, gets ugly very fast.
I just read the following in the Optimizing query computation documentation page:
Best practice: Avoid using JavaScript user-defined functions. Use native UDFs instead.
Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.
I understand why a new process for each processing node is needed, and I know that JS tends to be deployed in a single-thread-per-process manner (even though v8 does support multithreading these days). But it's not clear to me if once a JS runtime process is up, it can be expected to get reused between calls to the same function (e.g. for processing different rows on the same processing node). The amount of reuse will probably significantly affect the cost. My table is not that large (tens to hundreds of millions of rows), but still I need to have a better understanding here.
I could not find any authoritative source on this. Has anybody done any analysis of the actual impact of using a JavaScript UDF on each processed row, in terms of execution time and cost?
If it's not documented, then that's an implementation detail that could change. But let's test it:
CREATE TEMP FUNCTION randomThis(views INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  if (typeof variable === 'undefined') {
    variable = Math.random()
  }
  return variable
""";

SELECT randomThis(views), COUNT(*) c
FROM (
  SELECT views
  FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC
I was expecting ten million different numbers, or a handful, but I only got one: The same process was reused ten million times, and variables were kept around in between calls.
This even happened when I went up to 100 million, signaling that parallelism is bounded by one JS VM.
Again, these are implementation details that could change. But while it stays that way, you can make the best use out of it.
I was expecting ten million different numbers, or a handful, but I only got one
That's because you didn't allow Math.random to be called more than once
and variables were kept around in between calls
due to the variable defined at the global scope.
In other words, your code explicitly permits Math.random to be executed only once (by implicitly defining the variable at the global scope).
If you try this:
CREATE TEMP FUNCTION randomThis(seed INT64)
RETURNS FLOAT64
LANGUAGE js AS """
  let ret = undefined
  if (ret === undefined) {
    ret = Math.random()
  }
  return ret
""";

SELECT randomThis(size), COUNT(*) c
FROM (
  SELECT repository_size as size
  FROM `my-internal-dataset.sample-github-table`
  LIMIT 10000000
)
GROUP BY 1
ORDER BY 2 DESC
then you get many rows. And now it takes much longer to execute, probably because the single VM becomes a bottleneck. (I used another dataset here to reduce the query cost.)
Conclusion:
1. There is one VM (or maybe a container) per query to support JS UDFs. This is in line with the single subprocess ("Calling a JavaScript UDF requires the instantiation of a subprocess") mentioned in the documentation.
2. If you can apply an execute-once pattern (using some kind of cache, or a coding technique like memoisation) and write a UDF similar to the one in the previous answer, then the mere presence of a JS UDF has a limited impact on your query.
3. If you have to write a JS UDF like the one in this answer, then the impact on your query becomes very significant, with execution time skyrocketing even for simple JS code. In that case it's certainly better to stay away.
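The execute-once pattern in point 2 is ordinary memoisation. A plain-JavaScript sketch (the parsing step is a stand-in for real business logic, not BigQuery-specific code):

```javascript
// The cache lives at the global scope, so on a reused VM it survives
// between calls and the expensive path runs once per distinct input.
const cache = {};
let expensiveCalls = 0; // instrumentation: counts cache misses

function parseField(key) {
  if (key in cache) return cache[key];
  expensiveCalls++;
  const result = key.split(',').length; // stand-in for complex parsing
  cache[key] = result;
  return result;
}
```

Called twice with the same input, parseField does the work only once; inside a BigQuery JS UDF the same trick pays off precisely because, as the experiment above shows, globals persist across rows on the same VM.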

For loops run SUPER slow in Julia when outside a function

There is a very peculiar slowdown in Julia. When running, for example, a for loop by calling a function:
function TestFunc(num)
    for i = 1:num
    end
end
It is MUCH faster than when I just run a for loop for the exact same num:
for i = 1:num
end
The slowdown isn't marginal either; it is orders of magnitude. The following image shows me running it.
[image: For Loop Code]
In some of my other code, the opposite actually happens but I just feel like I am missing something fundamental about the way Julia runs. How do I keep my code optimal and why do these differences exist?
Anything you can write outside a function, you can write inside a function. So, just like in C, you can write

function main()
    print("Hello World\n")
end

main()

So just pretend it is a C program and write your stuff inside the main() function.
Why is it so slow outside a function? Because any variable inside a function is protected from being changed by another thread or task. A for loop in the global scope must check the type of its variables every time they are accessed, in case they were changed by another thread or task. All this checking slows it down, for safety.
The first Performance Law of Julia is
Global is slow
The performance tips in the Julia Documentation says
A global variable might have its value, and therefore its type, change at any point. This makes it difficult for the compiler to optimize code using global variables. Variables should be local, or passed as arguments to functions, whenever possible.
Any code that is performance critical or being benchmarked should be inside a function.

What does "emit" mean in general computer science terms?

I just stumbled on what appears to be a generally-known compsci keyword, "emit". But I can't find any clear definition of it in general computer science terms, nor a specific definition of an "emit()" function or keyword in any specific programming language.
I found it here, reading up on MapReduce:
https://en.wikipedia.org/wiki/MapReduce
The context of my additional searches show it has something to do with signaling and/or events. But it seems like it is just assumed that the reader will know what "emit" is and does. For example, this article on MapReduce patterns:
https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
There's no mention of what "emit" is actually doing, there are only calls to it. It must be different from other forms of returning data, though, such as "return" or simply "printf" or the equivalent, else the calls to "emit" would be calls to "return".
Further searching, I found a bunch of times that some pseudocode form of "emit" appears in the context of MapReduce. And in Node.js. And in Qt. But that's about it.
Context: I'm a (mostly) self-taught web programmer and system administrator. I'm sure this question is covered in compsci 101 (or 201?) but I didn't take that course.
In the context of web and network programming:
When we call a function, the function may return a value.
When we call a function that is supposed to send its results to another function, we no longer use return; instead, we use emit. We expect the function to emit its results to another function on our behalf.
A function can both return results and emit events.
I've only ever seen emit() used when building a simple compiler in academia.
Upon analyzing the grammar of a program, you tokenize its contents and emit (push out) assembly instructions. (The compiler program we wrote actually contained an internal function called emit, mirroring that theoretical/logical aspect.)
Once the grammar analysis is complete, the assembler will take the assembly instructions and generate the binary code (aka machine code).
So, I don't think there is a general CS definition for emit; however, I do know it is used in the pseudocode (and sometimes, actual code) for writing compiler programs. And that is undergraduate level computer science education in the US.
I can think of three contexts in which it's used:
Map/Reduce functions, where some input value causes 0 or more output values to go into the Reduce function
Tokenizers, where a stream of text is processed, and at various intervals, tokens are emitted
Messaging systems
I think the common thread is the "zero or more". A return provides exactly one value back from a function, whereas an "emit" is a function call that could take place zero times or several times.
In the context of the MapReduce programming model, an operation of a map nature takes an input value and emits a result, which is nothing more than a transformation of the input.

How to realize nested parallelism in R on the Windows platform

I am trying to use parApply() from the parallel package in R.
cl <- makeCluster(16)
cl.boot <- makeCluster(8)
In my program, I call t(parApply(cl, rv, 1, sim.one.test)) first. The function sim.one.test calls boot(), and in boot() I use
bs.resample <- t(parApply(cl.boot, rv.boot, 1, function(x) bs.mle(n1,n2,x,s,t1,t2,m,theta)))
Simply put, the outer function is sim.one.test() and the inner one is bs.mle().
The error message is "invalid connection". I guess this is because nested parallelism is not supported. Another question on Stack Overflow suggests mclapply(), but that only works on Linux, and I am running the program on Windows. Is there any solution for nested parallel computing on the Windows platform? Thanks.
Why do you think you need nested parallelization? It will just increase your parallelization overhead (if it works at all, which I doubt). Conceptually, it is far better to parallelize only the outer loop (provided it contains enough iterations and is more or less load-balanced).
However, you could use nested foreach loops with a parallel backend. That converts your nested loops into a single loop before sending the work to the workers.
