Nesting GNU parallel over many nodes - parallel-processing

I wish to run a loop, using GNU parallel, that has the following form:

outer loop (about 5-10 iterations, can be parallelized)
    loop that can't be parallelized
        inner loop (hundreds of operations that can run in parallel)
        inner-loop part that can't be parallelized

Due to the nature of the unparallelizable loop there is no way of making this a single parallel command.
I would like at least one more iteration of the outer loop to start so that there is something to do while the inner loop is in its single-threaded part.
I would also like to avoid starting significantly more jobs than there are CPU cores, though this seems difficult to arrange when nesting parallel commands.
Worse, if I specify a node file for the inner loop, it seems the nodes will be used in the same order every time the inner loop is run, leading to very poor scaling as the node count grows. There is the obvious workaround of using multiple node files that list the nodes in different orders.
Is there a good way of controlling the number of simultaneous jobs in a nest like this? A named semaphore seems a possibility, passing a different node file on each run seems clumsy, or am I using the wrong tool?

The easy solution is simply to only parallelize the inner loop.
But --load may also work for you:
inner() {
    # The part of the inner loop that can run in parallel
    # Do inner stuff
}
export -f inner

outer() {
    # Run the inner jobs on the remote servers given by -S .. (the default
    # sshlogin file), only starting new jobs on servers whose load average
    # is below 100% (one per CPU core)
    parallel --load 100% -S .. --env inner inner ::: {1..100}
}
export -f outer

parallel outer ::: {5..10}

Related

performance of running select queries in parallel

(Note: this isn't about parallel execution of a query inside the RDBMS, but about the performance characteristics of submitting queries in parallel.)
I have a process that executes thousands (if not tens of thousands) of queries in a single-threaded manner (i.e. send, wait for the response, process, send, ...), loosely of the form
select a,b from table where id = 123
i.e. querying a single record on an already-indexed field, on an Oracle database.
This process takes longer than desired, and after gathering some metrics on it I'm sure that 90% of the time is spent in server-side execution (and transport) rather than on the client side.
This process can naturally be split into N 'jobs', and it's been suggested that this could/should speed up the process.
Naively you would expect it to run N times quicker (with a small overhead to merge the answers).
Given that (loosely speaking) SQL is 'serialised', is this actually the case? That would imply that it would probably not run any quicker at all.
I assume that for an update on a single record (for example) N updates would have to be effectively serialised, but for N reads, this may not be the case.
Which theory is the more accurate (or is it neither)?
I'm not a DBA, but it looks like reads never block reads, so assuming infinite resources the theory would be that N reads can run completely in parallel with no blocking. For mixed writes and reads it gets more complex, depending on how you set up your transactions/locks, but that's out of scope for me.
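As a rough sketch of how the client-side split might look, using Python's concurrent.futures with one connection per worker; the cx_Oracle driver, the DSN, the table name my_table, and the id list here are just placeholders:
import concurrent.futures
import cx_Oracle   # assumed driver; any DB-API driver would look similar

IDS = list(range(1, 10_001))    # the ids to look up (placeholder)
N_WORKERS = 8                   # number of parallel "jobs"

def fetch_chunk(ids):
    # One connection per worker so the queries really run concurrently.
    conn = cx_Oracle.connect("user/password@dbhost/service")  # placeholder DSN
    cur = conn.cursor()
    rows = []
    for id_ in ids:
        cur.execute("select a, b from my_table where id = :id", id=id_)
        rows.append(cur.fetchone())
    conn.close()
    return rows

# Round-robin split of the ids into N_WORKERS jobs.
chunks = [IDS[i::N_WORKERS] for i in range(N_WORKERS)]
with concurrent.futures.ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    results = [row for rows in pool.map(fetch_chunk, chunks) for row in rows]
Threads are enough here because most of the time is spent waiting on the server rather than in Python; a process pool would work just as well.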

Best way to split computation too large to fit into memory?

I have an operation that's running out of memory when using a batch size greater than 4 (I normally run with 32). I thought I could be clever by splitting this one operation along the batch dimension using tf.split, running it on a subset of the batch, and then recombining with tf.concat. For some reason this doesn't work and results in an OOM error. Just to be clear, if I run on a batch size of 4 it works without splitting. If instead I run on a batch size of 32, and even if I perform a 32-way split so that each individual element is run independently, I still run out of memory. Doesn't TF schedule separate operations so that they do not overwhelm memory? If not, do I need to explicitly set up some sort of conditional dependence?
I discovered that the functional ops, specifically map_fn in this case, address my needs. By setting the parallel_iterations option to 1 (or some small number that would make the computation fit in memory), I'm able to control the degree of parallelism and avoid running out of memory.
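A rough sketch of that approach (heavy_op and the shapes here are just stand-ins):
import tensorflow as tf

def heavy_op(x):
    # Stand-in for the memory-hungry per-example computation
    return tf.reduce_sum(tf.square(x), axis=-1)

batch = tf.random.normal([32, 1024, 1024])   # hypothetical oversized batch

# parallel_iterations=1 makes map_fn process one element at a time,
# bounding peak memory at the cost of parallelism.
result = tf.map_fn(heavy_op, batch, parallel_iterations=1)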

How to understand the progress bar for take() in spark-shell

I called the take() method on an RDD[LabeledPoint] from spark-shell, which turned out to be a laborious job for Spark.
The spark-shell shows a progress bar:
The progress bar fills again and again, and I don't know how to produce a reasonable estimate of the time needed (or the total progress) from the numbers it shows.
Does anyone know what those numbers mean?
Thanks in advance.
The numbers show the Spark stage that is running, and the number of completed, in-progress, and total tasks in that stage. (See "What do the numbers on the progress bar mean in spark-shell?" for more on the progress bar.)
Spark stages run tasks in parallel. In your case 5 tasks are running in parallel at the moment. If each task takes roughly the same time, this should give you an idea of how much longer you have to wait for this stage to finish.
But RDD.take can take more than one stage. take(1) will first get the first element of the first partition. If the first partition is empty, it will take the first elements from the second, third, fourth, and fifth partitions. The number of partitions it looks at in each stage is 4× the number of partitions already checked. So if you have a whole lot of empty partitions, take(1) can take many iterations. This can be the case for example if you have a large amount of data, then do filter(_.name == "John").take(1).
If you know your result will be small, you can save time by using collect instead of take(1). This will always gather all the data in a single stage. The main advantage is that in this case all the partitions will be processed in parallel, instead of the somewhat sequential manner of take.
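As a rough PySpark sketch of the difference (the data and the filter here are just stand-ins):
from pyspark import SparkContext

sc = SparkContext("local[*]", "take-vs-collect")   # hypothetical local run

rdd = sc.parallelize(range(1_000_000), numSlices=100)
rare = rdd.filter(lambda x: x % 100_000 == 0)      # a very selective filter

# take(1) may run several jobs: it scans 1 partition first, then roughly
# 4x more partitions in each subsequent round until an element is found.
first = rare.take(1)

# collect() processes every partition in a single fully parallel stage;
# only safe when the filtered result is known to be small.
everything = rare.collect()

print(first, len(everything))
sc.stop()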

What are the differences between sequential consistency and quiescent consistency?

Can anyone explain the definitions of, and differences between, sequential consistency and quiescent consistency? In the simplest possible terms :|
I did read this: Example of execution which is sequentially consistent but not quiescently consistent
But I am not able to understand sequential and quiescent consistency themselves :(
Sequential consistency requires that operations appear to take effect in the order they are specified in each program. Basically it enforces program order within each individual process, while allowing all processes to observe a single common order of operations. Let's say we have 2 processes enqueuing and dequeuing items on a queue q:
P1 -- q.enq(x) -----------------------------
P2 -------------- q.enq(y) ---- q.deq():y --
This is not the expected behaviour from a FIFO queue. We'd expect to dequeue x because P1 enqueues x before P2 enqueues y. However this scenario is allowed in the sequential consistency model, because sequential consistency doesn't require the order seen by all processes to match the real-time order. There's at least one sequential execution that can explain these results, for example:
P2:q.enq(y) P1:q.enq(x) P2:q.deq():y
In this execution each process performs its operations in program order, i.e. in the order in which they're specified within that process.
Quiescent consistency requires non-overlapping operations to appear to take effect in their real-time order, but overlapping operations might be reordered. Therefore, the same scenario is not allowed in the quiescent consistency model because we expect q.enq(x) to appear to take effect before q.enq(y), and q.deq() to return x instead of y. Also quiescent consistency doesn't necessarily preserve program order. If q.enq(x) and q.enq(y) would be concurrent (overlapping) operations, they could be reordered and q.deq():y would be quiescently consistent.
Basically some executions are sequentially consistent but not quiescently consistent, and vice versa.
First you should understand what program order is: it is literally how you expect your program to run, in the order in which its instructions appear.
But program order only applies within a single thread. With multiple threads a problem appears, because a global order may not hold or even exist, since sometimes you cannot tell which thread's method call happens first.
Quiescent consistency gives a clear order for the behaviour of all threads: any two method calls separated by a period of quiescence (i.e. non-overlapping calls) must appear to take effect in their real-time order.
Sequential consistency allows overlaps, but requires that you can find a single order into which all the method calls can be placed, consistent with each thread's program order, such that the object still returns correct values and behaves correctly.

Controlling number of lines to be written to the output file

I am new to Hadoop programming.
I have a situation in which I want to stop writing <k3,v3> to my output file after n lines.
In my program, I am sure that the output file will be sorted according to k3, but I don't want the entire list. I only want the first n.
Is there a mechanism in Hadoop to do this?
I couldn't find a class/API for this.
But you could increment a counter each time OutputCollector.collect() is called in the reduce function. When the counter reaches a certain value, stop calling OutputCollector.collect().
It's a waste of CPU cycles because the reduce task keeps on running even after n lines have been written to the output. There might be a better approach to the problem.
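The above refers to the Java OutputCollector API; as a rough sketch of the same counting idea in a Hadoop Streaming reducer (the tab-separated layout and the limit N are just assumptions):
#!/usr/bin/env python3
# Hypothetical streaming reducer: sum values per key, but stop emitting
# once N output lines have been written.
import sys

N = 100                      # maximum number of output lines (assumption)
emitted = 0
current_key, total = None, 0

def emit(key, value):
    global emitted
    if emitted < N:          # the counter check described above
        print(f"{key}\t{value}")
        emitted += 1

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if current_key is not None and key != current_key:
        emit(current_key, total)
        total = 0
    current_key = key
    total += int(value)

if current_key is not None:
    emit(current_key, total)
As noted above, the reducer still consumes all of its input even after the limit is reached, so the wasted-CPU caveat applies here as well.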
