Joblib, Parallel and batch_size

Let's say I have a generator/list of size 50,000, and I want to use it as:
batches = range(0, 50000)
Parallel(n_jobs=multiprocessing.cpu_count(), verbose=100,
         backend="threading", batch_size=?, pre_dispatch=?)(delayed(<function>)(it) for it in batches)
Can you tell me what the correct values for batch_size and pre_dispatch would be if I want to process 20,000 items at a time, i.e. have a single thread/CPU process 20,000 items as one task?

To have a single CPU process 20,000 items as one task, set batch_size=20000.
pre_dispatch defaults to '2*n_jobs'. In the logic of keeping the CPUs busy, it should probably be 2*batch_size here, so pre_dispatch=40000. It is also tempting to set pre_dispatch='all', but that only works well if tasks take roughly the same amount of time; 2*batch_size avoids one CPU sitting idle when there is a big discrepancy in time per task.
In my own experience, it is better to use a higher batch_size when each operation is fast, to reduce I/O overhead.
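For concreteness, a minimal sketch of the resulting call (the process_item function is a placeholder I made up; the 50,000-item range, threading backend, and verbose setting come from the question):

import multiprocessing
from joblib import Parallel, delayed

def process_item(it):
    # placeholder for the real per-item work
    return it

batches = range(0, 50000)
results = Parallel(n_jobs=multiprocessing.cpu_count(), verbose=100,
                   backend="threading", batch_size=20000,
                   pre_dispatch=40000)(delayed(process_item)(it) for it in batches)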

Is there a way to calculate progress rate without total process count?

I think this is a difficult problem.
In general, to compute a progress rate I need both the total count and the current count.
But in this case, I cannot get the total count.
For example, say there are two jobs, A and B.
Their total process counts are set randomly, and I cannot know a job's total count until it has finished.
One option is to assign a fixed share to each job, e.g. when A is done, set the rate to 50%.
But if A's count is 10 and B's count is 1000, that produces a strange result: although the total count is 1010, the progress reads 50% after only 10 items are done.
So I want to show users a more natural progress rate, but I don't have a total process count.
Is there any useful alternative to the generic percentage calculation?
If you want to know how much total progress you have made without knowing how much total progress there could be, that is logically impossible.
However, you could:
estimate it
keep historical data
assume the maximum and just surprise the user when it's faster
To instead show the rate of progress:
record the time at the start of your process and subtract it from the time when you check again
divide the completed jobs by that elapsed time to get jobs per second
Roughly:
rate = jobs_completed / (time_now - time_start)
You can also do this over a window, but then you need to record both the time and the number of jobs completed at the start of the window, and subtract both off to get just the jobs completed within that window:
rate_windowed = (jobs_completed - jobs_previous) / (time_now - time_previous)
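A small Python sketch of both formulas (variable names are just illustrative):

import time

time_start = time.monotonic()
jobs_completed = 0
time_previous, jobs_previous = time_start, 0

# ... jobs run; increment jobs_completed as each one finishes ...

time_now = time.monotonic()
# overall rate since the start
rate = jobs_completed / (time_now - time_start)
# rate over just the last window, then roll the window forward
rate_windowed = (jobs_completed - jobs_previous) / (time_now - time_previous)
time_previous, jobs_previous = time_now, jobs_completed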

Achieving interactive large-dataset map-reduce on AWS/GCE in the least lines of code / script

I have 1 billion rows of data (about 400GB uncompressed; about 40GB compressed) that I would like to process in map-reduce style, and I have two executables (binaries, not scripts) that can handle the "map" and "reduce" steps. The "map" step can process about 10,000 rows per second, per core, and its output is approximately 1MB in size, regardless of the size of its input. The "reduce" step can process about 50MB / second (excluding IO latency).
Assume that I can pre-process the data once, to do whatever I'd like such as compress it, break it into pieces, etc. For simplicity, assume input is plain text and each row terminates with a newline and each newline is a row terminator.
Once that one-time pre-processing is complete, the goal is to be able to execute a request within 30 seconds. So, if my only bottleneck is the map job (which I don't know to be true; it could very well be the IO), and assuming I can do all the reduce jobs in under 5 seconds, then I would need about 425 8-core computers, all processing different parts of the input data, to complete the run in time.
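For reference, the back-of-the-envelope arithmetic behind that estimate, using only the numbers given above (the exact machine count depends on how much of the 30-second budget you reserve for the reduce step):

rows = 1_000_000_000
rows_per_second_per_core = 10_000
core_seconds = rows / rows_per_second_per_core   # 100,000 core-seconds of map work
budget_seconds = 30
cores_needed = core_seconds / budget_seconds     # ~3,333 cores
machines_needed = cores_needed / 8               # ~417 8-core machines, in the same ballpark as the ~425 above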
Assuming you have the data, the two map/reduce executables, and unlimited access to AWS or GCE, what is a solution to this problem that I can implement with the fewest lines of code and/or script (without ignoring potential IO or other non-CPU bottlenecks)?
(As an aside, it would also be interesting to know what would execute with the fewest nodes, if that differs from the solution with the fewest SLOC.)

Redis CPU performance on sorted sets

We are running Redis and doing hundreds of increments per second on keys in a sorted set, while at the same time doing thousands of reads per second on that sorted set.
This seems to be working well, but during peak load CPU usage gets pretty high, around 80% of a single core. The sorted set itself has a small memory footprint of a few thousand keys.
Is the CPU usage increase more likely due to the hundreds of increments per second or the thousands of reads? I understand both affect performance, but which has the larger impact?
Given this, what are some of the best metrics to monitor on my production instance to track down these bottlenecks?
One point to check is whether the sorted sets are small enough to be kept in Redis's compact encoding or not. For instance, the DEBUG OBJECT command could be applied to a sample of sorted sets to check whether they are encoded as ziplists or not.
Ziplist usage trades memory against CPU, especially when the size of the sorted set is close to the thresholds (zset-max-ziplist-entries and zset-max-ziplist-value in the configuration file).
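For example, a quick check with the redis-py client (the key name my:zset is made up; redis-cli OBJECT ENCODING works just as well):

import redis

r = redis.Redis(host="localhost", port=6379)

# "ziplist" (or "listpack" on recent Redis versions) vs. "skiplist"
print(r.object("encoding", "my:zset"))

# the configuration thresholds mentioned above
print(r.config_get("zset-max-ziplist-entries"))
print(r.config_get("zset-max-ziplist-value"))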
Supposing the sorted sets are not ziplist encoded, I would say the CPU usage is more likely due to the thousands of reads per second than the hundreds of updates per second. An update of a zset is an O(log n) operation; it is very fast, and there is no locking-related latency with Redis. A read of the zset items is an O(n) operation and may result in a large buffer to build and return to the client.
To be sure, you may want to generate the read-only traffic, check the CPU, then stop it, generate the update traffic, check the CPU again, and compare.
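A rough sketch of that experiment with redis-py (the key name, member, rates, and durations are made-up placeholders; watch the Redis server's CPU separately, e.g. with top or INFO, while each phase runs):

import time
import redis

r = redis.Redis()
KEY = "my:zset"

def read_only_traffic(seconds=30):
    # full-range reads, issued as fast as the client can
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        r.zrange(KEY, 0, -1, withscores=True)

def update_traffic(seconds=30):
    # a few hundred increments per second
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        r.zincrby(KEY, 1, "member:42")
        time.sleep(0.005)

read_only_traffic()   # phase 1: check server CPU during this
update_traffic()      # phase 2: check server CPU again and compare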
The zset read performance should be close to the LRANGE performance you can find in the Redis benchmark. A few thousand TPS for zsets containing a thousand items each seems in line with typical Redis performance.

How can I determine the appropriate number of tasks with GCD or similar?

I very often encounter situations where I have a large number of small operations that I want to carry out independently. In these cases, the number of operations is so large compared to the time each operation takes that simply creating a task per operation is inappropriate due to overhead, even though GCD overhead is typically low.
So what you'd want to do is split up the number of operations into nice chunks where each task operates on a chunk. But how can I determine the appropriate number of tasks/chunks?
Testing and profiling. What makes sense, and what works well, is application-specific.
Basically you need to decide on two things:
The number of worker processes/threads to generate
The size of the chunks they will work on
Play with the two numbers, and calculate their throughput (tasks completed per second * number of workers). Somewhere you'll find a good equilibrium between speed, number of workers, and number of tasks in a chunk.
You can make finding the right balance even simpler by feeding your workers a bunch of test data, essentially a benchmark, and measuring their throughput automatically while adjusting these two variables. Record the throughput for each combination of worker count and chunk size, and output it at the end. The highest throughput is your best combination.
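A minimal sketch of such a benchmark in Python (the work function and the candidate worker/chunk values are stand-ins; the same idea applies to a GCD-based implementation):

import time
from itertools import product
from multiprocessing import Pool

def process_item(x):
    # stand-in for one small operation
    return x * x

def process_chunk(chunk):
    return [process_item(x) for x in chunk]

def benchmark(items, workers, chunk_size):
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    start = time.perf_counter()
    with Pool(workers) as pool:
        pool.map(process_chunk, chunks)
    return len(items) / (time.perf_counter() - start)  # items per second

if __name__ == "__main__":
    items = list(range(1_000_000))
    results = {}
    for workers, chunk_size in product([2, 4, 8], [1_000, 10_000, 100_000]):
        results[(workers, chunk_size)] = benchmark(items, workers, chunk_size)
    best = max(results, key=results.get)
    print("best (workers, chunk_size):", best, "->", round(results[best]), "items/sec")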
Finally, if how long a particular task takes really depends on the task itself (e.g. some tasks take X time while others take 3X), then you can take a couple of approaches. Depending on the nature of your incoming work, try one of the following:
Feed your benchmark historical data - a bunch of real-world data to be processed that represents the actual kind of work that will come into your worker grid, and measure throughput using that example data.
Generate random-sized tasks that cross the spectrum of what you think you'll see, and pick the combination that seems to work best on average, across multiple sizes of tasks
If you can inspect a task's data, and that data gives you an idea of whether the task will take X time or 3X (or something in between), you can use that information before processing the tasks to dynamically adjust the worker/chunk size and achieve the best throughput for the current workload. This is similar to the approach taken with Amazon EC2, where customers spin up extra VMs when needed to handle higher load and spin them back down when load drops.
Whatever you choose, any open performance question should almost always involve some kind of benchmarking if the speed at which it runs is critical to the success of your application (sometimes the time to process is so small that it's negligible).
Good luck!

Performance problem: CPU intensive work performs better with more concurrency in Erlang

tl;dr
I'm getting better performance from my Erlang program when I perform my CPU-intensive tasks at higher concurrency (e.g. 10K at once vs. 4). Why?
I'm writing a map-reduce framework in Erlang, and I'm doing performance tests.
My map function is highly CPU intensive (mostly pure calculation). It also needs access to some static data, so I have a few persistent worker processes on my machine (lingering, i.e. living through the application's life cycle), each holding part of this data in memory and awaiting map requests. The output of map is sent to the manager process (which sent out the map requests to the workers), where the (very lightweight) reduce is performed.
Anyway, I noticed that I get better throughput when I immediately spawn a new process for each map request a worker receives, rather than letting the worker process synchronously perform the map requests one by one (thus leaving a bunch of map requests in its process queue, because I'm firing the map requests all at once).
Code snippet:
%% When I remove the comment, I get significant performance boost (90% -> 96%)
%% spawn_link(fun()->
%% One invocation uses around 250ms of CPU time
do_map(Map, AssignedSet, Emit, Data),
Manager ! {finished, JobId, self(), AssignedSet, normal},
%% end),
Compared to performing the same calculation in a tight loop, I get 96% throughput (efficiency) using the "immediately spawn" method (e.g. 10,000 map-reduce jobs running completely in parallel). When I use the "worker performs one-by-one" method, I get only around 90%.
I understand Erlang is supposed to be good at concurrency, and I'm impressed that efficiency doesn't change even when I perform 10K map-reduce requests at once as opposed to 100. However, since I have only 4 CPU cores, I'd expect better throughput at lower concurrency, like 4 or maybe 5.
Weirdly, my CPU usage looks very similar in the two implementations (almost completely pegged at 100% on all cores). The performance difference is quite stable: even when I do just 100 map-reduce jobs, I still get around 96% efficiency with the "immediately spawn" method and around 90% with the "one-by-one" method. The same holds when I test with 200, 500, 1000, or 10K jobs.
I first suspected that queuing at the worker process was the culprit, but even when there should only be something like 25 messages in the worker's queue, I still see the lower performance. 25 messages seems too few to cause a clog (I am doing selective message matching, but not in a way that would force the process to put messages back on the queue).
I'm not sure how I should proceed from here. Am I doing something wrong, or am I completely missing something?
UPDATE
I did some more tests and found that the performance difference can disappear depending on conditions (particularly on how many worker processes I divide the static data across). Looks like I have much more to learn!
Assuming one worker process with three map actions, the first variant looks like this:
_______ _______ _______
| m | | m | | m |
| | | | | |
_| |_| |_| |_
a a a r
Here a is the administrative work (reading from the message queue, dispatching the map, etc.), m is the actual map, and r is sending back the result. The second variant, where a process is spawned for every map:
_________________._
| m r
| ___________________._
| | m r
| | _____________________._
_|_|_| m r
a a a
As you can see, the administrative tasks (a) go on at the same time as the maps (m) and at the same time as sending back the results (r).
This will keep the CPU busy with map (i.e. calculation intensive) work all the time, as opposed to having short dips every now and then. This is most likely the small gain you see in throughput.
As you already have quite high concurrency from the beginning, you only see a relatively small gain in throughput. Compared to theoretically running only one worker process (as in the first variant), you would see much bigger gains.
First, let me remark that this is a very interesting question. I'd like to give you some hints:
Task switching occurs per run queue ([rq:x] in the shell) and is driven by reductions: each time an Erlang process calls a BIF or user-defined function, it increments its reduction counter. Running CPU-intensive code in one process increments that counter very often, and when the counter reaches a certain threshold, a process switch occurs. (So one long-lived process has the same overhead as multiple shorter-lived processes: they accumulate the "same" total reduction count and trigger switches at the same threshold, e.g. one process with 50,000 reductions vs. 5 processes with 10,000 reductions each = 50,000 reductions.) (Runtime reasons)
Running on 4 cores vs. 1 core makes a difference, but the difference here is timing. The reason your cores are at 100% is that one or more cores are doing the mapping while the other(s) are effectively "filling" your message queue. When you spawn a process per map, less time is spent filling the message queue and more time is spent mapping. Apparently mapping is a more costly operation than filling the queue, so giving it more cores increases performance. (Timing/tuning reasons)
You'll get higher throughput from increased concurrency when processes spend time waiting (receiving messages, calling OTP servers, etc.). For instance, requesting data from your static persistent workers takes some time. (Language reasons)

Resources