Spring Reactor WebFlux Scheduler Parallelism

For fully non-blocking, end-to-end reactive calls, is it recommended to explicitly call publishOn or subscribeOn to switch schedulers? For either CPU-intensive or non-CPU-intensive tasks, is it favorable to always use a parallel flux to optimize performance?

For fully non-blocking, end-to-end reactive calls, is it recommended to explicitly call publishOn or subscribeOn to switch schedulers?
publishOn switches the Scheduler on which the downstream operators run, while subscribeOn switches the Scheduler on which the source is subscribed (i.e. where the upstream runs). So it really depends on what kind of work you want to perform.
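A minimal sketch of the difference (the schedulers and operators here are illustrative choices only, not a WebFlux recommendation):

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

Flux.range(1, 5)
    .map(i -> i * 2)                                  // runs on a boundedElastic thread (because of subscribeOn)
    .publishOn(Schedulers.parallel())
    .map(i -> i + 1)                                  // runs on a parallel thread (because of publishOn)
    .subscribeOn(Schedulers.boundedElastic())
    .subscribe(i -> System.out.println(Thread.currentThread().getName() + " -> " + i));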
For either CPU-intensive or non-CPU-intensive tasks, is it favorable to always use a parallel flux to optimize performance?
Absolutely not. Consider this example:
Flux.range(1, 10)
    .parallel(4)
    .runOn(Schedulers.parallel())
    .sequential()
    .elapsed()
    .subscribe(i -> System.out.printf(" %s ", i));
The above code is wasteful because each element is processed almost instantly, so the parallel rails only add overhead. The following code will perform better:
Flux.range(1, 10)
    .elapsed()
    .subscribe(i -> System.out.printf(" %s ", i));
Now consider this:
// Simulates a blocking call by sleeping for the given number of milliseconds.
public static <T> T someMethodThatBlocks(T i, int ms) {
    try {
        Thread.sleep(ms);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag instead of swallowing it
    }
    return i;
}
// some method here
Flux.range(1, 10)
    .parallel(4)
    .runOn(Schedulers.parallel())
    .map(i -> someMethodThatBlocks(i, 200))
    .sequential()
    .elapsed()
    .subscribe(i -> System.out.printf(" %s ", i));
The output is similar to:
[210,3] [5,1] [0,2] [0,4] [196,6] [0,8] [0,5] [4,7] [196,10] [0,9]
As you can see, the first response came in after about 210 ms, followed by three responses with near-zero elapsed time in between, and the cycle repeats. This is where you should be using a parallel flux. Note that creating more threads doesn't guarantee better performance: with too many threads, context switching adds a lot of overhead, so the code should be benchmarked well before deployment. If there are a lot of blocking calls, having more than one thread per CPU core may give you a performance boost, but if the calls are CPU-intensive then more than one thread per core will slow things down due to context switching.
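For illustration, a hedged sketch of how the rail count and scheduler might be matched to the workload (the multipliers are placeholders you would need to benchmark; someMethodThatBlocks is the helper defined above):

import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

int cores = Runtime.getRuntime().availableProcessors();

// Blocking-heavy work: more rails than cores can pay off, on a scheduler meant for blocking calls.
Flux.range(1, 100)
    .parallel(cores * 4)
    .runOn(Schedulers.boundedElastic())
    .map(i -> someMethodThatBlocks(i, 200))
    .sequential()
    .subscribe();

// CPU-bound work: one rail per core on the parallel scheduler is usually the ceiling.
Flux.range(1, 100)
    .parallel(cores)
    .runOn(Schedulers.parallel())
    .map(i -> i * i)
    .sequential()
    .subscribe();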
So all in all, it always depends on what you want to achieve.

Worth stating that I'm assuming the context here is WebFlux rather than Reactor in general (since the question is tagged as such). The recommendations can of course vary wildly if we're talking about a generalised Reactor use case without considering WebFlux.
For fully non-blocking, end-to-end reactive calls, is it recommended to explicitly call publishOn or subscribeOn to switch schedulers?
The general recommendation is not to explicitly call those methods unless you have a reason to do so. (There's nothing wrong with using them in the correct context, but doing so "just because" will likely bring no benefit.)
For either CPU-intensive or non-CPU-intensive tasks, is it favorable to always use a parallel flux to optimize performance?
It depends on what you're aiming to achieve, and what you mean by "CPU consuming" (or CPU-intensive) tasks. Note that here I'm talking about genuinely CPU-intensive tasks rather than blocking code - and in that case, I'd ideally farm the CPU-intensive part out to another microservice, enabling you to scale it as required, separately from your WebFlux service.
Using a parallel flux (and running it on the parallel scheduler) should use all available cores to process data - which may well result in it being processed faster. But bear in mind you also have, by default, an event loop running for each core, so you've essentially "stolen" some available capacity from the event loop in order to achieve that. Whether that is ideal depends on your use-case, but usually it won't bring much benefit.
Instead, there are two approaches I'd recommend:
If you can break your CPU-intensive task down into small, low-intensity chunks, do so - then you can keep it on the event loop. This allows the event loop to keep running in a timely fashion while these CPU-intensive chunks are scheduled as capacity allows.
If you can't break it down, spin up a separate Scheduler (optionally with a low priority so it's less likely to steal resources from the event loop), and farm all your CPU-intensive tasks out to that, as in the sketch below. This has the disadvantage of creating a bunch more threads, but again keeps the event loop free. By default you'll have as many threads as there are cores for the event loop - you may wish to reduce that in order to give your "CPU-intensive" scheduler more cores to play with.
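A minimal sketch of that second option, assuming a dedicated scheduler for the CPU-heavy work (the scheduler name, size and the dummy computation are illustrative; lowering thread priority would additionally need a custom ThreadFactory or ExecutorService):

import reactor.core.publisher.Mono;
import reactor.core.scheduler.Scheduler;
import reactor.core.scheduler.Schedulers;

// Dedicated scheduler for CPU-heavy work, deliberately smaller than the core count
// so the WebFlux event loop keeps some capacity for itself.
Scheduler cpuScheduler = Schedulers.newParallel("cpu-heavy", 2);

Mono<Long> result = Mono.fromCallable(() -> {
        long sum = 0;                                  // stand-in for the real CPU-intensive task
        for (long i = 0; i < 1_000_000_000L; i++) sum += i;
        return sum;
    })
    .subscribeOn(cpuScheduler);                        // execute the callable on the dedicated scheduler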

Related

Splitting work between multiple pollers?

My current setup on the server side is like this -- I have a manager (with a poller) which waits for incoming requests for work to do. Once something is received it creates a worker (with a separate poller and separate ports/sockets) for the job, and from then on the worker communicates directly with the client.
What I observe is that when there is intense traffic on any of the workers, it stalls the manager somewhat -- ReceiveReady events are fired with significant delays.
NetMQ documentation states "Receiving messages with poller is slower than directly calling Receive method on the socket. When handling thousands of messages a second, or more, poller can be a bottleneck." I am so far below this limit (say 100 messages in a row) but I wonder whether having multiple pollers in single program does not clip performance even further.
I prefer having separate instances because the code is cleaner (separation of concerns), but maybe I am going against the principles of ZeroMQ? The question is -- is using multiple pollers in a single program wise, performance-wise? Or the reverse -- do multiple pollers starve each other by design?
Professional system analysis may even require you to run multiple Poller() instances:
Design system based on facts and requirements, rather than to listen to some popularised opinions.
Implement performance benchmarks and measure details about actual implementation. Comparing facts against thresholds is a.k.a. a Fact-Based-Decision.
If not hunting for the last few hundreds of [ns], a typical scenario may look this way:
your core logic inside an event-responding loop has to handle several classes of ZeroMQ-integrated signalling / messaging inputs/outputs, all in a principally non-blocking mode, plus your design has to spend a specific amount of relative attention on each such class.
One may accept somewhat higher inter-process latencies for a remote keyboard ( running a CLI interface "across" a network ), while your event loop has to meet a strict requirement not to miss any "fresh" update from a QUOTE stream. So one has to create a lightweight real-time scheduler logic, which introduces one high-priority Poller() with non-blocking ( zero-wait ) polls, another with a ~ 5 ms timeout for reading the "slow" channels, and another with a 15 ms timeout for reading the main data-flow pipe. If you have profiled your event-handling routines to last no more than 5 ms in the worst case, you can still meet a TAT of 25 ms, and your event loop can serve systems that require a stable 40 Hz control-loop cycle.
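A rough sketch of such a prioritised loop, written in Java with the JeroMQ bindings purely for illustration (endpoints, socket types, handlers and timeouts are hypothetical):

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class PrioritisedLoop {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            ZMQ.Socket quotes = ctx.createSocket(SocketType.SUB);   // high-priority QUOTE stream
            quotes.connect("tcp://localhost:5556");
            quotes.subscribe("".getBytes());

            ZMQ.Socket slow = ctx.createSocket(SocketType.PULL);    // lower-priority "slow" channel
            slow.connect("tcp://localhost:5557");

            ZMQ.Poller fast = ctx.createPoller(1);
            fast.register(quotes, ZMQ.Poller.POLLIN);
            ZMQ.Poller relaxed = ctx.createPoller(1);
            relaxed.register(slow, ZMQ.Poller.POLLIN);

            while (!Thread.currentThread().isInterrupted()) {
                if (fast.poll(0) > 0 && fast.pollin(0)) {            // zero-wait check first
                    handleQuote(quotes.recvStr(ZMQ.DONTWAIT));
                }
                if (relaxed.poll(5) > 0 && relaxed.pollin(0)) {      // ~5 ms budget for the slow channel
                    handleCommand(slow.recvStr(ZMQ.DONTWAIT));
                }
            }
        }
    }

    static void handleQuote(String msg)   { /* keep this under the profiled worst-case budget */ }
    static void handleCommand(String msg) { /* lower-priority work */ }
}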
Not using a set of several "specialised" pollers will not let one reach this level of scheduling determinism with an easily expressed core logic to integrate into such principally stable control loops.
Q.E.D.
I use a similar design to drive heterogeneous distributed systems for FOREX trading, based on external AI/ML predictors, where transaction times are kept under ~ 70 ms ( end-to-end TAT, from a QUOTE arrival to an AI/ML-advised XTO order instruction being submitted ), precisely because of the need to match the real-time constraints of the control-loop scheduling requirements.
Epilogue:
If the documentation says something about poller performance in the range above 1 kHz of signal delivery, but does not mention anything about the duration of the signal/message handling process, it does the public a poor service.
The first step to take is to measure the process latencies; next, analyse the performance envelopes. All ZeroMQ tools are designed to scale, and so should the application infrastructure -- so forget about any SLOC-sized examples; the bottleneck is not the poller instance, but a poor application use of the available ZeroMQ components ( given a known performance envelope was taken into account ). One can always increase the overall processing capacity available; with ZeroMQ we are in a distributed-systems realm from Day 0, aren't we?
So in concisely designed + monitored + adaptively scaled systems no choking will appear.

Benefits of green threads vs a simple loop

Is there any benefit to using green threads / lightweight threads over a simple loop or sequential code, assuming only non-blocking operations are used in both?
for i := 0; i < 5; i++ {
    go doSomethingExpensive() // using golang example
}
// versus
for i := 0; i < 5; i++ {
    doSomethingExpensive()
}
As far as I can think of
- green threads help avoid a bit of callback hell on async operations
- allow scheduling of M green threads on N kernel threads
But
- add a bit of complexity and performance overhead by requiring a scheduler
- easier cross-thread communication when the language supports it and the execution is split across different CPUs (otherwise sequential code is simpler)
No, green threads have no performance benefit at all.
If the threads are performing non-blocking operations:
Multiple threads have no benefit if you have only one physical core (since the same core has to execute everything, threads only make things slower because of the overhead)
Up to as many threads as you have CPU cores, there is a performance benefit, since multiple cores can execute your threads physically in parallel (see the Play! framework)
Green threads have no benefit, since they are run from the same single real thread by a sub-scheduler, so effectively green threads == 1 thread
If the threads are performing blocking operations, things may look different:
multiple threads make sense, since one thread can be blocked but the others can go on, so blocking slows down only one thread
you can avoid callback hell by just implementing your partially blocking process as one thread. Since you're free to block in that thread while e.g. waiting for IO, you get much simpler code (see the sketch below).
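A minimal Java sketch of that idea - one plain thread running the blocking steps top to bottom (file names are placeholders):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// One worker thread runs the "partially blocking" process sequentially:
// each step may block, but the code reads top to bottom with no callbacks.
Thread worker = new Thread(() -> {
    try {
        List<String> lines = Files.readAllLines(Path.of("input.txt"));   // blocking read
        String joined = String.join(System.lineSeparator(), lines).toUpperCase();
        Files.writeString(Path.of("output.txt"), joined);                // blocking write
    } catch (Exception e) {
        e.printStackTrace();
    }
});
worker.start();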
Green threads
Green threads are not real threads by design, so they won't be split amongst multiple CPUs and are not intended to work in parallel. This can give a false understanding that you can avoid synchronization - however, once you upgrade to real threads, the lack of proper synchronization will introduce a good set of issues.
Green threads were widely used in the early Java days, when the JVM did not support real OS threads. A variant of green threads, called fibers, is part of the Windows operating system, and e.g. MS SQL Server uses them heavily to handle various blocking scenarios without the heavy overhead of using real threads.
You can choose not only amongst green threads and real threads, but may also consider continuations (https://www.playframework.com/documentation/1.3.x/asynchronous)
Continuations give you the best of both worlds:
your code logically looks as if it were linear code, with no callback hell
in reality the code is executed by real threads, however if a thread is getting blocked it suspends its execution and can switch to executing other code. Once the blocking condition signals, the thread can switch back and continue your code.
This approach is quite resource-friendly. The Play! framework uses as many threads as you have CPU cores (4-8) yet beats high-end Java application servers in terms of performance.

Goroutines 8 KB and Windows OS threads 1 MB

As a Windows user, I know that OS threads consume ~1 MB of memory, because by default Windows allocates 1 MB of memory for each thread's user-mode stack. How does golang get away with ~8 KB of memory for each goroutine, if an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?
Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead, multiple goroutines are (or may be) multiplexed onto the same OS threads, so if one should block (e.g. while waiting for I/O or on a blocking channel operation), the others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that in the implementation current at the time of writing, by default only 1 thread was used to execute goroutines (since Go 1.5 the default equals the number of CPU cores).
1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's. The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In pseudo-code:
loop forever
    take job from queue
    execute job
end loop
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can also be stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
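A tiny Java sketch of the simplest form of that loop - one OS thread draining a queue of jobs (purely illustrative; a real green-thread scheduler would also save and restore per-task context and handle yields):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A single OS thread that drains a queue of jobs: the simplest possible "scheduler".
BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();

Thread scheduler = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            jobs.take().run();   // "take job from queue; execute job"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
});
scheduler.start();

jobs.offer(() -> System.out.println("task 1"));
jobs.offer(() -> System.out.println("task 2"));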
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets"), long before Go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there's Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than separate dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, that would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is what a typical C# asynchronous method might look like:
async Task<string> GetData()
{
    var httpClient = new HttpClient();
    var html = await httpClient.GetStringAsync("http://www.google.com");
    var parsedStructure = Parse(html);
    var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
    return dbData.First().Description;
}
From point of view of the GetData method, the entire processing is synchronous - it's just as if you didn't use the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues in await and in blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this respect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that "waits" for the response is now ready to resume execution; when it actually gets a slot, it will continue on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there are quite a few differences between C#'s approach and Go's, but they're not all that huge.
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
Goroutines are what we call green threads. They are not OS threads; the Go scheduler is responsible for them. This is why they can have much smaller memory footprints.

Is it advisable to use the Windows API 'SetThreadPriority' within a parallel_for_each loop

I would like to reduce the thread-priority of the threads servicing a parallel_for_each, because under heavy load conditions they consume too much processor time relative to other threads in my system.
Questions:
1) Do the servicing threads of a parallel_for_each inherit the thread-priority of the calling thread? In this case I could presumably call SetThreadPriority before and after the parallel_for_each, and everything should be fine.
2) Alternatively is it advisable to call SetThreadPriority within the parallel_for_each? This will clearly invoke the API multiple times for the same threads. Is there a large overhead of doing this?
2.b) Assuming that I do this, will it affect thread-priorities the next time that parallel_for_each is called - ie do I need to somehow reset the priority of each thread afterwards?
3) I'm wondering about thread-priorities in general. Would anyone like to comment: supposing that I had 2 threads contending for a single processor and one was set to "below-normal" while the other was "normal" priority. Roughly what percentage more processor time would the one thread get compared to the other?
All threads initially start at THREAD_PRIORITY_NORMAL. So you'd have to reduce the priority of each thread. Or reduce the priority of the owning process.
There is little overhead in calling SetThreadPriority. Once you have woken up a thread, the additional cost of calling SetThreadPriority is negligible. Once you set the thread's priority it will remain at that value until changed.
Suppose that you have one processor, and two threads ready to run. The scheduler will always choose to run the thread with the higher priority. This means that in your scenario, the below normal threads would never run. In reality, there's a lot more to scheduling than that. For example priority inversion. However, you can think of it like this. If all processors are busy with normal priority threads, then expect lower priority threads to be starved of CPU.

Multiple threads and performance on a single CPU

Is there any performance benefit to using multiple threads on a computer with a single CPU that does not have hyperthreading?
In terms of speed of computation, No. In fact things will slow down due to the overhead of managing the threads.
In terms of responsiveness, yes. You can for example have one thread wait on an IO operation and have another run a GUI at the same time.
It depends on your application. If it spends all its time using the CPU, then multithreading will just slow things down - though you may be able to use it to be more responsive to the user and thus give the impression of better performance.
However, if your code is limited by other things, for example using the file system, the network, or any other resource, then multithreading can help, since it allows your application to behave asynchronously. So while one thread is waiting for a file to load from disk, another can be querying a remote webserver and another redrawing the GUI, while another is doing various calculations.
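A small Java sketch of that overlap on a single core, with the I/O simulated by a sleep (timings and sizes are placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class IoVsCpu {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // "I/O" task: spends most of its time blocked, so it barely uses the single CPU.
        Future<String> io = pool.submit(() -> {
            Thread.sleep(500);                 // stand-in for a slow disk or network read
            return "file contents";
        });

        // CPU task: keeps the core busy while the I/O task is blocked.
        Future<Long> cpu = pool.submit(() -> {
            long sum = 0;
            for (long i = 0; i < 100_000_000L; i++) sum += i;
            return sum;
        });

        System.out.println(io.get() + " / " + cpu.get());
        pool.shutdown();
    }
}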
Working with multiple threads can also simplify your business logic, since you don't have to pay so much attention to how various independent tasks need to interleave. If the operating system's scheduling logic is better than yours, then you may indeed see improved performance.
You can consider using multithreading on a single CPU
If you use network resources
If you do high-intensive IO operations
If you pull data from a database
If you exploit other stuff with possible delays
If you want your app to react very quickly
When you should not use multithreading on a single CPU
High-intensive operations which do almost 100% CPU usage
If you are not sure how to use threads and synchronization
If your application cannot be divided into several parallel processes
Yes, there is a benefit of using multiple threads (or processes) on a single CPU - if one thread is busy waiting for something, others can continue doing useful work.
However this can be offset by the overhead of task switching. You'll have to benchmark and/or profile your application on production grade hardware to find out.
Regardless of the number of CPUs available, if you require preemptive multitasking and/or applications with asynchronous components (i.e. pretty much anything that combines a responsive GUI with a non-trivial amount of computation or continuous I/O processing), multithreading performs much better than the alternative, which is to use multiple processes for each application.
This is because threads in the same process can exchange data much more efficiently than can multiple processes, because they share the same memory context.
See this Wikipedia article on computer multitasking for a fairly concise discussion of these issues.
Absolutely! If you do any kind of I/O, there is a great advantage to having a multithreaded system. While one thread waits for an I/O operation (which is relatively slow), another thread can do useful work.
