Is there a module to measure asyncio event loop metrics? Or, for the asyncio event loop, which metrics should we monitor for performance analysis purposes?
e.g.
How many tasks are in the event loop?
How many tasks are in a waiting state?
I'm not trying to measure the coroutine functions themselves. aiomonitor has this functionality, but it's not exactly what I need.
I doubt that the number of pending tasks or a task summary will tell you much. Let's say you have 10,000 tasks and 8,000 of them are pending: is that a lot, or is it not? Who knows.
The thing is, each asyncio task (or any other Python object) can consume a different amount of different machine resources.
Instead of trying to monitor asyncio-specific objects, I think it's better to monitor general machine metrics (a minimal collection sketch follows this list):
CPU usage
RAM usage
Network I/O (if you're dealing with it)
Hard drive I/O (if you're dealing with it)
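A minimal sketch of collecting those general metrics, assuming the third-party psutil package is installed:

import psutil

def snapshot() -> dict:
    # general machine metrics rather than asyncio-specific ones
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),   # CPU usage
        "ram_percent": psutil.virtual_memory().percent,  # RAM usage
        "net_io": psutil.net_io_counters(),              # bytes/packets sent and received
        "disk_io": psutil.disk_io_counters(),            # read/write counts and bytes
    }

print(snapshot())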
As for asyncio itself, you should probably always use asyncio.Semaphore to limit the maximum number of concurrently running jobs, and implement a convenient way to change the value of the semaphore(s) (for example, through a config file).
That lets you alter the workload on a given machine depending on its available and actually utilized resources.
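A minimal sketch of that idea, where MAX_CONCURRENT_JOBS stands in for a config-driven limit and the sleep stands in for the real work:

import asyncio

MAX_CONCURRENT_JOBS = 10  # in practice, read this from a config file

async def run_job(sem: asyncio.Semaphore, job_id: int) -> None:
    async with sem:             # at most MAX_CONCURRENT_JOBS jobs run at once
        await asyncio.sleep(1)  # placeholder for the real work

async def main() -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT_JOBS)
    await asyncio.gather(*(run_job(sem, i) for i in range(100)))

asyncio.run(main())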
Upd:
My question: will asyncio still accept new connections during this block?
If your event loop is blocked by some CPU-bound calculation, asyncio will only start to process new connections later, when the event loop is free again (assuming they haven't timed out by then).
You should always avoid freezing the event loop. An event loop frozen anywhere means that every task everywhere in your code is frozen too! Any kind of loop freezing defeats the whole point of the asynchronous approach, regardless of the number of tasks, and any code that freezes the event loop will have performance issues.
As you noted, you can use a ProcessPoolExecutor with run_in_executor to await CPU-bound work; you can also use a ThreadPoolExecutor to avoid freezing the loop.
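A minimal sketch of that approach; cpu_bound below is just a placeholder for the real CPU-heavy calculation:

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # placeholder for a heavy CPU-bound calculation
    return sum(i * i for i in range(n))

async def main() -> None:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # the event loop keeps serving other tasks and connections while this runs
        result = await loop.run_in_executor(pool, cpu_bound, 10_000_000)
    print(result)

if __name__ == "__main__":  # guard needed because ProcessPoolExecutor may spawn new processes
    asyncio.run(main())

Swapping ProcessPoolExecutor for ThreadPoolExecutor also keeps the loop from freezing, though for pure-Python CPU-bound work the GIL still limits how much it helps.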
I was reading about how Go internally manages newly created goroutines in an application, and I came to know that the runtime scheduler uses two queues to manage the created goroutines.
Global run queue: all newly created goroutines are placed in this queue.
Local run queue: goroutines that are about to run are allocated to a local run queue, and from there the scheduler assigns them to an OS thread.
So my question is: why does the scheduler use two queues to manage goroutines? Why can't it just use the global run queue and map goroutines from there onto OS threads?
First, please note that this blog is an unofficial and old source, so the information in it shouldn't be taken as totally accurate with respect to the current version of Go (or any version, for that matter). You can still learn from it, but the Go scheduler has improved over time, which can make the information out of date. For example, the blog says "the Go scheduler is not a preemptive scheduler but a cooperating scheduler". As of Go 1.14, this is no longer true, since preemption was added to the runtime. As for the other information, I won't vouch for its accuracy, but here's an explanation of what the blog says.
Reading the blog post:
There are two different run queues in the Go scheduler: the Global Run Queue (GRQ) and the Local Run Queue (LRQ). Each P is given a LRQ that manages the Goroutines assigned to be executed within the context of a P. These Goroutines take turns being context-switched on and off the M assigned to that P. The GRQ is for Goroutines that have not been assigned to a P yet. There is a process to move Goroutines from the GRQ to a LRQ that we will discuss later.
This means the GRQ is for Goroutines that haven't been assigned to run yet, while the LRQ is for Goroutines that have been assigned to a P or have already begun executing. Each Goroutine will start on the GRQ and join an LRQ later to begin executing.
Here is the process that the previous quote was referencing, where Goroutines are moved from the GRQ to an LRQ:
In figure 10, P1 has no more Goroutines to execute. But there are Goroutines in a runnable state, both in the LRQ for P2 and in the GRQ. This is a moment where P1 needs to steal work. The rules for stealing work are as follows.
runtime.schedule() {
    // only 1/61 of the time, check the global runnable queue for a G.
    // if not found, check the local queue.
    // if not found,
    //     try to steal from other Ps.
    //     if not, check the global runnable queue.
    //     if not found, poll network.
}
This means a P will prioritize running goroutines from its own LRQ, then from other Ps' LRQs, then from the GRQ, then from network polling. There is also a small chance (1/61 of the time) that a Goroutine is taken from the GRQ straight away. Having multiple queues is what allows this priority system to be constructed.
Why do we want priority in which goroutines get run? It may have various performance benefits. For example, it could make better use of the CPU cache. If you run a Goroutine that was already running recently, it's more likely that the data it's working with is still in the CPU cache, making it fast to access. When you start up a new Goroutine, it may use or create data that isn't in the cache yet. That data will then enter the cache and could evict the data being used by another Goroutine, which in turn causes that Goroutine to be slower when it resumes again. In the pathological case, this is called cache thrashing, and greatly reduces the effective speed of the processor.
Allowing the CPU cache to work effectively can be one of the most important factors in achieving high performance on modern processors, but it's not the only reason to have such a queue system. In general, the more logical processes that are running at the same time (such as Goroutines in a Go program), the more resource contention will occur. This is because the resources used by a process tend to be fairly stable over the runtime of the process. In other words, starting a new process tends to increase the overall resource load, continuing an already started process tends to maintain the resource load, and finishing a process tends to reduce the resource load. Therefore, prioritizing already running processes over new ones tends to help keep the resource load in a manageable range.
It's analogous to the practical advice of "finish what you started". If you have a lot of tasks to accomplish, it's more effective to complete them one at a time, or to multitask just a handful of things at most. If you just keep starting new tasks and never finish the previous ones, eventually you have so many things going on at the same time that you feel overwhelmed.
My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.
I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
But the speedup is slight, and the reason is that some of the time, one of the worker threads takes way longer than usual. I have instrumented the operation, and I see that sometimes the worker threads take a long time to start - delay varies from 20 microseconds on average to dozens of milliseconds depending on system load. The master thread does not show this delay.
That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.
But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?
I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher than the priority of any running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
I guess one last possibility is that the task switch might take 20 usec. Is that plausible?
I have a 4-core system without hyperthreading.
I'm new to Project Reactor and trying to understand the difference between the boundedElastic() and parallel() schedulers. The documentation says that boundedElastic() is used for blocking tasks and parallel() for non-blocking tasks.
Why does Project Reactor need to address blocking scenarios at all, given that reactive pipelines are non-blocking in nature? Can someone please help me out with some real-world use cases for the boundedElastic() vs parallel() schedulers?
The parallel flavor is backed by N workers (where N is the number of CPUs), each based on a ScheduledExecutorService. If you submit N long-lived tasks to it, no more work can be executed, hence its affinity for short-lived tasks.
The elastic flavor is also backed by workers based on ScheduledExecutorService, except it creates these workers on demand and pools them.
boundedElastic is the same as elastic; the difference is that you can limit the total number of threads.
https://spring.io/blog/2019/12/13/flight-of-the-flux-3-hopping-threads-and-schedulers
TL;DR
Reactor executes non-blocking/async tasks on a small number of threads. If a task blocks, its thread is blocked and all other tasks scheduled on that thread have to wait for it.
parallel should be used for fast non-blocking operations (the default option)
boundedElastic should be used to "offload" blocking tasks
In general, the Reactor API is concurrency-agnostic and uses the Schedulers abstraction to execute tasks. Schedulers have responsibilities very similar to an ExecutorService.
Schedulers.parallel()
This should be the default option, used for fast non-blocking operations on a small number of threads. By default, the number of threads is equal to the number of CPU cores. It can be controlled by the reactor.schedulers.defaultPoolSize system property.
Schedulers.boundedElastic()
Used to execute longer (blocking) operations as part of a reactive flow. It uses a thread pool with a default of CPU cores x 10 threads (controlled by the reactor.schedulers.defaultBoundedElasticSize system property) and a default queue size of 100,000 deferred tasks per thread (reactor.schedulers.defaultBoundedElasticQueueSize).
subscribeOn or publishOn can be used to change the scheduler.
The following code shows how to wrap a blocking operation (blockingOperation() is a placeholder for your own blocking call):
Mono.fromCallable(() -> {
    return blockingOperation(); // placeholder for the actual blocking call
}).subscribeOn(Schedulers.boundedElastic()); // run on a separate scheduler because the code is blocking
Schedulers.newBoundedElastic()
Similar to Schedulers.boundedElastic(), but useful when you need to create a separate thread pool for a particular operation.
Sometimes it's not obvious which code is blocking. One very useful tool for testing reactive code is BlockHound.
Schedulers provides various Scheduler flavors usable with publishOn or subscribeOn:
1) parallel(): Optimized for fast Runnable non-blocking executions
2) single(): Optimized for low-latency Runnable one-off executions
3) elastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) can grow indefinitely
4) boundedElastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) is capped
5) fromExecutorService(ExecutorService): to create new instances around Executors
https://projectreactor.io/docs/core/release/api/reactor/core/scheduler/Schedulers.html
I'm trying to get a bit of parallelism in my app to decrease the amount of time some operations take. I noticed that Parse.Promise.when() seems to evaluate promises in parallel, but there seems to be no limit to how many promises it tries to evaluate in parallel. Is that right?
In this particular example I'm trying to do something to 1500 records. If I use .when, it looks like it's trying to make 1500 connections to the Parse API and it seems to be failing somewhere. But when I do these 1500 operations in series, it seems to take forever.
How do you guys deal with this kind of problem?
One way I thought of to deal with this kind of problem might be to modify Parse.Promise.when() so that when I call it, I can specify the level of parallelism I need. e.g. Parse.Promise.when(promises, 10)
Thanks
No, there is not. when does not "evaluate" or "call" promises; it just waits for already existing promises whose tasks have been running since you created them. It's the same as with Promise.all.
Have a look at Limit concurrency of promise being run on how to deal with calling an asynchronous function multiple times.
I am creating a server to send data to many persistent sockets. I have chosen the Reactor design pattern, which suggests having multiple threads to send data along the sockets.
I cannot understand which is better:
- to have one thread send all of the data to the sockets, or
- to have a couple of threads send data across the sockets.
The way I see it, I have 2 cores, so I can only do two things at once. Which would mean I have 1 worker thread and 1 thread to send data?
Why would it be better to have multiple threads to send data when you suffer context switching between the threads?
See the documentation on thttpd for why single-threaded non-blocking I/O is good. Indeed, it makes good sense for static files.
If you are doing CGI, however, you may have a script that runs for a long time. It's nicer not to hold up all the quicker, simpler traffic, especially if the script has an infinite-loop bug in it and is eventually going to be killed anyway! With threads, the average response time experienced by users will be better if some of the requests are very time-consuming.
If the files being served come from disk and are not in main memory already, a similar argument can be used.