It seems to me that one OpenCL command queue won't dispatch commands to more than one hardware queue. So commands in an out-of-order command queue are still executed one by one, just not in the order they were enqueued?
So if I want to make use of multiple hardware queues, all I can do is create multiple OpenCL command queues?
Out-of-order (OOO) queues exist mainly to serve user-event dependencies. In that kind of application, a single in-order queue can end up blocked waiting for a user event that never comes, and creating one queue per job is not optimal either.
If you want parallel execution, OOO is NOT what you need; multiple queues are.
A common approach is to use one queue for IO and another queue for running kernels.
But you can also use a queue per thread in a multi-threaded processing scheme; the IO of each thread will then overlap the kernel execution of the other threads.
NOTE: nVIDIA does support parallel execution of jobs in a single queue, but that is outside the standard.
Related
I was reading about how Go internally manages newly created goroutines, and I learned that the runtime scheduler uses two queues to manage them:
Global run queue: all newly created goroutines are placed on this queue.
Local run queue: goroutines that are about to run are placed on a local run queue, and from there the scheduler assigns them to an OS thread.
So my question is: why does the scheduler use two queues to manage goroutines? Why can't it just use the global run queue and map goroutines to OS threads from there?
First, please note that this blog is an unofficial and old source, so the information in it shouldn't be taken as totally accurate with respect to the current version of Go (or any version, for that matter). You can still learn from it, but the Go scheduler has improved over time, which can make the information out of date. For example, the blog says "Go scheduler is not a preemptive scheduler but a cooperating scheduler". As of Go 1.14, this is no longer true, as preemption was added to the runtime. As for the other information, I won't vouch for its accuracy, but here's an explanation of what they say.
Reading the blog post:
There are two different run queues in the Go scheduler: the Global Run Queue (GRQ) and the Local Run Queue (LRQ). Each P is given a LRQ that manages the Goroutines assigned to be executed within the context of a P. These Goroutines take turns being context-switched on and off the M assigned to that P. The GRQ is for Goroutines that have not been assigned to a P yet. There is a process to move Goroutines from the GRQ to a LRQ that we will discuss later.
This means the GRQ is for Goroutines that haven't been assigned to run yet, while the LRQ is for Goroutines that have been assigned to a P to run or have already begun executing. Each Goroutine starts on the GRQ and joins an LRQ later to begin executing.
Here is the process that the previous quote was referencing, where Goroutines are moved from the GRQ to LRQ:
In figure 10, P1 has no more Goroutines to execute. But there are Goroutines in a runnable state, both in the LRQ for P2 and in the GRQ. This is a moment where P1 needs to steal work. The rules for stealing work are as follows.
runtime.schedule() {
    // only 1/61 of the time, check the global runnable queue for a G.
    // if not found, check the local queue.
    // if not found,
    //     try to steal from other Ps.
    //     if not, check the global runnable queue.
    //     if not found, poll network.
}
This means a P will prioritize running Goroutines from its own LRQ, then from other Ps' LRQs, then from the GRQ, then from network polling. There is also a small chance to run a Goroutine from the GRQ immediately. Having multiple queues is what allows this priority system to be constructed.
Why do we want priority in which goroutines get run? It may have various performance benefits. For example, it could make better use of the CPU cache. If you run a Goroutine that was already running recently, it's more likely that the data it's working with is still in the CPU cache, making it fast to access. When you start up a new Goroutine, it may use or create data that isn't in the cache yet. That data will then enter the cache and could evict the data being used by another Goroutine, which in turn causes that Goroutine to be slower when it resumes again. In the pathological case, this is called cache thrashing, and greatly reduces the effective speed of the processor.
Allowing the CPU cache to work effectively can be one of the most important factors in achieving high performance on modern processors, but it's not the only reason to have such a queue system. In general, the more logical processes are running at the same time (such as Goroutines in a Go program), the more resource contention will occur. This is because the resources used by a process tend to be fairly stable over its runtime. In other words, starting a new process tends to increase the overall resource load, continuing an already started process tends to maintain that load, and finishing a process tends to reduce it. Therefore, prioritizing already running processes over new ones tends to keep the resource load in a manageable range.
It's analogous to the practical advice of "finish what you started". If you have a lot of tasks to accomplish, it's more effective to complete them one at a time, or to multitask just a handful of things if you can. If you just keep starting new tasks and never finish the previous ones, eventually you have so many things going on at once that you feel overwhelmed.
I'm new to Project Reactor and trying to understand the difference between the boundedElastic() and parallel() schedulers. The documentation says that boundedElastic() is used for blocking tasks and parallel() for non-blocking tasks.
Why does Project Reactor need to address blocking scenarios at all, given that it is non-blocking by nature? Can someone please help me out with some real-world use cases for the boundedElastic() vs parallel() scheduler?
The parallel flavor is backed by N workers (N being the number of CPUs), each based on a ScheduledExecutorService. If you submit N long-lived tasks to it, no more work can be executed, hence the affinity for short-lived tasks.
The elastic flavor is also backed by workers based on ScheduledExecutorService, except that it creates these workers on demand and pools them.
boundedElastic is the same as elastic; the difference is that you can limit the total number of threads.
https://spring.io/blog/2019/12/13/flight-of-the-flux-3-hopping-threads-and-schedulers
TL;DR
Reactor executes non-blocking/async tasks on a small number of threads. If a task blocks, its thread is blocked and all other tasks scheduled on it have to wait.
parallel should be used for fast non-blocking operations (the default option)
boundedElastic should be used to "offload" blocking tasks
In general, the Reactor API is concurrency-agnostic and uses the Schedulers abstraction to execute tasks. A Scheduler has responsibilities very similar to an ExecutorService.
Schedulers.parallel()
Should be the default option, used for fast non-blocking operations on a small number of threads. By default, the number of threads is equal to the number of CPU cores; it can be controlled via the reactor.schedulers.defaultPoolSize system property.
Schedulers.boundedElastic()
Used to execute longer (blocking) operations as part of a reactive flow. It uses a thread pool with a default cap of number of CPU cores x 10 threads (controlled by reactor.schedulers.defaultBoundedElasticSize) and a default queue size of 100,000 tasks per thread (reactor.schedulers.defaultBoundedElasticQueueSize).
subscribeOn or publishOn can be used to change the scheduler.
The following code shows how to wrap a blocking operation:
Mono.fromCallable(() -> {
    // blocking operation (e.g. a JDBC query or file read);
    // blockingCall() is a hypothetical stand-in
    return blockingCall();
}).subscribeOn(Schedulers.boundedElastic()); // run on a separate scheduler because the code blocks
Schedulers.newBoundedElastic()
Similar to Schedulers.boundedElastic(), but useful when you need to create a separate thread pool for some operation.
Sometimes it's not obvious which code is blocking. One very useful tool for testing reactive code is BlockHound.
Schedulers provides various Scheduler flavors usable by publishOn or subscribeOn:
1) parallel(): Optimized for fast Runnable non-blocking executions
2) single(): Optimized for low-latency Runnable one-off executions
3) elastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) can grow indefinitely
4) boundedElastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) is capped
5) fromExecutorService(ExecutorService): to create new instances around Executors
https://projectreactor.io/docs/core/release/api/reactor/core/scheduler/Schedulers.html
I'm building a Go app which uses a "worker pool" of goroutines; initially I start the pool by creating a number of workers. I was wondering what the optimal number of workers would be on a multi-core processor, for example on a CPU with 4 cores. I'm currently using the following approach:
// init pool
numCPUs := runtime.NumCPU()
runtime.GOMAXPROCS(numCPUs + 1) // numCPUs hot threads + one for async tasks.
maxWorkers := numCPUs * 4
jobQueue := make(chan job.Job) // channel on which work requests are sent
module := Module{
    Dispatcher: job.NewWorkerPool(maxWorkers),
    JobQueue:   jobQueue,
    Router:     router,
}
module.Dispatcher.Run(jobQueue)
The complete implementation is under job.NewWorkerPool(maxWorkers) and module.Dispatcher.Run(jobQueue).
My use case for the worker pool: I have a service which accepts requests, calls multiple external APIs, and aggregates their results into a single response. Each call can be made independently of the others, as the order of the results doesn't matter. I dispatch the calls to the worker pool, where each call is performed asynchronously in an available goroutine. My "request" thread keeps listening to the return channels, fetching and aggregating results as soon as a worker is done. When all are done, the final aggregated result is returned as the response. Since each external API call may have a variable response time, some calls complete earlier than others. As I understand it, doing this in parallel should perform better than calling each external API synchronously, one after another.
The comments in your sample code suggest you may be conflating the two concepts of GOMAXPROCS and a worker pool. These two concepts are completely distinct in Go.
GOMAXPROCS sets the maximum number of CPU threads the Go runtime will use. This defaults to the number of CPU cores found on the system, and should almost never be changed. The only time I can think of to change this would be if you wanted to explicitly limit a Go program to use fewer than the available CPUs for some reason, then you might set this to 1, for example, even when running on a 4-core CPU. This should only ever matter in rare situations.
TL;DR: Never set runtime.GOMAXPROCS manually.
Worker pools in Go are a set of goroutines, which handle jobs as they arrive. There are different ways of handling worker pools in Go.
What number of workers should you use? There is no objective answer. Probably the only way to know is to benchmark various configurations until you find one that meets your requirements.
As a simple case, suppose your worker pool is doing something very CPU-intensive. In this case, you probably want one worker per CPU.
As a more likely example, though, let's say your workers are doing something more I/O-bound--such as reading HTTP requests, or sending email via SMTP. In this case, you may reasonably handle dozens or even thousands of workers per CPU.
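To make that concrete, here is a minimal worker pool sketch using only the standard library; the string job type and the handler body are hypothetical stand-ins, not the poster's job package:

package main

import (
    "fmt"
    "sync"
)

// startPool launches n workers that consume jobs from the channel until it is closed.
func startPool(n int, jobs <-chan string, wg *sync.WaitGroup) {
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for j := range jobs {
                fmt.Printf("worker %d handled %s\n", id, j) // hypothetical stand-in for real work
            }
        }(i)
    }
}

func main() {
    jobs := make(chan string)
    var wg sync.WaitGroup
    startPool(4, jobs, &wg) // the pool size is the knob to benchmark: try 4, 16, 64...

    for i := 0; i < 10; i++ {
        jobs <- fmt.Sprintf("job-%d", i)
    }
    close(jobs)
    wg.Wait()
}

Since the pool size is a single argument, trying different configurations is just a matter of rerunning with different values of n against your real workload.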
And then there's also the question of whether you should even use a worker pool. Most problems in Go do not require worker pools at all. I've worked on dozens of production Go programs, and never once used a worker pool in any of them. I've also written many times more one-time-use Go tools, and only used a worker pool maybe once.
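For contrast, here is a sketch of the asker's fan-out/aggregate use case with plain goroutines and no pool at all; the three closures are hypothetical stand-ins for the external API calls:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Hypothetical stand-ins for independent external API calls with variable latency.
    calls := []func() string{
        func() string { time.Sleep(30 * time.Millisecond); return "A" },
        func() string { time.Sleep(10 * time.Millisecond); return "B" },
        func() string { time.Sleep(20 * time.Millisecond); return "C" },
    }

    results := make(chan string, len(calls)) // buffered so no sender ever blocks
    for _, call := range calls {
        go func(f func() string) { results <- f() }(call)
    }

    // Aggregate results as they arrive, fastest first.
    var agg []string
    for range calls {
        agg = append(agg, <-results)
    }
    fmt.Println(agg) // e.g. [B C A]
}

One goroutine per task, a channel to collect results, and no pool to size or manage.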
And finally, the only way in which GOMAXPROCS and worker pools relate is the same way goroutines in general relate to GOMAXPROCS. From the docs:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
From this simple description, it's easy to see that there can be many more goroutines (potentially hundreds of thousands... or more) than GOMAXPROCS--GOMAXPROCS only limits how many "operating system threads can execute user-level Go code simultaneously"--goroutines which aren't executing user-level Go code at the moment don't count. And I/O-bound goroutines (such as those waiting for a network response) aren't executing code. So the theoretical maximum number of goroutines is limited only by your system's available memory.
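A small demo of that gap, using only the standard library: it parks 100,000 goroutines in sleeps (standing in for goroutines blocked on network I/O) while GOMAXPROCS stays at the core count:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 100000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            time.Sleep(time.Second) // stand-in for waiting on a network response
        }()
    }

    time.Sleep(100 * time.Millisecond)                 // give the goroutines time to start
    fmt.Println("goroutines:", runtime.NumGoroutine()) // ~100001
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))  // the core count, unchanged
    wg.Wait()
}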
I am using resque to background process two types of jobs:
(1) 3rd-party API requests
(2) DB query and insert
While the two job types can be processed in parallel with each other, each job type in itself can only be processed in serial order. For example, DB operations need to happen in serial order but can be executed in parallel with 3rd-party API requests.
I am contemplating either of the following methods for executing this :
(1) Having two queues, with one queue handling only API requests and the other handling only DB queries. Each queue has its own worker.
(2) One single queue but two workers, one worker for each job.
I would like to know the difference in the two approaches and which among the two would be a better approach to take.
Choosing such an architecture is not straightforward; you have to keep many things in mind.
Having two queues, with one queue handling only API requests and the other handling only DB queries. Each queue has its own worker.
Ans: This architecture works when both queues are roughly equally busy; if one queue has many jobs while the other is empty, one worker will sit idle while jobs wait in the other queue.
You should always think about full utilisation of your workers.
One single queue but two workers, one worker for each job.
Ans: This approach is what we use in our own project: all jobs are enqueued to one queue, with multiple workers running on it.
Your workers will always be busy no matter which type of job is present, so proper worker utilisation is possible.
In the end, I would suggest the 2nd approach.
I'm working on an application that is distributed over two JBoss instances and that produces/consumes JMS messages on several JMS queues.
When we configured the application we had to determine which threading model we would use, in particular the number of producing and consuming threads per queue. We did this in a rather ad-hoc fashion, but after reading the most recent columns by Herb Sutter in Dr. Dobb's (in particular this one) I would like to size our threads in a more rigorous manner.
Are there any methods/tools to measure the throughput of JMS queues (in particular JBoss Messaging queues) as a function of the number of producing/consuming threads?
This is not really about a specific tool, but may be helpful.
Consumers:
Not sure what your inner architecture is, but let's assume it's an MDB reading in messages. I assert that your only requirement here for rigorous thread count sizing is to choose a maximum cap. If your MDB uses resources from a finite supplier like a JDBC connection pool, consider the maximum cap as the highest number of concurrent instances from that resource that you can tolerate taking. If the MDB's queue is remote, you probably want to consider remote connections (or technically, JMS sessions) a finite resource. If the MDB has less finite requirements (and the queue is local), your maximum cap becomes the number of threads, memory used and/or flat out CPU consumed by the working threads. The reasoning here is that the JBoss MDB container will simply keep allocating more MDB instances (and therefore threads) until the queue is empty or the maximum cap is reached. The only reason I can think of that you would really agonize over the minimum would be if the container's elapsed time or overhead to create new instances is above your tolerance and those operations are usually pretty small potatoes.
Producers
A general axiom of messaging is that producers nearly always outperform consumers. You would think this is pretty arbitrary, but it is a pattern I see recurring all the time, even in widely different messaging scenarios. Anyways, it's tough to say how the threading should work for the producer without knowing a bit about the application, but are you basically capable of [indefinitely] proportionally increasing the number of producer threads and the number of messages generated, or do you have some sort of cap where additional threads simply do not generate more messages ? I would guess it is the latter since most useful work has some limited data or calculation supplier. As I see it, the two drivers here are ordering and persistence.
First off, if you have strict message ordering, where messages must be processed in strict First Produced First Processed (FPFP) order, then you're in a bit of a bind, because you almost have to drop down to single-threaded throughput unless you can devise some form of logical message demarcation (e.g. a client number where any given client's messages are always sent to the same queue; you may have multiple queues, each serviced by one thread, so each client is effectively FPFP).
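To sketch that demarcation idea concretely (in Go rather than JMS, purely as an illustration of the routing logic): hash each client ID onto one of N queues, each drained by a single worker, so per-client ordering is preserved while different clients proceed in parallel:

package main

import (
    "fmt"
    "hash/fnv"
    "sync"
)

type msg struct {
    clientID string
    body     string
}

func main() {
    const numQueues = 4
    queues := make([]chan msg, numQueues)
    var wg sync.WaitGroup

    for i := range queues {
        queues[i] = make(chan msg, 16)
        wg.Add(1)
        go func(id int, q <-chan msg) { // one worker per queue => FPFP within the queue
            defer wg.Done()
            for m := range q {
                fmt.Printf("worker %d: %s %s\n", id, m.clientID, m.body)
            }
        }(i, queues[i])
    }

    route := func(m msg) { // a given client always hashes to the same queue
        h := fnv.New32a()
        h.Write([]byte(m.clientID))
        queues[h.Sum32()%numQueues] <- m
    }

    for i := 0; i < 8; i++ {
        route(msg{clientID: fmt.Sprintf("client-%d", i%3), body: fmt.Sprintf("msg-%d", i)})
    }
    for _, q := range queues {
        close(q)
    }
    wg.Wait()
}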
Ordering aside, persistence is the next consideration: if you have reliable and extensive message persistence (or have a very high tolerance for message loss), just let the producer threads go to town. The messages will queue up reliably and eventually the consumers will [hopefully] catch up. However, if your persisted message counts or simple queue depths can potentially give you the willies when they get too high, here's where a tool might come in useful. If your producer thread count can be dynamically modified (as it can in many Java ThreadPool implementations), then you could sample the queue depths and raise or lower the producer thread count in accordance with the queue depth ranges you define, optionally to the point where, if the consumers basically stall, so will the producers. I don't know of a specific tool that does this, but between two JBoss servers it is fairly simple to whip up. Picking your queue-depth-to-producer-thread-count mapping will be the trickier part.
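That feedback loop is easy to sketch outside of JMS too. Here is a toy Go version of a simpler variant: rather than resizing a thread pool, the producer stalls whenever the sampled depth (a plain counter here, standing in for a real queue depth query) crosses a high-water mark:

package main

import (
    "fmt"
    "sync/atomic"
    "time"
)

const highWater = 50 // stall the producer above this depth

func main() {
    var depth atomic.Int64 // stands in for a sampled queue depth

    // Consumer: deliberately slower than the producer, so depth builds up.
    go func() {
        for {
            time.Sleep(20 * time.Millisecond)
            if depth.Load() > 0 {
                depth.Add(-1)
            }
        }
    }()

    // Producer: checks the sampled depth before each send and backs off.
    for i := 0; i < 200; i++ {
        for depth.Load() >= highWater {
            time.Sleep(50 * time.Millisecond) // wait for the consumer to catch up
        }
        depth.Add(1) // stand-in for producing one message
        time.Sleep(5 * time.Millisecond)
    }
    fmt.Println("final depth:", depth.Load())
}

The same shape works with a real depth query and a resizable producer pool; the tricky part, as noted, is choosing the thresholds.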
Having said all that, I am going to actually read the article you linked to.....
I've got the perfect thing for you: IBM provide a free command line tool called perfharness.
It's aimed at benchmarking JMS providers, i.e. measuring the throughput of queues (single or multiple) given different numbers of producing or consuming threads.
Some features:
Send and consume messages at a fixed rate (msg/s) or at the maximum rate possible on the queue
Use a specific number of threads
Use either JMS or native MQ
Can use data either generated randomly or taken from a file
Generates statistics telling you exactly how fast your queue is performing
The only downside is that it's not super intuitive, given the number of operations it supports. And IBM haven't open-sourced it, which is a shame. However, it sounds perfect for your purposes.