What is the relation between the number of threads and the number of processor cores? - windows

I am writing a server application that is thread-pool based (IOCP), but I don't know how many threads are appropriate. Is the right number of threads related to the number of processor cores?

If your work items never block, create one thread per core: threads that never need to be descheduled can max out all the cores.
If your work items sometimes block (which they shouldn't do often if you want to make the best use of IOCP), you need more threads, and you will have to measure how many.
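As a rough illustration, here is a minimal Win32 sketch of the one-thread-per-core setup; the I/O handling and shutdown logic are placeholders, not a complete server:

#include <windows.h>
#include <vector>

// Worker loop: each thread blocks on the completion port and handles
// whatever completion packets the kernel dequeues for it.
DWORD WINAPI WorkerThread(LPVOID param) {
    HANDLE iocp = static_cast<HANDLE>(param);
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    LPOVERLAPPED overlapped = nullptr;
    while (GetQueuedCompletionStatus(iocp, &bytes, &key, &overlapped, INFINITE)) {
        // ... process the completed I/O here (placeholder) ...
        // A real server would also define a shutdown packet, posted with
        // PostQueuedCompletionStatus, so that workers can exit cleanly.
    }
    return 0;
}

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD cores = si.dwNumberOfProcessors;

    // NumberOfConcurrentThreads = 0 tells the kernel to allow as many
    // concurrently running threads as there are processors.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    // One worker per core for non-blocking work items; create more
    // than that only if your handlers sometimes block.
    std::vector<HANDLE> workers;
    for (DWORD i = 0; i < cores; ++i)
        workers.push_back(CreateThread(NULL, 0, WorkerThread, iocp, 0, NULL));

    // ... associate sockets/files with the port and start I/O here ...

    WaitForMultipleObjects((DWORD)workers.size(), workers.data(), TRUE, INFINITE);
    return 0;
}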

A process is made up of one or more threads, and the number of threads is not tied to the number of cores. A single-core processor can run a multi-threaded process by time-slicing between the threads under some scheduling scheme. If you have multiple cores, different threads can run truly simultaneously. So to run multiple threads at the same instant you need multiple cores, but to run multiple threads concurrently, though not necessarily simultaneously (it can seem simultaneous), a single core with a scheduler is enough.
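To make the distinction concrete, here is a small self-contained C++ sketch (added here as an illustration, not part of the original answer): it launches twice as many threads as there are hardware cores, and all of them complete, because the OS scheduler time-slices them across whatever cores exist.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    // Number of hardware threads (cores, or logical processors with SMT).
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1; // the call may return 0 if it cannot tell

    // Launch twice as many threads as cores. All of them run to completion:
    // the scheduler time-slices them, so they are concurrent even when they
    // cannot all be simultaneous. (Console output may interleave.)
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < cores * 2; ++i)
        threads.emplace_back([i] { std::cout << "thread " << i << " ran\n"; });

    for (auto& t : threads)
        t.join();
    return 0;
}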
Some useful wiki pages for you:
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Thread_%28computing%29
http://en.wikipedia.org/wiki/Input/output_completion_port
http://en.wikipedia.org/wiki/Scheduling_%28computing%29
http://en.wikipedia.org/wiki/Thread_pool_pattern

Related

How to control how many tasks to run per executor in PySpark [duplicate]

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set "spark.task.cpus" to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of cpus per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster for the case where particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, you will have spark.cores.max concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max = 10 and spark.task.cpus = 2, Spark will only create 10 / 2 = 5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never exceed 10. This means that you never go above your initial contract (defined by spark.cores.max).

Puma clustering benefits for site which handles lots of uploads/downloads

I'm trying to understand the benefits of using Puma clustering. The GitHub documentation says that the number of Puma workers should be set to the number of CPU cores, and that the default number of threads for each is 0-16. Worker processes can run in parallel while threads run concurrently. It was my understanding that the MRI GIL only allows one thread across all cores to run Ruby code at a time, so how does Puma enable things to run in parallel / provide benefits over running one worker process with double the number of threads? The site I'm working on is heavily IO-bound, handling several uploads and downloads at the same time; any config suggestions for this setup are also welcome.
The workers in clustered mode will actually spawn new child processes, each of which has its own "GIL". Only one thread in a single process can be running Ruby code at a time, so having a process per CPU core works well, because each CPU can only be doing one thing at a time anyway. It also makes sense to run multiple threads per process, because if a thread is waiting on IO, another thread can execute in the meantime.

Optimal size of worker pool

I'm building a Go app which uses a "worker pool" of goroutines; initially I start the pool by creating a number of workers. I was wondering what the optimal number of workers would be on a multi-core processor, for example on a CPU with 4 cores. I'm currently using the following approach:
// init pool
numCPUs := runtime.NumCPU()
runtime.GOMAXPROCS(numCPUs + 1) // numCPUs hot threads + one for async tasks.
maxWorkers := numCPUs * 4
jobQueue := make(chan job.Job)

module := Module{
    Dispatcher: job.NewWorkerPool(maxWorkers),
    JobQueue:   jobQueue,
    Router:     router,
}

// A buffered channel that we can send work requests on.
module.Dispatcher.Run(jobQueue)
The complete implementation is under job.NewWorkerPool(maxWorkers) and module.Dispatcher.Run(jobQueue).
My use case for the worker pool: I have a service which accepts requests, calls multiple external APIs, and aggregates their results into a single response. Each call can be made independently of the others, as the order of results doesn't matter. I dispatch the calls to the worker pool, where each call is made asynchronously in one available goroutine. My "request" thread keeps listening on the return channels, fetching and aggregating results as soon as a worker is done; when all are done, the final aggregated result is returned as the response. Since each external API call may have a variable response time, some calls complete earlier than others. As I understand it, doing this in parallel should perform better than calling each external API synchronously, one after another.
The comments in your sample code suggest you may be conflating the two concepts of GOMAXPROCS and a worker pool. These two concepts are completely distinct in Go.
GOMAXPROCS sets the maximum number of CPU threads the Go runtime will use. It defaults to the number of CPU cores found on the system, and should almost never be changed. The only time I can think of to change it would be if you wanted to explicitly limit a Go program to fewer than the available CPUs for some reason; then you might set it to 1, for example, even when running on a 4-core CPU. This should only ever matter in rare situations.
TL;DR: Never set runtime.GOMAXPROCS manually.
Worker pools in Go are a set of goroutines, which handle jobs as they arrive. There are different ways of handling worker pools in Go.
What number of workers should you use? There is no objective answer. Probably the only way to know is to benchmark various configurations until you find one that meets your requirements.
As a simple case, suppose your worker pool is doing something very CPU-intensive. In this case, you probably want one worker per CPU.
As a more likely example, though, let's say your workers are doing something more I/O bound--such as reading HTTP requests, or sending email via SMTP. In this case, you may reasonably handle dozens or even thousands of workers per CPU.
And then there's also the question of if you even should use a worker pool. Most problems in Go do not require worker pools at all. I've worked on dozens of production Go programs, and never once used a worker pool in any of them. I've also written many times more one-time-use Go tools, and only used a worker pool maybe once.
And finally, the only way in which GOMAXPROCS and worker pools relate is the same way goroutines in general relate to GOMAXPROCS. From the docs:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
From this simple description, it's easy to see that there can be many more goroutines (potentially hundreds of thousands, or more) than GOMAXPROCS--GOMAXPROCS only limits how many "operating system threads can execute user-level Go code simultaneously", and goroutines which aren't executing user-level Go code at the moment don't count. I/O-bound goroutines (such as those waiting for a network response) aren't executing code while they wait. So the theoretical maximum number of goroutines is limited only by your system's available memory.

Detecting CPU load on different machines

I am trying to create a background task scheduler for my process, which needs to schedule tasks (compute-intensive) in parallel while maintaining the responsiveness of the UI.
Currently, I compare the CPU usage (as a percentage) against a threshold (~50%) to decide when the scheduler may start a new task, and it sort of works fine.
This program can run on a variety of hardware configurations (e.g. processor speed, number of cores), so a 50% limit can be too harsh or too soft for certain configurations.
Is there any good way to take different parameters of the CPU configuration (e.g. core count, clock speed) into account and dynamically come up with a threshold based on the hardware?
My suggestions (a sketch follows the list):
Run as many threads as there are CPUs in the system.
Set the priority of each thread to idle (lowest).
In each thread's main loop, do the smallest sleep possible, e.g. usleep(1).
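Here is a minimal sketch of those three suggestions using the Win32 API (the usleep(1) above suggests POSIX, where the same idea applies with pthreads; on Windows it becomes Sleep(1), and the idle priority is set with SetThreadPriority). The compute work itself is left as a placeholder:

#include <windows.h>
#include <vector>

// Background worker: runs at idle priority, so it only receives CPU time
// that foreground/UI threads are not using.
DWORD WINAPI BackgroundWorker(LPVOID) {
    for (;;) {
        // ... do one small slice of compute-intensive work here ...

        // A tiny sleep gives the scheduler a chance to re-evaluate;
        // the idle priority does the real throttling.
        Sleep(1);
    }
    return 0;
}

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // One background thread per CPU, each at the lowest priority.
    std::vector<HANDLE> threads;
    for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i) {
        HANDLE h = CreateThread(NULL, 0, BackgroundWorker, NULL, 0, NULL);
        SetThreadPriority(h, THREAD_PRIORITY_IDLE);
        threads.push_back(h);
    }

    // The workers never exit in this sketch; block here indefinitely.
    WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
    return 0;
}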

Does PPL take the load of the system into account when creating threads or not?

I am starting to use PPL to create tasks and dispatch them [possibly] to other threads, like this:
Concurrency::task_group tasks;
auto simpleTask = Concurrency::make_task(&simpleFunction);
tasks.run(simpleTask);
I experimented with a small application that creates a task every second. Each task performs heavy calculations for 5 seconds and then stops.
I wanted to know how many threads PPL creates on my machine and whether the load of the machine influences the number of threads or the tasks assigned to the threads. When I run one or more instances of my application on my 12-core machine, I notice this:
When running 1 application, it creates 6 threads. Total CPU usage is 50%.
When running 2 applications, both of them create 6 threads. Total CPU usage is 100% but my machine stays rather responsive.
When running 3 applications, all of them create 6 threads (already 18 threads in total). Total CPU usage is 100%.
When running 4 applications, I already have 24 threads in total.
I investigated the running applications with Process Explorer and with 4 applications I can clearly see that they all have 6 (sometimes even 12) threads that are all trying to consume as much CPU as possible.
PPL allows you to limit the number of threads by configuring the default scheduler, like this:
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
With this you statically limit the number of threads (2 in this case). It can be handy if you know beforehand that on a server with 24 cores there are 10 simultaneous users (so you can limit every application to 2 threads), but if one of the 10 users is working late, he still only uses 2 threads, while the rest of the machine is idling.
My question: is there a way to configure PPL so that it dynamically decides how many threads to create (or keep alive, or keep active) based on the load of the machine? Or does PPL already do this by default, and are my observations incorrect?
EDIT: I tried starting more instances of my test application, and although my machine remains quite responsive (I was wrong in the original question) I can't see the applications reducing their number of simultaneous actions.
The short answer to your question is "No." The default PPL scheduler and resource manager will only use process-local information to decide when to create/destroy threads. As stated in the Patterns and Practices article on MSDN:
The resource manager is a singleton that works across one process. It does not coordinate processor resources across multiple operating-system processes. If your application uses multiple, concurrent processes, you may need to reduce the level of concurrency in each process for optimum efficiency.
If you're willing to accept the complexity, you may be able to implement a custom scheduler/resource manager to take simple system-level performance readings (e.g. using the PDH functions) to achieve what you want.
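For example, here is a minimal sketch of sampling total CPU load with PDH (error handling omitted); a custom scheduler or resource manager could poll a value like this to decide how many worker threads to keep active:

#include <windows.h>
#include <pdh.h>
#include <iostream>
#pragma comment(lib, "pdh.lib")

int main() {
    PDH_HQUERY query = NULL;
    PDH_HCOUNTER counter = NULL;

    // Total CPU usage across all cores, as a percentage.
    PdhOpenQuery(NULL, 0, &query);
    PdhAddEnglishCounter(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);

    // The counter is rate-based, so it needs two samples some time apart.
    PdhCollectQueryData(query);
    Sleep(1000);
    PdhCollectQueryData(query);

    PDH_FMT_COUNTERVALUE value;
    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
    std::wcout << L"System CPU load: " << value.doubleValue << L"%\n";

    PdhCloseQuery(query);
    return 0;
}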
