How to control how many tasks to run per executor in PySpark [duplicate] - performance

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.
How can a thread utilize two CPUs simultaneously? Wouldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where and how does Spark eventually allocate more than one CPU to a task in standalone mode?

To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application may use. If you leave spark.task.cpus = 1, then up to spark.cores.max Spark tasks will run concurrently.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only run 10/2=5 concurrent tasks. Given that each of your tasks needs (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).
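The arithmetic in this answer can be written out as a small sketch. The function name is made up for illustration and is not part of Spark. It also ignores how cores are split across executors, so the real number of slots can be lower when an executor's core count is not a multiple of spark.task.cpus.

```python
def max_concurrent_tasks(cores_max: int, task_cpus: int = 1) -> int:
    """Upper bound on concurrently running tasks for a given core budget.

    cores_max stands in for spark.cores.max and task_cpus for spark.task.cpus.
    This is a hypothetical helper, not a Spark API.
    """
    if cores_max < 1 or task_cpus < 1:
        raise ValueError("cores_max and task_cpus must be positive")
    # Each task reserves task_cpus cores, so only whole multiples fit.
    return cores_max // task_cpus


if __name__ == "__main__":
    print(max_concurrent_tasks(10, 2))
```

With the numbers from the answer (10 cores, 2 CPUs per task) this gives 5 concurrent tasks; with 3 CPUs per task only 3 tasks fit and one core stays unused.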

Difference between boundedElastic() vs parallel() scheduler

I'm new to Project Reactor and trying to understand the difference between the boundedElastic() and parallel() schedulers. The documentation says that boundedElastic() is used for blocking tasks and parallel() for non-blocking tasks.
Why does Project Reactor need to address the blocking scenario at all, given that it is non-blocking in nature? Can someone please help me out with a real-world use case for boundedElastic() vs parallel()?
The parallel flavor is backed by N workers (one per CPU), each based on a ScheduledExecutorService. If you submit N long-lived tasks to it, no more work can be executed, hence the affinity for short-lived tasks.
The elastic flavor is also backed by workers based on ScheduledExecutorService, except it creates these workers on demand and pools them.
boundedElastic is the same as elastic, except that you can limit the total number of threads.
https://spring.io/blog/2019/12/13/flight-of-the-flux-3-hopping-threads-and-schedulers
TL;DR
Reactor executes non-blocking/async tasks on a small number of threads. If a task blocks, its thread is blocked and every other task scheduled on that thread has to wait for it.
parallel should be used for fast non-blocking operations (the default option)
boundedElastic should be used to "offload" blocking tasks
In general, the Reactor API is concurrency-agnostic and uses the Scheduler abstraction to execute tasks. Schedulers have responsibilities very similar to those of an ExecutorService.
Schedulers.parallel()
Should be the default option; it is used for fast non-blocking operations on a small number of threads. By default, the number of threads is equal to the number of CPU cores. It can be controlled with the reactor.schedulers.defaultPoolSize system property.
Schedulers.boundedElastic()
Used to execute longer operations (blocking tasks) as part of the reactive flow. It uses a thread pool with a default size of 10 x the number of CPU cores (controlled by reactor.schedulers.defaultBoundedElasticSize) and a default queue size of 100000 tasks per thread (controlled by reactor.schedulers.defaultBoundedElasticQueueSize).
subscribeOn or publishOn could be used to change the scheduler.
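The following code shows how to wrap a blocking operation:
Mono.fromCallable(() -> {
    // blocking operation goes here; the Callable must return its result
    return blockingCall(); // blockingCall() is a hypothetical placeholder
}).subscribeOn(Schedulers.boundedElastic()); // run on a separate scheduler because the code is blocking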
Schedulers.newBoundedElastic()
Similar to Schedulers.boundedElastic() but is useful when you need to create a separate thread pool for some operation.
Sometimes it's not obvious which code is blocking. One very useful tool for testing reactive code is BlockHound.
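The offloading idea is not specific to Reactor. As a rough analogy only, here is a minimal Python sketch using asyncio: the event loop plays the role of the small non-blocking thread pool, and the default executor plays the role of the scheduler that absorbs blocking work. The function names are illustrative, and nothing here is Reactor API.

```python
import asyncio
import threading
import time


def blocking_io() -> int:
    """Stand-in for a blocking call (e.g. a JDBC query or file read)."""
    time.sleep(0.05)
    # Report which thread actually ran the blocking work.
    return threading.get_ident()


async def main() -> tuple:
    loop_thread = threading.get_ident()
    loop = asyncio.get_running_loop()
    # Offload the blocking call to the default thread pool so the
    # event loop thread stays free for other coroutines.
    worker_thread = await loop.run_in_executor(None, blocking_io)
    return loop_thread, worker_thread


def run_demo() -> tuple:
    return asyncio.run(main())


if __name__ == "__main__":
    loop_id, worker_id = run_demo()
    print("ran on a separate thread:", loop_id != worker_id)
```

Calling blocking_io() directly inside main() would stall the event loop for the whole sleep, which is the situation the TL;DR above warns about.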
Schedulers provides various Scheduler flavors usable by publishOn or subscribeOn:
1) parallel(): Optimized for fast Runnable non-blocking executions
2) single(): Optimized for low-latency Runnable one-off executions
3) elastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) can grow indefinitely
4) boundedElastic(): Optimized for longer executions, an alternative for blocking tasks where the number of active tasks (and threads) is capped
5) fromExecutorService(ExecutorService) to create new instances around Executors
https://projectreactor.io/docs/core/release/api/reactor/core/scheduler/Schedulers.html

Detecting CPU load on different machines

I am trying to create a background task scheduler for my process, which needs to run compute-intensive tasks in parallel while keeping the UI responsive.
Currently, I compare CPU usage (as a percentage) against a threshold (~50%) to decide whether the scheduler may start a new task, and it works reasonably well.
This program can run on a variety of hardware configurations (e.g. processor speed, number of cores), so the 50% limit can be too strict or too lenient for certain configurations.
Is there a good way to take CPU parameters such as core count and clock speed into account, so that the threshold is derived dynamically from the hardware configuration?
My suggestions:
Run as many threads as there are CPUs in the system.
Set the priority of each thread to idle (the lowest).
In each thread's main loop, do the smallest sleep possible, e.g. usleep(1).
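A minimal sketch of these suggestions in Python, under some assumptions: the helper name and its parameters are made up for illustration, and the priority step is left out because the Python standard library has no portable per-thread priority API. CPU-bound Python threads also share the interpreter lock, so treat this as an outline of the structure rather than a performance recipe.

```python
import os
import queue
import threading
import time


def run_background(tasks, workers=None, pause=0.0001):
    """Run callables on one worker thread per CPU, pausing briefly after each.

    workers defaults to the CPU count; pause is the short sleep taken in
    each worker's main loop so other threads (e.g. the UI) get a turn.
    """
    workers = workers or os.cpu_count() or 1
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = []
    lock = threading.Lock()

    def loop():
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            value = task()
            with lock:
                results.append(value)
            time.sleep(pause)  # yield briefly before taking the next task

    threads = [threading.Thread(target=loop, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    squares = run_background([lambda i=i: i * i for i in range(10)])
    print(sorted(squares))
```

Results arrive in completion order, not submission order, so callers that need ordering have to sort or tag them.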

Does PPL take the load of the system into account when creating threads or not?

I am starting to use PPL to create tasks and dispatch them [possibly] to other threads, like this:
Concurrency::task_group tasks;
auto simpleTask = Concurrency::make_task(&simpleFunction);
tasks.run(simpleTask);
I experimented with a small application that creates a task every second. Each task performs heavy calculations for 5 seconds and then stops.
I wanted to know how many threads PPL creates on my machine, and whether the load of the machine influences the number of threads or the tasks assigned to the threads. When I run one or more instances of my application on my 12-core machine, I notice this:
When running 1 application, it creates 6 threads. Total CPU usage is 50%.
When running 2 applications, both of them create 6 threads. Total CPU usage is 100% but my machine stays rather responsive.
When running 3 applications, all of them create 6 threads (already 18 threads in total). Total CPU usage is 100%.
When running 4 applications, I already have 24 threads in total.
I investigated the running applications with Process Explorer and with 4 applications I can clearly see that they all have 6 (sometimes even 12) threads that are all trying to consume as much CPU as possible.
PPL allows you to limit the number of threads by configuring the default scheduler, like this:
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
With this you statically limit the number of threads (2 in this case). It can be handy if you know beforehand that on a server with 24 cores there are 10 simultaneous users (so you can limit every application to 2 threads), but if one of the 10 users is working late, he still only uses 2 threads, while the rest of the machine is idling.
My question: is there a way to configure PPL so that it dynamically decides how many threads to create (or keep alive, or keep active) based on the load of the machine? Or does PPL already do this by default, and are my observations incorrect?
EDIT: I tried starting more instances of my test application, and although my machine remains quite responsive (I was wrong in the original question), I can't see the applications reducing their number of simultaneous actions.
The short answer to your question is "No." The default PPL scheduler and resource manager will only use process-local information to decide when to create/destroy threads. As stated in the Patterns and Practices article on MSDN:
The resource manager is a singleton that works across one process. It
does not coordinate processor resources across multiple
operating-system processes. If your application uses multiple,
concurrent processes, you may need to reduce the level of concurrency
in each process for optimum efficiency.
If you're willing to accept the complexity, you may be able to implement a custom scheduler/resource manager to take simple system-level performance readings (e.g. using the PDH functions) to achieve what you want.
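The kind of system-level reading suggested here can be outlined as follows. This is a language-neutral sketch in Python, not PPL or PDH code: it assumes the operating system's load average is an acceptable stand-in for a PDH counter, and the function names and the "cores minus load" rule are illustrative assumptions rather than anything the answer prescribes.

```python
import os


def allowed_workers(cpu_count: int, load_average: float, minimum: int = 1) -> int:
    """Workers to run right now: cores not already busy, never below minimum."""
    free = cpu_count - int(round(load_average))
    return max(minimum, min(cpu_count, free))


def current_allowed_workers() -> int:
    """Apply the rule to this machine, falling back to all cores."""
    cpus = os.cpu_count() or 1
    try:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    except (AttributeError, OSError):
        load = 0.0  # no reading available: assume the machine is idle
    return allowed_workers(cpus, load)


if __name__ == "__main__":
    print(current_allowed_workers())
```

A process would call this periodically and grow or shrink its own concurrency limit accordingly, which is the cross-process coordination the default resource manager does not provide.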

parallel programs in Multi Core processors

Assume I'm trying to run a parallel program with 3 different tasks on a quad-core processor.
My question is: when these tasks run simultaneously, will each one be computed on its own core of the processor,
or in what way are they executed simultaneously?
If you are using C# and the Task Parallel Library, then yes: they would get queued up in the thread pool and executed in parallel. But there are a few other factors that are very important to consider.
Such as:
- Is there any shared data?
- Does one task need to wait on another?
Also, the order of execution is not guaranteed.

What is the relation between number of thread and number of processor cores?

I am writing a server application that is based on a thread pool (IOCP), but I don't know how many threads are appropriate. Is the number of threads related to the number of processor cores?
If your work items never block, use threads = cores. If your threads never need to be descheduled you can max out all cores by creating one thread per core.
If your work items sometimes block (which they shouldn't do much if you want to make best use of IOCP) you need more threads. You need to measure how many.
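As a starting point for that measurement, one common rule of thumb scales the thread count by the ratio of time spent waiting to time spent computing. The answer itself does not state this formula, so treat it as an assumption; the function name is made up for illustration.

```python
import math


def suggested_threads(cores: int, wait_time: float = 0.0, compute_time: float = 1.0) -> int:
    """Rule-of-thumb pool size: cores * (1 + wait_time / compute_time).

    wait_time and compute_time are measured per work item, in the same unit.
    With no blocking (wait_time == 0) this reduces to threads = cores.
    """
    if cores < 1 or compute_time <= 0 or wait_time < 0:
        raise ValueError("cores and compute_time must be positive, wait_time non-negative")
    return math.ceil(cores * (1 + wait_time / compute_time))


if __name__ == "__main__":
    print(suggested_threads(8))            # work items never block
    print(suggested_threads(8, 1.0, 1.0))  # items wait as long as they compute
```

The result is only an initial guess; the measured throughput under real load decides the final number.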
A process is made up of one or more threads, and the number of threads does not depend on the number of cores. A single-core processor can run a multi-threaded process using various scheduling schemes. That said, if your processor has multiple cores, different threads can run truly concurrently. So to run multiple threads at the same instant you need multiple cores, but to run multiple threads that only appear simultaneous, a single core with a scheduler that switches between them is enough.
Some useful wiki pages for you:
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Thread_%28computing%29
http://en.wikipedia.org/wiki/Input/output_completion_port
http://en.wikipedia.org/wiki/Scheduling_%28computing%29
http://en.wikipedia.org/wiki/Thread_pool_pattern
