Detecting CPU load on different machines - performance

I am trying to create a background task scheduler for my process, which needs to run compute-intensive tasks in parallel while keeping the UI responsive.
Currently, I compare CPU usage (percentage) against a threshold (~50%) to decide when the scheduler may start a new task, and it works reasonably well.
This program can run on a variety of hardware configurations (e.g. processor speed, number of cores), so a 50% limit can be too strict or too lenient for certain configurations.
Is there a good way to take the CPU configuration (e.g. core count, clock speed) into account and derive a threshold dynamically from the hardware?

My suggestions:
Run as many threads as CPUs in the system.
Set the priority of each thread to idle (the lowest).
In each thread's main loop, do the smallest sleep possible, e.g. usleep(1) (see the sketch below).
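A minimal sketch of this probe idea on Windows (the tick-counting logic and all names are illustrative, not a tested implementation):

#include <windows.h>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long long> idleTicks{0};  // advances only when spare CPU time exists

void probeLoop() {
    // Idle priority: this thread runs only when nothing else wants the CPU.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);
    for (;;) {
        Sleep(1);   // shortest practical sleep (Windows analogue of usleep(1))
        ++idleTicks;
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();  // one probe per CPU
    std::vector<std::thread> probes;
    for (unsigned i = 0; i < n; ++i) probes.emplace_back(probeLoop);
    // The scheduler can sample idleTicks periodically: the closer the tick rate
    // is to n ticks per millisecond, the more headroom there is for a new task.
    for (auto& t : probes) t.join();
}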

Related

Does the Windows scheduler sometimes fail to preempt a running thread immediately to let a higher-priority thread run?

My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.
I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
But the speedup is slight, and the reason is that some of the time, one of the worker threads takes way longer than usual. I have instrumented the operation, and I see that sometimes the worker threads take a long time to start - delay varies from 20 microseconds on average to dozens of milliseconds depending on system load. The master thread does not show this delay.
That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.
But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?
I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher than the priority of any running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
I guess one last possibility is that the task switch might take 20 usec. Is that plausible?
I have a 4-core system without hyperthreading.
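For reference, a minimal sketch of the setup and instrumentation described above (the function name and timing output are illustrative):

#include <omp.h>
#include <cstdio>
#include <vector>

void addVectors(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& out) {
    double t0 = omp_get_wtime();               // taken just before the parallel region
    #pragma omp parallel
    {
        double delay = omp_get_wtime() - t0;   // per-thread startup delay
        #pragma omp for
        for (long i = 0; i < (long)out.size(); ++i)
            out[i] = a[i] + b[i];
        std::printf("thread %d started %.1f us late\n",
                    omp_get_thread_num(), delay * 1e6);
    }
}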

How to control how many tasks to run per executor in PySpark [duplicate]

I don't quite understand the spark.task.cpus parameter. It seems to me that a "task" corresponds to a "thread" or a "process", if you will, within the executor. Suppose that I set spark.task.cpus to 2.
How can a thread utilize two CPUs simultaneously? Couldn't it require locks and cause synchronization problems?
I'm looking at the launchTask() function in deploy/executor/Executor.scala, and I don't see any notion of "number of CPUs per task" there. So where/how does Spark eventually allocate more than one CPU to a task in standalone mode?
To the best of my knowledge, spark.task.cpus controls the parallelism of tasks in your cluster in the case where some particular tasks are known to have their own internal (custom) parallelism.
In more detail:
We know that spark.cores.max defines how many threads (aka cores) your application needs. If you leave spark.task.cpus = 1, you will have spark.cores.max concurrent Spark tasks running at the same time.
You will only want to change spark.task.cpus if you know that your tasks are themselves parallelized (maybe each of your tasks spawns two threads, interacts with external tools, etc.). By setting spark.task.cpus accordingly, you become a good "citizen". Now if you have spark.cores.max=10 and spark.task.cpus=2, Spark will only create 10/2 = 5 concurrent tasks. Given that your tasks need (say) 2 threads internally, the total number of executing threads will never be more than 10. This means that you never go above your initial contract (defined by spark.cores.max).

Individual Spark tasks consume more computation time when more cores are assigned

I am running a Spark job with an input file of size 6.6 GB (HDFS) with master set to local. My Spark job with 53 partitions completes more quickly when I assign local[6] than local[2]; however, the individual tasks take more computation time when more cores are used. Say, if I assign 1 core (local[1]), each task takes 3 seconds, whereas the same task takes up to 12 seconds if I assign 6 cores (local[6]). Where does the time get wasted? The Spark UI shows an increase in computation time for each task in the local[6] case, and I can't understand why the same code takes different computation time when more cores are assigned.
Update:
I can see more %iowait in the iostat output when I use local[6] than local[1]. Please let me know whether this is the only reason, or whether there are other possible causes. I also wonder why this iowait is not reported in the Spark UI; the increase in computing time is larger than the iowait time.
I am assuming you are referring to spark.task.cpus and not spark.cores.max
With spark.task.cpus, each task gets assigned more cores, but it doesn't necessarily have to use them. If your process is single-threaded, it really can't use them. You wind up with additional overhead without additional benefit, and those cores are taken away from other single-threaded tasks that could use them.
With spark.cores.max, it is simply an overhead issue of transferring data around at the same time.

Does PPL take the load of the system into account when creating threads or not?

I am starting to use PPL to create tasks and dispatch them [possibly] to other threads, like this:
#include <ppl.h>
void simpleFunction();  // heavy computation, defined elsewhere

Concurrency::task_group tasks;
auto simpleTask = Concurrency::make_task(&simpleFunction);
tasks.run(simpleTask);
I experimented with a small application that creates a task every second. Each task performs heavy calculations for 5 seconds and then stops.
I wanted to know how many threads PPL creates on my machine and whether the load of the machine influences the number of threads or the tasks assigned to the threads. When I run one or more instances of my application on my 12-core machine, I notice this:
When running 1 application, it creates 6 threads. Total CPU usage is 50%.
When running 2 applications, both of them create 6 threads. Total CPU usage is 100% but my machine stays rather responsive.
When running 3 applications, all of them create 6 threads (already 18 threads in total). Total CPU usage is 100%.
When running 4 applications, I already have 24 threads in total.
I investigated the running applications with Process Explorer and with 4 applications I can clearly see that they all have 6 (sometimes even 12) threads that are all trying to consume as much CPU as possible.
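To see what the default scheduler is actually configured to, you can read its policy back. A small sketch (assuming <concrt.h> and the documented CurrentScheduler API):

#include <concrt.h>
#include <iostream>

int main() {
    // Read the policy of the current (default) scheduler.
    Concurrency::SchedulerPolicy policy = Concurrency::CurrentScheduler::GetPolicy();
    std::cout << "MinConcurrency: "
              << policy.GetPolicyValue(Concurrency::MinConcurrency) << "\n"
              << "MaxConcurrency: "
              << policy.GetPolicyValue(Concurrency::MaxConcurrency) << "\n";
}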
PPL allows you to limit the number of threads by configuring the default scheduler, like this:
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
With this you statically limit the number of threads (2 in this case). It can be handy if you know beforehand that on a server with 24 cores there are 10 simultaneous users (so you can limit every application to 2 threads), but if one of the 10 users is working late, he still only uses 2 threads, while the rest of the machine is idling.
My question: is there a way to configure PPL so that it dynamically decides how many threads to create (or keep alive, or keep active) based on the load of the machine? Or does PPL already do this by default, and are my observations incorrect?
EDIT: I tried starting more instances of my test application, and although my machine remains quite responsive (I was wrong in the original question), I can't see the applications reducing their number of simultaneous actions.
The short answer to your question is "No." The default PPL scheduler and resource manager will only use process-local information to decide when to create/destroy threads. As stated in the Patterns and Practices article on MSDN:
The resource manager is a singleton that works across one process. It does not coordinate processor resources across multiple operating-system processes. If your application uses multiple, concurrent processes, you may need to reduce the level of concurrency in each process for optimum efficiency.
If you're willing to accept the complexity, you may be able to implement a custom scheduler/resource manager to take simple system-level performance readings (e.g. using the PDH functions) to achieve what you want.
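As an example, a sketch of polling total CPU usage with PDH, which such a custom resource manager could use (error handling omitted; the counter path is the standard English name):

#include <windows.h>
#include <pdh.h>
#include <iostream>
#pragma comment(lib, "pdh.lib")

int main() {
    PDH_HQUERY query;
    PDH_HCOUNTER counter;
    PdhOpenQuery(nullptr, 0, &query);
    PdhAddEnglishCounterW(query, L"\\Processor(_Total)\\% Processor Time", 0, &counter);
    PdhCollectQueryData(query);   // first sample; this counter needs two samples
    Sleep(1000);
    PdhCollectQueryData(query);
    PDH_FMT_COUNTERVALUE value;
    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value);
    std::cout << "System CPU usage: " << value.doubleValue << "%\n";
    PdhCloseQuery(query);
}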

What is the relation between number of thread and number of processor cores?

I am writing a server application that is thread-pool based (IOCP). But I don't know how many threads are appropriate. Is the number of threads related to the number of processor cores?
If your work items never block, use threads = cores. If your threads never need to be descheduled, you can max out all cores by creating one thread per core.
If your work items sometimes block (which they shouldn't do much if you want to make the best use of IOCP), you need more threads. You need to measure how many.
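As a sketch of the threads-equals-cores approach with an IOCP (work-item handling and shutdown are left as stubs):

#include <windows.h>
#include <thread>
#include <vector>

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    // Create a completion port whose concurrency value matches the core count.
    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, cores);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([iocp] {
            DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
            // Each worker blocks here until a completion packet is queued.
            while (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
                // process the dequeued work item here
            }
        });
    }
    for (auto& t : workers) t.join();
}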
A process is made up of one or more threads, and the number of threads is not dependent on the number of cores. A single-core processor can handle a multi-threaded process using various scheduling schemes. That said, if you have multiple cores on your processor, different threads can run concurrently. So to run multiple threads at the same instant you need multiple cores, but to run multiple threads that merely appear simultaneous, a single core with a scheduling system is enough.
Some useful wiki pages for you:
http://en.wikipedia.org/wiki/Computer_multitasking
http://en.wikipedia.org/wiki/Thread_%28computing%29
http://en.wikipedia.org/wiki/Input/output_completion_port
http://en.wikipedia.org/wiki/Scheduling_%28computing%29
http://en.wikipedia.org/wiki/Thread_pool_pattern
