OpenCL events ambiguity - parallel-processing

Referring to the clGetEventProfilingInfo documentation, the cl_event returned by clEnqueueNDRangeKernel can be queried for:
CL_PROFILING_COMMAND_QUEUED
when the command identified by event is enqueued in a command-queue by the host.
CL_PROFILING_COMMAND_SUBMIT
when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START
when the command identified by event starts execution on the device.
CL_PROFILING_COMMAND_END
when the command identified by event has finished execution on the device.
Assume the whole kernel profiling is visualized as:
COMMAND_QUEUED -> COMMAND_SUBMIT -> COMMAND_START -> COMMAND_END
& the corresponding timeline:
Queueing -> Submitting -> Executing
Where:
Queueing = COMMAND_SUBMIT - COMMAND_QUEUED
Submitting = COMMAND_START - COMMAND_SUBMIT
Executing = COMMAND_END - COMMAND_START
Questions:
Are my equations correct? If so, what's the real difference between queueing and submitting?
In other words, if I want to divide the whole process into COMMUNICATION (offloading) time and COMPUTATION (executing) time, what will their equations be?

Your interpretation is essentially correct. QUEUED is when you called the OpenCL API (such as clEnqueueNDRangeKernel). SUBMIT is when the runtime gave the work to the device. START is when it started execution, and END is when the execution finished. There are three intervals between these four times. The first is spent idle on the host, the second idle on the device, and the third executing on the device. If you wish to combine the first two into "communication", add them together (or use COMMAND_START - COMMAND_QUEUED).
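For illustration, a minimal sketch of reading those four timestamps back, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and that queue, kernel and gws already exist (error checking omitted):

cl_event evt;
cl_ulong t_queued, t_submit, t_start, t_end;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);  /* profiling values are only valid once the command has completed */

clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(t_queued), &t_queued, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(t_submit), &t_submit, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(t_start),  &t_start,  NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(t_end),    &t_end,    NULL);

/* all values are device timestamps in nanoseconds */
double queueing_ms      = (t_submit - t_queued) * 1e-6;
double submitting_ms    = (t_start  - t_submit) * 1e-6;
double executing_ms     = (t_end    - t_start)  * 1e-6;
double communication_ms = (t_start  - t_queued) * 1e-6;  /* queueing + submitting */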

Are my equations correct?
Yes.
If so, what's the real difference between queueing and submitting? In other words, if I want to divide the whole process into COMMUNICATION (offloading) time and COMPUTATION (executing) time, what will their equations be?
Queueing:
Time spent waiting for other tasks to finish so the current one can start. In other words, waiting for all dependent events to reach the CL_COMPLETE state, or for free resources in the queue the command was enqueued to.
Note: CPU devices have zero queue time when enqueueing to an idle device, because they are synchronous, while GPUs will ALWAYS have some small queueing time (due to their asynchronous behaviour). This is the reason to pipeline as much work as possible to GPU devices.
Submitting:
Time spent preparing the current task (compiling the kernel, e.g. LLVM, moving buffers, preparing device cores, etc.). It should be small, but not zero.
If you are looking for a formula, only "Submitting" and "Executing" are relevant to the current task's overhead. Ignore queueing, since it does not depend on your task:
Active% = Executing/(Executing+Submitting)
Overhead% = Submitting/(Executing+Submitting)
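In code, continuing the variable names assumed in the earlier profiling sketch, the two ratios would simply be:

double active_pct   = 100.0 * executing_ms  / (executing_ms + submitting_ms);
double overhead_pct = 100.0 * submitting_ms / (executing_ms + submitting_ms);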

Related

How is the current state of a task preserved when it is split across multiple time quanta in the round robin scheduling algorithm?

Assuming a single thread, what keeps the task from just running until completion in the round robin algorithm?
Is there some sort of watchdog mechanism to keep this from happening?
In a cooperative scheduling system, nothing. A task generally has to call some OS function (either an explicit yield or something else that may implicitly yield, like a message get function).
In a pre-emptive scheduling system, they are pre-empted (obviously) by the OS, the state is saved, and the next task is restored and run.
For example, Linux gives (from memory) a 100ms quantum to each thread. The thread can relinquish its quantum early (and it's often treated nicely if it does so), but if it uses its entire quantum, it's forcefully paused by the OS.
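To make the voluntary/cooperative case concrete, here is a small hedged C sketch: a compute loop that periodically gives up the rest of its time slice (do_work_chunk is a hypothetical unit of work):

#include <sched.h>

/* do_work_chunk() stands in for one small piece of the computation */
void run(long iterations)
{
    for (long i = 0; i < iterations; ++i) {
        do_work_chunk(i);
        if (i % 1024 == 0)
            sched_yield();   /* voluntarily relinquish the rest of the time slice */
    }
}

Under a pre-emptive scheduler the loop would be paused at the end of its quantum even without the sched_yield call; under a purely cooperative scheduler it would run until it yields.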

Runtime of GPU-based simulation unexplainable?

I am developing a GPU-based simulation using OpenGL and GLSL shaders, and I found that performance increases when I add additional (unnecessary) GL commands.
The simulation runs entirely on the GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL uniform locations and removed any GL state requests (glGet* etc.) to maximize speed. To measure wall-clock time I've put a glFinish after the main loop and take the elapsed time afterwards.
CASE A:
Normal total runtime for all iterations is 490ms.
CASE B:
Now, if I add a single additional glGetUniformLocation(...) command at the end of EACH time step, it requires only 475ms in total, which is 3 percent faster. (Please note that this is relevant to me since later I will perform a lot more time steps.)
I've looked at a timeline captured with Nvidia Nsight and found that, in case A, all OpenGL commands are issued within the first 140ms and the glFinish takes 348ms until completion of all GPU work. In case B the issuing of OpenGL commands is spread out over a significantly longer time (410ms) and the glFinish only takes 64ms, yielding the faster 475ms in total.
I also noticed that the hardware command queue is much fuller with work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packages issued more uniformly to the hardware queue over time in case A?
How can speed be enhanced without adding additional commands?
I am using Visual c++, VS2008 on Win7 x64.
IMHO this question cannot be answered definitively. For what it's worth, I experimentally determined that glFinish (and ...SwapBuffers, for that matter) have weird runtime behavior. I'm currently developing my own VR rendering library, and prior to that I spent a significant amount of time profiling the timelines of OpenGL commands and their interaction with the graphics system. What I found out was that the only consistent thing is that glFinish + ...SwapBuffers have very inconsistent timing behavior.
What could happen is that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards, it may use a different method for waiting for the GPU (for example, it may spin in a while loop waiting for a flag) than if you just call glFinish on its own (it may, for example, wait for a signal or a condition variable and thus be subject to the kernel's scheduling behavior).
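For reference, the measurement pattern described in the question looks roughly like this on Windows (a sketch; run_time_step stands in for the GL commands issued per time step):

#include <windows.h>
#include <GL/gl.h>

LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&t0);

for (int step = 0; step < 2500; ++step)
    run_time_step();        /* hypothetical: issues the GL commands for one time step */

glFinish();                 /* block until the GPU has finished all queued work */
QueryPerformanceCounter(&t1);

double total_ms = 1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

With this pattern, any driver-side batching of the per-step commands ends up being attributed to the final glFinish, which matches the effect visible in the Nsight timeline.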

Why one non-voluntary context switch per second?

The OS is RHEL 6 (2.6.32). I have isolated a core and am running a compute intensive thread on it. /proc/{thread-id}/status shows one non-voluntary context switch every second.
The thread in question is a SCHED_NORMAL thread and I don't want to change this.
How can I reduce this number of non-voluntary context switches? Does this depend on any scheduling parameters in /proc/sys/kernel?
EDIT: Several responses suggest alternative approaches. Before going that route, I first want to understand why I am getting exactly one non-voluntary context switch per second even over hours of run. For example, is this caused by CFS? If so, which parameters and how?
EDIT2: Further clarification - first question I would like an answer to is the following: Why am I getting one non-voluntary context switch per second instead of, say, one switch every half or two seconds?
This is a guess, but an educated one: since you use an isolated CPU, the scheduler does not schedule any task except your own on it, with one exception. The vmstat code in the kernel has a timer that schedules a single work queue item on each CPU once per second to calculate memory usage statistics, and this is what you are seeing get scheduled each second.
The work queue code is smart enough not to schedule the work queue kernel thread if the core is 100% idle, but it does so when the core is running a single task.
You can verify this using ftrace. If the sched_switch tracer shows that the entity you switch to once every second or so (the value is rounded to the nearest jiffy, and the timer does not count when the CPU is idle, so this might skew the timing) is the events/CPU_NUMBER task (or keventd for older kernels), then it's almost 100% certain that the cause is indeed the vmstat_update function setting its timer to queue a work queue item every second, which the events kernel thread runs.
Note that the cycle at which vmstat sets its timer is configurable - you can set it to another value via the vm.stat_interval sysctl knob. Increasing this value will give you a lower rate of such interruptions at the cost of less accurate memory usage statistics.
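For example, a hedged sketch of raising the interval to 60 seconds from C (equivalent to sysctl vm.stat_interval=60; requires root):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/stat_interval", "w");  /* backing file of vm.stat_interval */
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "60\n");   /* run vmstat_update every 60 seconds instead of every second */
    fclose(f);
    return 0;
}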
I maintain a wiki with all the sources of interruptions to isolated CPU workloads here. I also have a patch in the works for getting vmstat to not schedule the work queue item if there is no change between one vmstat work queue run and the next - such as would happen if your single task on the CPU does not use any dynamic memory allocations. Not sure it will benefit you, though - it depends on your work load.
I strongly suggest you try to optimize the code itself so that when it's running on a CPU, you get the maximum out of it.
Anyhow, I am not sure this will work, but give it a try anyway and let us know:
What I'd basically do is set the scheduling policy to FIFO and then give the process the maximum priority possible.
#include <sched.h>
#include <stdio.h>   /* for perror */

struct sched_param sp;
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);  /* highest FIFO priority */

int ret = sched_setscheduler(0, SCHED_FIFO, &sp);        /* 0 = the calling process */
if (ret == -1) {
    perror("sched_setscheduler");
    return 1;
}
Please keep in mind that any blocking call your process makes is MOST LIKELY going to cause the scheduler to take it off the CPU.
Source
Man page
EDIT:
Sorry, just noticed the pthread tag; the concept still holds so check out this man page:
http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setschedparam.3.html
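For completeness, a minimal sketch of the same idea with the pthread API (setting SCHED_FIFO at maximum priority for the calling thread; requires appropriate privileges):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

struct sched_param sp;
sp.sched_priority = sched_get_priority_max(SCHED_FIFO);

int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
if (err != 0)   /* pthread functions return the error number directly */
    fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));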
If one interrupt per second on your dedicated CPU is still too much, then you really need to not go through the normal scheduler at all. May I suggest the real-time and isochronous priority levels, that can leave your process scheduled more reliably than the usual pre-emptive mechanisms?

How do you limit a process' CPU usage on Windows? (need code, not an app)

There are programs that are able to limit the CPU usage of processes in Windows, for example BES and ThreadMaster. I need to write my own program that does the same thing as these programs, but with different configuration capabilities. Does anybody know how the CPU throttling of a process is done (in code)? I'm not talking about setting the priority of a process, but rather about limiting its CPU usage to, for example, 15%, even if there are no other processes competing for CPU time.
Update: I need to be able to throttle any process that is already running and that I have no source code access to.
You probably want to run the process(es) in a job object, and set the maximum CPU usage for the job object with SetInformationJobObject, with JOBOBJECT_CPU_RATE_CONTROL_INFORMATION.
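A rough sketch of that approach (JOBOBJECT_CPU_RATE_CONTROL_INFORMATION requires Windows 8 / Server 2012 or later; hProcess is assumed to be a handle to the already-running target process, e.g. from OpenProcess with PROCESS_SET_QUOTA | PROCESS_TERMINATE; error handling omitted):

#include <windows.h>

HANDLE job = CreateJobObject(NULL, NULL);

JOBOBJECT_CPU_RATE_CONTROL_INFORMATION rate = {0};
rate.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                    JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
rate.CpuRate = 15 * 100;   /* CpuRate is in 1/100ths of a percent: 1500 == 15% */

SetInformationJobObject(job, JobObjectCpuRateControlInformation, &rate, sizeof(rate));
AssignProcessToJobObject(job, hProcess);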
Very simplified, it could work somehow like this:
Create a periodic waitable timer with some reasonable small wait time (maybe 100ms). Get a "last" value for each relevant process by calling GetProcessTimes once.
Loop forever, blocking on the timer.
Each time you wake up:
if the affinity mask retrieved by GetProcessAffinityMask is 0, call SetProcessAffinityMask(old_value). This means we suspended that process in our last iteration, and we're now giving it a chance to run again.
else call GetProcessTimes to get the "current" value
call GetSystemTimeAsFileTime
calculate delta by subtracting last from current
cpu_usage = (deltaKernelTime + deltaUserTime) / (deltaTime)
if that's more than you want call old_value = GetProcessAffinityMask followed by SetProcessAffinityMask(0) which will take the process offline.
This is basically a very primitive version of the scheduler that runs in the kernel, implemented in userland. It puts a process "to sleep" for a small amount of time if it has used more CPU time than you deem right. A more sophisticated measurement, for example averaging over one or five seconds, would be possible (and probably desirable).
You might be tempted to suspend all threads in the process instead. However, it is important not to fiddle with priorities and not to use SuspendThread unless you know exactly what a program is doing, as this can easily lead to deadlocks and other nasty side effects. Think for example of suspending a thread holding a critical section while another thread is still running and trying to acquire the same object. Or imagine your process gets swapped out in the middle of suspending a dozen threads, leaving half of them running and the other half dead.
Setting the affinity mask to zero on the other hand simply means that from now on no single thread in the process gets any more time slices on any processor. Resetting the affinity gives -- atomically, at the same time -- all threads the possibility to run again.
Unfortunately, SetProcessAffinityMask does not return the old mask the way SetThreadAffinityMask does, at least according to the documentation. Therefore an extra Get... call is necessary.
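A condensed sketch of the loop described above (illustrative only; hProcess is the target process handle, Sleep stands in for the periodic waitable timer, and a saved-mask variable replaces re-reading the affinity each time):

#include <windows.h>

void throttle_loop(HANDLE hProcess, double limit)   /* limit: e.g. 0.15 for 15% */
{
    FILETIME ftCreate, ftExit, ftKernel, ftUser, ftNow;
    ULONGLONG lastCpu = 0, lastWall = 0;
    DWORD_PTR savedMask = 0, systemMask = 0;

    for (;;) {
        Sleep(100);   /* stand-in for the periodic waitable timer */

        if (savedMask != 0) {
            SetProcessAffinityMask(hProcess, savedMask);  /* we parked it last round; let it run again */
            savedMask = 0;
            continue;
        }

        GetProcessTimes(hProcess, &ftCreate, &ftExit, &ftKernel, &ftUser);
        GetSystemTimeAsFileTime(&ftNow);

        ULONGLONG cpu = (((ULONGLONG)ftKernel.dwHighDateTime << 32) | ftKernel.dwLowDateTime)
                      + (((ULONGLONG)ftUser.dwHighDateTime   << 32) | ftUser.dwLowDateTime);
        ULONGLONG wall = ((ULONGLONG)ftNow.dwHighDateTime << 32) | ftNow.dwLowDateTime;

        if (lastWall != 0 && (double)(cpu - lastCpu) / (double)(wall - lastWall) > limit) {
            GetProcessAffinityMask(hProcess, &savedMask, &systemMask);
            SetProcessAffinityMask(hProcess, 0);          /* take the process offline */
        }
        lastCpu = cpu;
        lastWall = wall;
    }
}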
CPU usage is fairly simple to estimate using QueryProcessCycleTime. The machine's processor speed can be obtained from HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\<n>\~MHz (where <n> is the processor number; there is one entry for each processor present). With these values, you can estimate your process's CPU usage and yield the CPU as necessary using Sleep() to keep your usage in bounds.

What effect does changing the process priority have in Windows?

If you go into Task Manager, right-click a process, and set its priority to Realtime, it often stops programs from crashing, or makes them run faster.
In a programming context, what does this do?
It calls SetPriorityClass().
Every thread has a base priority level determined by the thread's
priority value and the priority class of its process. The system uses
the base priority level of all executable threads to determine which
thread gets the next slice of CPU time. The SetThreadPriority function
enables setting the base priority level of a thread relative to the
priority class of its process. For more information, see Scheduling
Priorities.
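A minimal illustration of what the Task Manager action corresponds to in code (raising the current process to HIGH_PRIORITY_CLASS; a request for REALTIME_PRIORITY_CLASS is silently reduced to HIGH_PRIORITY_CLASS unless the caller holds the "increase scheduling priority" privilege):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    if (!SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS))
        fprintf(stderr, "SetPriorityClass failed: %lu\n", GetLastError());
    return 0;
}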
It tells the Windows scheduler to be more or less greedy when allocating execution time slices to your process. Realtime execution makes it never yield execution (not even to drivers, according to MSDN), which may cause stalls in your app if it waits on external events but does no yielding of its own (like Sleep, SwitchToThread or WaitFor[Single|Multiple]Objects). As such, Realtime should be avoided unless you know that the application will handle it correctly.
It works by changing the weight given to this process in the OS task scheduler. Your CPU can only execute one instruction at a time (to put it very, very simply) and the OS's job is to keep swapping instructions from each running process. By raising or lowering the priority, you're affecting how much time it's allotted in the CPU relative to other applications currently being multi-tasked.
