Intel Parallel Studio timing inconsistencies - visual-studio-2010

I have some code that uses Intel TBB and I'm running it on a 32-core machine. In the code, I use
parallel_for(blocked_range<int>(2, left_image_width - 2, left_image_width / 32), ...)
to spawn 32 threads that do concurrent work; there are no race conditions, and each thread is hopefully given the same amount of work. I'm using clock_t to measure how long my program takes. For a certain image, it takes roughly 19 seconds to complete.
Then I ran my code through Intel Parallel Studio and it reported the code running in 2 seconds. This is the result I was expecting, but I can't figure out why there's such a large difference between the two. Does clock() sum the clock cycles on all the cores? Even then it doesn't make sense. Below is the snippet in question.
clock_t begin = clock();
create_threads_and_do_work();
clock_t end = clock();
double diffticks = end - begin;
double diffms = (diffticks * 1000) / CLOCKS_PER_SEC;
cout << "And the time is " << diffms << " ms" << endl;
Any advice would be appreciated.

It isn't quite clear whether the difference in run time is the result of two different inputs (images) or simply of two different run-time measuring methods (clock_t difference vs. the Intel software's measurement). Furthermore, you aren't showing us what goes on in create_threads_and_do_work(), and you didn't mention which tool within Intel Parallel Studio you are using; is it VTune?
Your clock_t difference will measure the time of the thread that called it (the main thread in your example), but it might not count the processing time of the threads spawned within create_threads_and_do_work(). Whether it does or doesn't depends on whether, inside that function, you wait for all threads to complete before returning, or simply spawn the threads and return immediately (before they finish processing). If all you do in the function is that parallel_for() (which blocks until the loop is done), then the clock_t difference should yield the right result and should be no different from other run-time measurements.
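As a sanity check, you could time the same call with a monotonic wall clock and compare it against the clock() result. A minimal sketch (create_threads_and_do_work() is stubbed out here, since its body isn't shown):
#include <chrono>
#include <iostream>

void create_threads_and_do_work()
{
    // Stub: in the real program this would run the tbb::parallel_for over the image.
}

int main()
{
    const auto begin = std::chrono::steady_clock::now();
    create_threads_and_do_work();
    const auto end = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(end - begin).count();
    std::cout << "Wall-clock time: " << ms << " ms" << std::endl;
}
If the wall-clock number also comes out around 19 seconds, the measured time is real elapsed time and the discrepancy lies in what the Intel tool actually measured (or in a different input).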
Within Intel Parallel Studio there is a profiling tool called VTune. It is a powerful tool, and when you run your program through it you can view (in a graphically pleasing way) the processing time (as well as the number of calls) of each function in your code. I'm pretty sure that after doing this you'll figure it out.
One last idea - did the program run to completion under the Intel software? I'm asking because sometimes VTune will collect data for some time and then stop without allowing the program to complete.

Related

Does context switching usually happen between calling a function and executing it?

I have been working on the source code of a complex application (written by hundreds of programmers) for a while now. Among other things, I have created some time-checking functions, along with suitable data structures, to measure the execution periods of different segments of the main loop and to run some analysis on these measurements.
Here is some pseudocode that helps explain it:
main()
{
    TimeSlicingSystem::AddTimeSlice(0);
    FunctionA();
    TimeSlicingSystem::AddTimeSlice(3);
    FunctionB();
    TimeSlicingSystem::AddTimeSlice(6);
    PrintTimeSlicingValues();
}

void FunctionA()
{
    TimeSlicingSystem::AddTimeSlice(1);
    //...
    TimeSlicingSystem::AddTimeSlice(2);
}

void FunctionB()
{
    TimeSlicingSystem::AddTimeSlice(4);
    //...
    TimeSlicingSystem::AddTimeSlice(5);
}

void PrintTimeSlicingValues()
{
    // Prints the difference between each slice and the slice before it,
    // starting from slice number 1.
}
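For reference, here is a minimal sketch of what a TimeSlicingSystem like this might look like on Windows (an assumption about its shape, not the project's actual implementation), using QueryPerformanceCounter so the slice deltas have sub-microsecond resolution:
#include <windows.h>
#include <cstdio>
#include <iterator>
#include <map>

// Hypothetical stand-in for the project's TimeSlicingSystem; slice ids are
// assumed to be recorded in increasing order, as in the pseudocode above.
namespace TimeSlicingSystem
{
    std::map<int, LARGE_INTEGER> slices;   // slice id -> raw counter value

    void AddTimeSlice(int id)
    {
        QueryPerformanceCounter(&slices[id]);
    }

    void PrintTimeSlicingValues()
    {
        if (slices.size() < 2)
            return;

        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);   // counts per second, fixed at boot

        // Print the difference between each slice and the slice before it.
        for (auto it = std::next(slices.begin()); it != slices.end(); ++it)
        {
            auto prev = std::prev(it);
            const double us = 1e6 * double(it->second.QuadPart - prev->second.QuadPart)
                                  / double(freq.QuadPart);
            std::printf("slice %d -> %d: %.3f us\n", prev->first, it->first, us);
        }
    }
}
Measured this way, a slice delta includes everything that happens between the two AddTimeSlice calls, including time during which the thread was preempted and not running at all.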
Most measurements were very reasonable; for instance, assigning a value to a local variable costs a fraction of a microsecond. Most functions execute from start to finish in a few microseconds and rarely ever reach one millisecond.
I then ran a few tests for half an hour or so and found some strange results that I couldn't quite understand. For certain function calls, the time from the moment of calling the function (the last line in the 'calling' code) to the first line inside the 'called' function can be very long, up to 30 milliseconds. That happens in a loop that would otherwise complete a full iteration in less than 8 milliseconds.
To picture that in the pseudocode I included: it is the period measured between slice 0 and slice 1, or between slice 3 and slice 4. That is the sort of period I am referring to: the measured time between calling a function and running the first line inside the called function.
Question A. Could this behavior be due to thread or process switching by the OS? Is calling a function a uniquely vulnerable spot for that? The OS I am working on is Windows 10.
Interestingly enough, the return path (from the last line of a function back to the first line after the call in the 'calling' code) never showed this problem at all (the periods from slice 2 to 3 or from 5 to 6 in the pseudocode)! Those measurements were always less than 5 microseconds.
Question B. Could this be, in any way, due to the time-measurement method I am using? Could switching between different cores give the illusion of slower-than-actual context switching due to clock differences between cores? (Although I have never found a single negative delta time so far, which seems to refute this hypothesis altogether.) Again, the OS I am working on is Windows 10.
My time-measuring function looks like this:
FORCEINLINE double Seconds()
{
    Windows::LARGE_INTEGER Cycles;
    Windows::QueryPerformanceCounter(&Cycles);
    // Add a big number to make bugs apparent where the return value is being passed to float.
    return Cycles.QuadPart * GetSecondsPerCycle() + 16777216.0;
}
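GetSecondsPerCycle() isn't shown; on Windows it is presumably just the cached reciprocal of QueryPerformanceFrequency(), roughly like this sketch (an assumption, not the actual engine code):
#include <windows.h>

// Presumed equivalent of GetSecondsPerCycle(): the QPC frequency is fixed at
// boot, so the conversion factor can be computed once and cached.
double GetSecondsPerCycle()
{
    static const double secondsPerCycle = []
    {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);   // ticks per second
        return 1.0 / double(freq.QuadPart);
    }();
    return secondsPerCycle;
}
If that is what it does, the conversion itself cannot explain 30 ms gaps; the counter value is read directly.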
Question A. Could this behavior be due to thread or process switching by the OS?
Yes. Thread switches can happen at any time (e.g. when a device sends an IRQ that causes a different, higher-priority thread to unblock and preempt your thread immediately), and this can and will cause unexpected time delays in your thread.
Is calling a function a uniquely vulnerable spot for that?
There's nothing particularly special about calling your own functions that makes them uniquely vulnerable. If the function involves the kernel's API, a thread switch is more likely, and some things (e.g. calling "sleep()") are almost guaranteed to cause a thread switch.
There's also a potential interaction with virtual memory management - often things (e.g. your executable file, your code, your data) are backed by "memory-mapped files", where accessing them for the first time may cause the OS to fetch the code or data from disk (and your thread can be blocked until the code or data it wanted arrives from disk); rarely used code or data can also be sent to swap space and need to be fetched back.
Question B. Could this be, in any way, due to the time-measurement method I am using?
In practice it's likely that Windows' QueryPerformanceCounter() is implemented with an RDTSC instruction (assuming an 80x86 CPU) and doesn't involve the kernel at all, and on modern hardware it's likely that this is monotonic. In theory Windows could emulate RDTSC and/or implement QueryPerformanceCounter() in another way to guard against security problems (timing side channels), as has been recommended by Intel for about 30 years now, but this is unlikely (modern operating systems, including but not limited to Windows, tend to care more about performance than security); and in theory your hardware/CPU could be so old (about 10+ years) that Windows has to implement QueryPerformanceCounter() a different way, or you could be using some other CPU (e.g. ARM and not 80x86).
In other words, it's unlikely (but not impossible) that the time-measurement method you're using is causing any timing problems.

Set CPU affinity for profiling

I am working on a calculation-intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time a particular algorithm takes varies. For example, sometimes running the algorithm 100 times takes about 1100 ms, and another time the same 100 runs take much longer, like 2000 or even 3000 ms. It may vary even within the same run. So it is impossible to measure improvement when I optimize a piece of code; it's just unreliable.
So basically I want to make sure one CPU core is dedicated to my app. The PC has an old dual-core Intel E5300 CPU running Windows 7 32-bit, so I can't just set the process affinity and forget about one core forever; it would make the computer very slow for daily tasks. I need other apps to use a specific core when I desire, and when I'm done profiling, the CPU affinities should come back to normal. Having a .bat file to do the task would be a fantastic solution.
My question is: is it possible to have a .bat file that sets the process affinity of every process on Windows 7?
PS: The algorithm is correct and runs the same code path every time. I created an object pool, so after the first run zero memory is allocated. I also profiled memory allocation with dotTrace and it showed no allocations after the first run, so I don't believe the GC is triggered while the algorithm is working. Physical memory is available and the system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It gave more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the overheat alarm was off while the temperature was over 100 °C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is about C#, so... the System.Diagnostics.Process object has a ProcessorAffinity property that should perform the same function.
I am just not sure this will stabilize the CPU times quite the way you expect. A single busy task that is not doing I/O should remain scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are due more to other threads/processes interrupting your algorithm than to the OS randomly shunting your thread to a different core - so unless you set the affinity of all other threads in the system to exclude your preferred core, I can't see this helping.
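If all you need is to pin the profiled process itself to one core for the duration of a run, the Win32 call mentioned above can be used directly from the application. A minimal sketch (the choice of core 0 is just an assumption about which core you want):
#include <windows.h>
#include <cstdio>

int main()
{
    // Pin the current (profiled) process to core 0 only; on a dual-core E5300
    // this leaves core 1 free for everything else.
    const DWORD_PTR mask = 0x1;   // bit i set => core i allowed

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
    {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("Process pinned to core 0\n");
    return 0;
}
Pinning every other process away from that core is the part that still needs a loop over all running processes (for example via a script using the ProcessorAffinity property mentioned above), which is what the batch-file part of the question is really after.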

Parallel code slower than serial code (value function iteration example)

I'm trying to make my code faster in Julia using parallelization. My code has nested serial for-loops and performs value function iteration (as described in http://www.parallelecon.com/vfi/).
The following link shows the serial and parallelized versions of the code I wrote:
https://github.com/minsuc/MyProject/blob/master/VFI_parallel.ipynb (You can find the functions defined in DefinitionPara.jl on the GitHub page too.) The serial code is defined as main() and the parallel code is defined as main_paral().
The third for-loop in main() is the step where I find the maximizer given (nCapital, nProductivity). As suggested in the official parallel documentation, I distribute the work over the nCapital grid, which consists of many points.
When I run @time on the serial and the parallel code, I get
Serial: 0.001041 seconds
Parallel: 0.004515 seconds
My questions are as follows:
1) I added two workers, and they work for 0.000714 seconds and 0.000640 seconds respectively, as you can see in the notebook. Is the parallel code slower because of this overhead cost?
2) I increased the number of grid points by changing
vGridCapital = collect(0.5*capitalSteadyState:0.000001:1.5*capitalSteadyState)
Even though each worker now does a significant amount of work, the serial code is still way faster than the parallel code. When I add more workers, the serial code is still faster. I think something is wrong, but I haven't been able to figure out what... Could it be related to the fact that I pass too many arguments to the parallelized function
final_shared(mValueFunctionNew, mPolicyFunction, pparams, vGridCapital, mOutput, expectedValueFunction)?
I will really appreciate your comments and suggestions!
If the amount of work between synchronizations is really small, the task-sync overhead may be too large. Remember that a common OS time-slicing quantum is 10 ms and you are measuring in the 1 ms range, so with a bit of load, 4 ms of latency to get all worker threads synced is perfectly reasonable.
If all tasks access the same shared data structure, and that structure is thread-safe, the locking overhead may well be the culprit, even with longer parallel tasks.
In some cases it may be possible to use non-thread-safe shared arrays for both input and output, but then it must be ensured that the workers don't clobber each other's results.
Depending on what exactly the worker threads are doing, for example if they write to the same array elements, it might be necessary to give each worker its own output array and merge them together at the end, but that doesn't seem to be the case with your task.
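The "own output array per worker, merge at the end" pattern is language-agnostic; just to make the structure concrete, here is a minimal sketch in C++ (not Julia, and not tied to the notebook's code) with placeholder work:
#include <thread>
#include <utility>
#include <vector>

// Pattern sketch: each worker fills its own output vector (no shared writes,
// no locks); the per-worker results are merged after all threads have joined.
int main()
{
    const int nWorkers = 4;
    const int n = 1000000;
    const int chunk = n / nWorkers;            // assume n divides evenly

    std::vector<std::vector<double>> partial(nWorkers);
    std::vector<std::thread> workers;

    for (int w = 0; w < nWorkers; ++w)
    {
        workers.emplace_back([&, w]
        {
            std::vector<double> local(chunk);
            for (int i = 0; i < chunk; ++i)
                local[i] = (w * chunk + i) * 0.5;   // placeholder work
            partial[w] = std::move(local);          // distinct slot per worker
        });
    }
    for (auto& t : workers)
        t.join();   // the synchronization point whose overhead is discussed above

    // Merge: concatenate the per-worker pieces in order.
    std::vector<double> result;
    result.reserve(n);
    for (auto& p : partial)
        result.insert(result.end(), p.begin(), p.end());
}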

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single-core CPU with a program counter and a basic instruction set (Load, Store, Compare, Branch, Add, Mul), plus some ROM and RAM. Upon switching on, it executes a program from ROM.
Would it be fair to say that the work the CPU does depends on the type of instruction it's executing? For example, a Mul operation would likely involve more transistors switching than, say, a Branch.
However, from an outside perspective, if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is: what does it mean for the program to "stop"? Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, the notion that a CPU is always executing instructions is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives a measure of how busy the CPU is.
In the old days of DOS, when the computer was running (almost) only a single task, it was indeed true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because the clock is gated off from certain circuitry. These cycles are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute instructions before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and has to flush those instructions after a branch mispredict. So a better metric is to count the instructions that actually complete. Instructions that complete are termed "retired".
So we should only count instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for "work" is CPI, or cycles per instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE counts cycles used to execute actual instructions (vs. those "wasted" in an idle state). INST_RETIRED.ANY counts instructions that complete (vs. those that don't, due to something like a branch mispredict).
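As a worked example with made-up counter values: if a sampling interval reports CPU_CLK_UNHALTED.CORE = 3.0e9 and INST_RETIRED.ANY = 1.5e9, then CPI = 3.0e9 / 1.5e9 = 2.0, i.e. on average two unhalted cycles were spent per retired instruction; a lower CPI generally means more useful work per cycle.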
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don't directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which is the average number of SIMD vector elements (e.g. SSE or AVX lanes) active per vector instruction executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your friendly local Intel developer site, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means that a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, it can only be stopped forcibly (by simply not running it anymore).

Runtime of GPU-based simulation unexplainable?

I am developing a GPU-based simulation using OpenGL and GLSL shaders, and I found that performance increases when I add additional (unnecessary) GL commands.
The simulation runs entirely on the GPU without any transfers and basically consists of a loop performing 2500 algorithmically identical time steps. I carefully implemented caching of GLSL uniform locations and removed all GL state queries (glGet* etc.) to maximize speed. To measure wall-clock time I put a glFinish after the main loop and take the elapsed time afterwards.
CASE A:
The normal total runtime for all iterations is 490 ms.
CASE B:
Now, if I add a single additional glGetUniformLocation(...) call at the end of EACH time step, it takes only 475 ms in total, which is 3 percent faster. (Please note that this is relevant to me since later I will perform a lot more time steps.)
I've looked at a timeline captured with NVIDIA Nsight and found that, in case A, all OpenGL commands are issued within the first 140 ms and the glFinish takes 348 ms until all GPU work completes. In case B, the issuing of OpenGL commands is spread out over a significantly longer time (410 ms) and the glFinish takes only 64 ms, yielding the faster 475 ms total.
I also noticed that the hardware command queue is much fuller with work packets most of the time in case B, whereas in case A there is only one item waiting most of the time (however, there are no visible idle times).
So my questions are:
Why is B faster?
Why are the command packages issued more uniformly to the hardware queue over time in case A?
How can speed be enhanced without adding additional commands?
I am using Visual C++ (VS2008) on Windows 7 x64.
IMHO this question cannot be answered definitively. For what it's worth, I experimentally determined that glFinish (and …SwapBuffers, for that matter) have weird runtime behavior. I'm currently developing my own VR rendering library, and prior to that I spent a significant amount of time profiling the timelines of OpenGL commands and their interaction with the graphics system. What I found was that the only consistent thing is that glFinish + …SwapBuffers have very inconsistent timing behavior.
What could happen is that this glGetUniformLocation call pulls the OpenGL driver into a "busy" state. If you call glFinish immediately afterwards, it may then use a different method of waiting for the GPU (for example, spinning in a while loop on a flag) than if you just call glFinish on its own (it may, for example, wait on a signal or a condition variable and thus be subject to the kernel's scheduling behavior).
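One way to take the driver's waiting strategy out of the measurement entirely (instead of relying on glFinish plus a wall clock) is a GPU timer query around the main loop. A rough sketch, assuming a GL 3.3+ context is current and a loader such as GLEW is initialized:
#include <GL/glew.h>

// Measures how long the GPU spends executing the queued simulation work,
// independent of when the driver flushes or how glFinish chooses to wait.
double MeasureGpuTimeMs(void (*runSimulationLoop)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    runSimulationLoop();                      // issue all 2500 time steps
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 nanoseconds = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &nanoseconds);   // blocks until the result is ready

    glDeleteQueries(1, &query);
    return nanoseconds * 1e-6;                // GPU time in milliseconds
}
Comparing that number between case A and case B should tell you whether the GPU work itself got faster, or whether only the CPU-side waiting changed.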
