Execution time of a Java program - performance

First of all, here's just something I'm curious about.
I've made a little program which fills some templates with values, and I noticed that every time I run it the execution time changes slightly, ranging from 0.550 s to 0.600 s. My CPU runs at 2.9 GHz, if that is of any use.
The instructions are always the same, so is this something to do with physics, or something more software oriented?

It has to do partly with Java running on a virtual machine, but even a C program will take slightly more or less time from one run to the next. The operating system also decides when a program gets the resources (CPU time, memory, …) it needs to execute.

Related

Why do programs never execute in exactly the same time?

This is more of a generic, technical question. I'm just curious: what are the main factors that determine how fast or slow a computer program runs?
For example, when I time Python code, the runtime always varies by at least +/- 0.02 seconds.
There are many sources of execution time variance. Variation of ~200 ms looks plausible for a Python script that runs for seconds. The main contributors here are the OS/scheduler and memory/caches. The OS will serve interrupts on the core your script is running on, and on blocking system calls it will invoke the scheduler, which may run background tasks on that core. While those tasks run, they pollute the L1, L2 and L3 caches, so part of your Python script's data and code gets evicted to RAM. Memory references will therefore take a different amount of time on each run, because you can never reproduce the memory footprint of the background tasks that interrupted your script.
If you are running on Linux, you can try scheduling your script onto a CPU that has been isolated from the scheduler with the isolcpus= kernel boot option, so you get less noise from other processes. You'll see orders of magnitude less variation then, but some will remain, coming from shared resources: memory controllers, I/O buses, and the shared last-level cache.
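As a rough illustration in Python (a sketch with hypothetical names, using a Linux-only affinity call): pin the process to a single core so the scheduler cannot migrate it, then time the same work repeatedly and look at the spread. work() is just a stand-in for whatever your script does, and core 3 is assumed to be the one you isolated.

```python
import os
import statistics
import time

def work():
    # Stand-in for the real workload being timed.
    return sum(i * i for i in range(1_000_000))

# Pin this process to CPU 3 (ideally a core isolated via isolcpus=3).
# os.sched_setaffinity is available on Linux only.
os.sched_setaffinity(0, {3})

samples = []
for _ in range(20):
    start = time.perf_counter()
    work()
    samples.append(time.perf_counter() - start)

print(f"min={min(samples):.4f} s  median={statistics.median(samples):.4f} s  "
      f"max={max(samples):.4f} s")
```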

Set CPU affinity for profiling

I am working on a calculation-intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time a particular algorithm takes varies. For example, sometimes running the algorithm 100 times takes about 1100 ms, while another time the same 100 runs take much longer, 2000 or even 3000 ms. It may vary even within the same run. So it is impossible to measure improvement when I optimize a piece of code; it's just unreliable.
So basically I want to make sure one CPU is dedicated to my app. The PC has an old dual-core Intel E5300 CPU running Windows 7 32-bit, so I can't just set the process affinity and forget about one core forever; it would make the computer very slow for daily tasks. I need other apps to use a specific core while I profile, and then, when I'm done, the CPU affinities should go back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: is it possible to have a bat file that sets the process affinity for every process on Windows 7?
PS: The algorithm is correct and takes the same code path every time. I created an object pool, so after the first run zero memory is allocated; I also profiled memory allocation with dotTrace and it showed no allocations after the first run, so I don't believe the GC is triggered while the algorithm is working. Physical memory is available and the system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It produced more uniform results, especially when background apps like Visual Studio and dotTrace were running. Further investigation into the divergent execution times revealed that the root cause of the unpredictability was CPU overheating: the overheat alarm was off even though the temperature was over 100 °C! After fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is C#, so... the System.Diagnostics.Process object has a ProcessorAffinity property that should perform the same function.
I am just not sure this will stabilize the CPU times in quite the way you expect. A single busy task that is not doing I/O should remain scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are due more to other threads and processes interrupting your algorithm than to the OS randomly shunting your thread to a different core. Unless you set the affinity of everything else in the system to exclude your preferred core, I can't see this helping.
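Not a .bat file, but a hedged sketch of the same idea in Python using the third-party psutil package (which can set affinity on Windows as well as Linux): move every other process off one core and reserve it for the process being profiled. The PID and core number below are made-up example values.

```python
import psutil

TARGET_PID = 1234     # PID of the process being profiled (example value)
RESERVED_CORE = 1     # the core to dedicate to it
other_cores = [c for c in range(psutil.cpu_count()) if c != RESERVED_CORE]

for proc in psutil.process_iter():
    try:
        if proc.pid == TARGET_PID:
            proc.cpu_affinity([RESERVED_CORE])   # pin the profiled process to the reserved core
        else:
            proc.cpu_affinity(other_cores)       # keep every other process off that core
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass                                     # some system processes will refuse the change

# To undo afterwards, run the loop again passing list(range(psutil.cpu_count())).
```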

If a CPU is always executing instructions, how do we measure its work?

Let us say we have a fictitious single-core CPU with a program counter, a basic instruction set (Load, Store, Compare, Branch, Add, Mul), and some ROM and RAM. Upon switching on, it executes a program from ROM.
Would it be fair to say that the work the CPU does is based on the type of instruction it's executing? For example, a Mul operation would likely involve more transistors switching than, say, a Branch.
However, from an outside perspective, if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question: what does it mean for a program to "stop"? Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's utilization.
So while in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run an idle process that puts the CPU to sleep, throttles down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states: the oscillator is still producing cycles, but no instructions execute because the clock signal is gated off from the idle circuitry. Those cycles are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute the instructions that come next before your program (i.e. the IP, or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and has to flush them after a branch mispredict. So a better metric is to count the instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and only those cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
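For example, here is a toy worked instance of that formula in Python; the counter readings are invented, just to show the arithmetic.

```python
# Made-up counter values over some sampling window.
cpu_clk_unhalted_core = 3_200_000_000   # unhalted core cycles
inst_retired_any = 2_000_000_000        # instructions that actually retired

cpi = cpu_clk_unhalted_core / inst_retired_any
print(f"CPI = {cpi:.2f}")               # 1.60 cycles per retired instruction
```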
Trying to get a more specific metric, such as counting only the instructions that contribute to the solution of a matrix multiply and excluding instructions that don't directly contribute, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, the average number of vector elements processed per SIMD instruction (such as SSE or AVX) executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-queue length).
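A hedged illustration of that bookkeeping from a process's own point of view, in Python: wall-clock time covers the whole interval, while CPU time only accumulates while the OS actually has the process running on a core.

```python
import time

start_wall = time.perf_counter()
start_cpu = time.process_time()

sum(i * i for i in range(2_000_000))   # CPU-bound work: accumulates CPU time
time.sleep(1.0)                        # "waiting for an event": wall time passes, CPU time doesn't

print(f"wall time: {time.perf_counter() - start_wall:.2f} s")
print(f"CPU time:  {time.process_time() - start_cpu:.2f} s")   # roughly 1 s less
```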
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be stopped forcibly (by simply not running it anymore).

Can we time commands deterministically?

We know that in bash, time foo will tell us how long a command foo takes to execute. But there is so much variability, depending on unrelated factors including what else is running on the machine at the time. It seems like there should be some deterministic way of measuring how long a program takes to run. Number of processor cycles, perhaps? Number of pipeline stages?
Is there a way to do this, or if not, to at least get a more meaningful time measurement?
You've stumbled into a problem that's (much) harder than it appears. The performance of a program is absolutely connected to the current state of the machine in which it is running. This includes, but is not limited to:
The contents of all CPU caches.
The current contents of system memory, including any disk caching.
Any other processes running on the machine and the resources they're currently using.
The scheduling decisions the OS makes about where and when to run your program.
...the list goes on and on.
If you want a truly repeatable benchmark, you'll have to take explicit steps to control for all of the above. This means flushing caches, removing interference from other programs, and controlling how your job gets run. This isn't an easy task, by any means.
The good news is that, depending on what you're looking for, you might be able to get away with something less rigorous. If you run the job on your regular workload and it produces results in a good amount of time, then that might be all that you need.
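One practical, less rigorous approach, sketched in Python rather than bash: run the command several times and report the minimum, which corresponds to the run that suffered the least interference from everything listed above. "./foo" is just a placeholder for whatever command you are timing.

```python
import subprocess
import time

def time_once(cmd):
    # Time one run of an external command, including process startup.
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

runs = [time_once(["./foo"]) for _ in range(10)]   # "./foo" is a placeholder command
print(f"best of 10: {min(runs):.3f} s  (spread: {max(runs) - min(runs):.3f} s)")
```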

(Un)Limit CPU usage - Windows - Mathematica

I'm programming in Mathematica 8.
When I run my programme, I check with the Win8 Task Manager and the CPU usage is at 35% as soon as it starts to run, and memory usage also increases to 44%. Does Win8 limit the amount of CPU usage that a certain programme may have? I need to make my computer use more of its resources to run the programme faster.
Any help would be appreciated.
What's happening here is a common misconception about how processors approach problems involving heavy computation.
You may indeed have a powerful 4-core machine, and a program that is capable of using all 4 cores (which Mathematica definitely is!), but unless the code is written in a parallel fashion, only 1 core at a time will be used to do the calculations. As Mysticial mentioned in the comments, not all code is parallelizable; in fact, I'd say a great many problems are not inherently parallelizable.
Check here for some good examples of problems that can be split up well in a parallel fashion. Your memory usage, by the way, will simply increase with the size of the data being stored temporarily (e.g. storing a 69x69 matrix takes up less memory (RAM) than a 4000x4000 one); being parallel has little to do with this, and more with the problem itself.
Anyway, tl;dr, your computer is acting normally. To use all 100% of that 4-core machine, check out this Mathematica reference guide to parallel operations.
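The same point, sketched in Python rather than Mathematica, purely as an illustration of why unparallelized code tops out at one core: the serial loop below only ever keeps one core busy, while the explicitly parallel version spreads the same independent work across all of them.

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor

def heavy(n):
    # A CPU-bound task with no dependencies on the other tasks.
    return sum(math.sqrt(i) for i in range(n))

if __name__ == "__main__":
    jobs = [2_000_000] * 8

    start = time.perf_counter()
    serial = [heavy(n) for n in jobs]           # one core at a time (~25% on a 4-core machine)
    print(f"serial:   {time.perf_counter() - start:.2f} s")

    start = time.perf_counter()
    with ProcessPoolExecutor() as pool:         # one worker process per core by default
        parallel = list(pool.map(heavy, jobs))  # all cores busy at once
    print(f"parallel: {time.perf_counter() - start:.2f} s")
```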
