System performance analysis - CPU

I am trying to understand how performance can be measured using the time command in Unix systems. Let's say I run the time command on three different machines and get the following results:
A: 282u(user cpu time) 3S(system cpu time) 4:45(elapsed time) 99%
B: 238u 5S 4:13 98%
C: 302u 9S 5:11 97%
Which system will have the highest performance?

man time says that user time is how long your program spent executing on the CPU, and system time is the time spent in the kernel performing privileged operations, such as read and write I/O calls, on behalf of your program. User + system time is smallest for machine B (238 + 5 = 243 s, versus 285 s for A and 311 s for C), and B also finishes in the shortest elapsed time (4:13), so machine B gives the best performance of the three.
Elapsed time is the wall-clock time, i.e. the time measured from when the process is spawned until it terminates; it tells you nothing directly about CPU usage.
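A minimal sketch in C that tabulates that comparison (the numbers are copied from the question, with elapsed times converted to seconds):

    #include <stdio.h>

    int main(void) {
        const char *name[] = {"A", "B", "C"};
        double user[]      = {282, 238, 302};                    /* user CPU seconds   */
        double sys[]       = {3, 5, 9};                          /* system CPU seconds */
        double elapsed[]   = {4*60 + 45, 4*60 + 13, 5*60 + 11};  /* wall-clock seconds */

        for (int i = 0; i < 3; i++)
            printf("%s: user+sys = %.0f s, elapsed = %.0f s\n",
                   name[i], user[i] + sys[i], elapsed[i]);
        return 0;
    }

Machine B comes out with both the least CPU time and the shortest elapsed time.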

Related

How to calculate CPU service time on Windows and Linux OS for a running application

How do I calculate CPU service time on Windows and Linux for a running application? I believe it can be calculated as the total running time of the application multiplied by the % CPU utilization, but I'm not sure. Also, what is CPU time, and how is CPU time different from service time?
The Windows Task Manager can show the CPU time (you might have to enable that column in the menu). On Linux, running the application under time gives you the CPU time after the application has finished, and top or htop can show it for a running application.
The CPU time is the time the CPU(s) spent processing the instructions of the application; for the duration of that CPU time, the application was using 100% of a CPU.
The usage of the CPU over a wall-clock interval would be (sum of all CPU times) / (wall-clock time), i.e. if 10 applications each have 0.1 s of CPU time within a 1 s window, the total utilization would be 100%.
CPU utilization for a given application would be (CPU time) / (wall-clock time) for a single CPU, or (CPU time) / (#CPUs * wall-clock time) if it uses multiple CPUs.
So yes, CPU time would be wall-clock time * %CPU utilization.
The difference between CPU time and service time (called wall-clock time above) is that service time is the time elapsed since the start of the application, while CPU time is the time during which it could/did actually use a CPU.
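If you want to measure both quantities from inside a program on Linux, rather than reading them off time or top, a minimal sketch (assuming a POSIX system with clock_gettime) looks like this:

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>

    static double seconds(struct timespec t) {
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    int main(void) {
        struct timespec wall0, wall1, cpu0, cpu1;

        clock_gettime(CLOCK_MONOTONIC, &wall0);          /* wall-clock (service) time */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);  /* CPU time of this process  */

        volatile double x = 0.0;                         /* stand-in for real work    */
        for (long i = 0; i < 100000000L; i++)
            x += i * 0.5;

        clock_gettime(CLOCK_MONOTONIC, &wall1);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

        double wall = seconds(wall1) - seconds(wall0);
        double cpu  = seconds(cpu1) - seconds(cpu0);
        printf("wall: %.3f s, cpu: %.3f s, utilization: %.0f%%\n",
               wall, cpu, 100.0 * cpu / wall);
        return 0;
    }

For this single-threaded busy loop the two times come out nearly equal; a program that sleeps or blocks on I/O would show a much lower ratio.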

What exactly is CPU load if instructions are executed one at a time?

I know this question has been asked many times in many different ways, but it's still not clear to me what the CPU load % means.
I'll start by explaining how I currently perceive the concepts (of course, I might be, and surely will be, wrong in places):
A CPU core can only execute one instruction at a time. It will not execute the next instruction until it finishes executing the current one.
Suppose your box has one single CPU with one single core. Parallel computing is hence not possible. Your OS's scheduler will pick up a process, set the IP to the entry point, and send that instruction to the CPU. It won't move to the next instruction until the CPU finishes executing the current instruction. After a certain amount of time it will switch to another process, and so on. But it will never switch to another process if the CPU is currently executing an instruction. It will wait until the CPU becomes free to switch to another process. Since you only have one single core, you can't have two processes executing simultaneously.
I/O is expensive. Whenever a process wants to read a file from the disk, it has to wait until the disk accomplishes its task, and the current process can't execute its next instruction until then. The CPU is not doing anything while the disk is working, and so our OS will switch to another process until the disk finishes its job in order not to waste time.
Following these principles, I've come myself to the conclusion that CPU load at a given time can only be one of the following two values:
0% - Idle. CPU is doing nothing at all.
100% - Busy. CPU is currently executing an instruction.
This is obviously false, as taskmgr reports CPU usage values of 1%, 12%, 15%, 50%, and so on.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core (as reported by taskmgr)? While that given process is executing, what happens with the 99%?
What does it mean that the overall CPU usage is 19% (as reported by Rainmeter at the moment)?
If you look in the Task Manager on Windows there is the System Idle Process, which does exactly that: it accounts for the cycles spent not doing anything useful. So yes, the CPU is always busy, but it might just be running in a loop waiting for useful work to arrive.
"Since you only have one single core, you can't have two processes executing simultaneously."
This is not really true. Yes, true parallelism is not possible with a single core, but you can create the illusion of it with preemptive multitasking. Yes, it is impossible to interrupt an instruction, but that is not a problem, because most instructions take a tiny amount of time to finish. The OS divides CPU time into time slices, which are significantly longer than the execution time of a single instruction.
"What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core?"
Most of the time applications are not doing anything useful. Think of an application that waits for the user to click a button before it starts processing something. This app doesn't need the CPU, so it sleeps most of the time; whenever it gets a time slice it just goes back to sleep (see the event loop in Windows: GetMessage is blocking, which means the thread sleeps until a message arrives). So what does CPU load really mean? When the app does receive events or data to work on, it performs operations instead of sleeping, and if it utilizes X% of the CPU that means that over the sampling period it used X% of the available CPU time. CPU usage is an average metric.
PS: To summarize the concept of CPU load, think of speed in physics. There are instantaneous and average speeds, and likewise there are instantaneous and average measurements of CPU load. The instantaneous value is always either 0% or 100%, because at any given point in time a process either uses the CPU or it doesn't. If a process used 100% of the CPU for 250 ms and then didn't use it for the next 750 ms, we can say that the process loaded the CPU to 25% over a sampling period of 1 second (an average measurement only makes sense with a stated sampling period).
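A tiny sketch of that 250 ms / 750 ms example, assuming Linux/POSIX (the timings are only approximate):

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <time.h>

    static double seconds(struct timespec t) {
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    int main(void) {
        struct timespec wall0, wall1, cpu0, cpu1, now;
        clock_gettime(CLOCK_MONOTONIC, &wall0);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);

        /* Busy-spin for roughly 250 ms: instantaneous load is 100%. */
        do {
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while (seconds(now) - seconds(wall0) < 0.25);

        /* Sleep for 750 ms: instantaneous load is 0%, no CPU is used while blocked. */
        struct timespec pause = {0, 750000000L};
        nanosleep(&pause, NULL);

        clock_gettime(CLOCK_MONOTONIC, &wall1);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

        double wall = seconds(wall1) - seconds(wall0);
        double cpu  = seconds(cpu1) - seconds(cpu0);
        printf("average load over a %.2f s sampling period: ~%.0f%%\n",
               wall, 100.0 * cpu / wall);   /* roughly 25% */
        return 0;
    }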
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
This is basically what CPU load is. "Cars" are processes using a slice of CPU time ("crossing the bridge") or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.
Also see: http://en.wikipedia.org/wiki/Load_(computing)
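On Linux you can read those load averages, along with the current count of runnable tasks, straight from /proc/loadavg. A minimal sketch, assuming a Linux system:

    #include <stdio.h>

    int main(void) {
        double one, five, fifteen;   /* 1-, 5- and 15-minute load averages */
        int running, total;          /* runnable tasks / total tasks       */

        FILE *f = fopen("/proc/loadavg", "r");
        if (!f) {
            perror("fopen /proc/loadavg");
            return 1;
        }
        if (fscanf(f, "%lf %lf %lf %d/%d",
                   &one, &five, &fifteen, &running, &total) != 5) {
            fclose(f);
            return 1;
        }
        fclose(f);

        printf("load average: %.2f %.2f %.2f (%d of %d tasks runnable)\n",
               one, five, fifteen, running, total);
        return 0;
    }

These are the same numbers that uptime and top report.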

$time meaning when using parallel processing on a scientific cluster?

I'm running my finite-difference program on a scientific cluster at my school. The program uses openmpi to parallelize the code.
When I time the program in serial I get:
real 78m40.592s
user 78m34.920s
sys 0m0.999s
When I run it with 8 mpi processors I get:
real 12m45.929s
user 101m9.271s
sys 0m29.735s
When I run it with 16 mpi processors I get:
real 4m46.936s
user 37m30.000s
sys 0m1.150s
So my question is: if the user time is the total CPU time, then why are the user times so different from each other for different numbers of processors?
Thanks,
Anthony G.
In serial, your code runs in 78m40s and real and user are almost identical.
When you run with 8 processes, which I assume all run on the same machine (node), the total CPU time is 101m9s. That is much larger; I would guess that you hit either overloading of the node or memory over-consumption. But as you are using 8 cores, the wall-clock time is roughly 101m9s / 8 = 12m45s. You could try to rerun that test and observe what happens.
When you run with 16 processes, which I assume are dispatched over two nodes, the real time is 4m46s, which is approximately 78m40s / 16. But the user time reported is only the cumulative CPU time of the processes running on the same node as mpirun; the time command has no way of knowing about MPI processes running on other nodes. That is why 37m30s is approximately 78m40s / 2.
There are usually two different notions of time on a computer system.
Wall-clock time (let's call it T): This is the time that goes by on your watch, while your program is executing.
CPU time (let's call it C): that's the cumulative time all the CPUs working on your program have spent executing your code.
For an ideal parallel code running on P CPUs, T=C/P. That means, if you run the code on eight CPUs, the code is eight times faster, but the work has been distributed to eight CPUs, which all need to execute for C/P seconds/minutes.
In reality, there's often overhead in the execution. With MPI, you've got communication overhead. This usually causes a situation where T > C/P. The higher T becomes, the less efficient the parallel code is.
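As a rough illustration (a sketch only, using the real times quoted in the question), you can put numbers on that overhead by computing speedup and parallel efficiency:

    #include <stdio.h>

    int main(void) {
        /* Wall-clock times from the question, converted to seconds. */
        double t1  = 78 * 60 + 40.592;   /* serial           */
        double t8  = 12 * 60 + 45.929;   /* 8 MPI processes  */
        double t16 =  4 * 60 + 46.936;   /* 16 MPI processes */

        int    procs[] = {8, 16};
        double times[] = {t8, t16};

        for (int i = 0; i < 2; i++) {
            double speedup    = t1 / times[i];
            double efficiency = speedup / procs[i];  /* 1.0 corresponds to the ideal T = C/P */
            printf("P = %2d: speedup %.1fx, efficiency %.0f%%\n",
                   procs[i], speedup, 100.0 * efficiency);
        }
        return 0;
    }

Efficiencies slightly above 100% can show up in practice, e.g. from cache effects or from noise (or overload) in the baseline run, rather than from truly super-linear scaling.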
An operating system like Linux can tell you more than just the wall-clock time. It usually reports user and sys time. User time is the CPU time (not exactly, but reasonably close for now) that the application spends in your code. Sys time is the time spent in the Linux kernel.
Cheers,
-michael

Is CPU time relevant to Hyperthreading?

Is increased CPU time (as reported by the time CLI command) indicative of inefficiency when hyperthreading is used (e.g. time spent in spinlocks or on cache misses), or is it possible that the CPU time is inflated by the odd nature of HT (e.g. the real cores being busy so HT can't kick in)?
I have a quad-core i7, and I'm testing a trivially-parallelizable part (image-to-palette remapping) of an OpenMP program, with no locks and no critical sections. All threads access a bit of read-only shared memory (a look-up table), but write only to their own memory.
cores real CPU
1: 5.8 5.8
2: 3.7 5.9
3: 3.1 6.1
4: 2.9 6.8
5: 2.8 7.6
6: 2.7 8.2
7: 2.6 9.0
8: 2.5 9.7
I'm concerned that the amount of CPU time used increases rapidly as the number of cores exceeds 1 or 2.
I imagine that in an ideal scenario CPU time wouldn't increase much (same amount of work just gets distributed over multiple cores).
Does this mean there's 40% of overhead spent on parallelizing the program?
It's quite possibly an artefact of how CPU time is measured. A trivial example: if you run a 100 MHz CPU and a 3 GHz CPU for one second each, each will report that it ran for one second. The second CPU might do 30 times more work, but it still takes one second.
With hyperthreading, a reasonable (if not quite accurate) model would be that one core can run either one task at, let's say, 2000 MHz, or two tasks at, let's say, 1200 MHz. Running two tasks, it does only 60% of the work per thread, but 120% of the work for both threads together, a 20% improvement. But if the OS asks how many seconds of CPU time were used, the first case will report "1 second" after each second of real time, while the second will report "2 seconds".
So the reported CPU time goes up. As long as it less than doubles, overall performance has improved.
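To put numbers on that, here is a quick sketch that computes wall-clock speedup and CPU-time inflation from the table in the question:

    #include <stdio.h>

    int main(void) {
        /* real and CPU seconds for 1..8 threads, copied from the question. */
        double real[] = {5.8, 3.7, 3.1, 2.9, 2.8, 2.7, 2.6, 2.5};
        double cpu[]  = {5.8, 5.9, 6.1, 6.8, 7.6, 8.2, 9.0, 9.7};

        for (int n = 1; n <= 8; n++)
            printf("%d thread(s): speedup %.2fx, CPU time x%.2f\n",
                   n, real[0] / real[n - 1], cpu[n - 1] / cpu[0]);
        return 0;
    }

At 8 threads the CPU time grows by about 1.7x while the wall-clock speedup is about 2.3x: the CPU time less than doubles, and throughput still improves, consistent with the model above.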
Quick question - are you running the genuine time program /usr/bin/time, or the built in bash command of the same name? I'm not sure that matters, they look very similar.
Looking at your table of numbers I sense that the processed data set (i.e. the input plus all the output data) is reasonably large overall (bigger than the L2 cache), and that the processing per data item is not that lengthy.
The numbers show a nearly linear improvement from 1 to 2 cores, but that is tailing off significantly by the time you're using 4 cores. The hyperthreaded cores are adding virtually nothing. This means that something shared is being contended for. Your program has free-running threads, so that something can only be memory (L3 cache and main memory on the i7).
This sounds like a typical example of being I/O bound rather than compute bound, the I/O in this case being to/from L3 cache and main memory. L2 cache is 256k, so I'm guessing that the size of your input data plus one set of results and all intermediate arrays is bigger than 256k.
Am I near the mark?
Generally speaking, when considering how many threads to use, you have to take shared cache, memory speeds, and data set sizes into account. That can be a right bugger, because you have to work it out at run time, which is a lot of programming effort (unless your hardware configuration is fixed).
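For reference, a hypothetical sketch of the kind of loop the question describes (a shared read-only look-up table, each thread writing only to its own slice of the output); this is the sort of pattern that can become memory-bound once the working set outgrows the per-core caches. The names and sizes below are made up for illustration, and the code needs to be built with OpenMP enabled (e.g. -fopenmp):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Remap each pixel through a shared, read-only palette table; each thread
     * writes only to its own portion of the output buffer, so there are no
     * locks or critical sections. */
    static void remap_to_palette(const uint8_t *pixels, uint8_t *out,
                                 size_t n, const uint8_t lut[256]) {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            out[i] = lut[pixels[i]];
    }

    int main(void) {
        size_t n = 64 * 1024 * 1024;            /* 64 MB "image", larger than L2/L3 */
        uint8_t *pixels = malloc(n), *out = malloc(n), lut[256];
        if (!pixels || !out) return 1;

        for (int i = 0; i < 256; i++) lut[i] = (uint8_t)(255 - i);
        for (size_t i = 0; i < n; i++) pixels[i] = (uint8_t)i;

        remap_to_palette(pixels, out, n, lut);
        printf("out[0] = %d\n", out[0]);        /* keep the work from being optimized away */

        free(pixels);
        free(out);
        return 0;
    }

Each iteration does very little arithmetic per byte moved, so once the image no longer fits in cache the threads mostly wait on the same memory, which would match the tailing-off seen in the table.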

User CPU time vs System CPU time?

Could you explain more about "user CPU time" and "system CPU time"? I have read a lot, but I couldn't understand it well.
The difference is whether the time is spent in user space or kernel space. User CPU time is time spent on the processor running your program's code (or code in libraries); system CPU time is the time spent running code in the operating system kernel on behalf of your program.
User CPU Time: Amount of time the processor worked on the specific program.
System CPU Time: Amount of time the processor worked on operating system's functions connected to that specific program.
The term ‘user CPU time’ can be a bit misleading at first. To be clear, the total time (real CPU time) is the combination of the amount of time the CPU spends performing some action for a program and the amount of time the CPU spends performing system calls for the kernel on the program’s behalf. When a program loops through an array, it is accumulating user CPU time. Conversely, when a program executes a system call such as exec or fork, it is accumulating system CPU time.
Based on Wikipedia:
User time is the amount of time the CPU was busy executing code in user space.
System time is the amount of time the CPU was busy executing code in kernel space. If this value is reported for a thread or process, then it represents the amount of time the kernel was doing work on behalf of the executing context, for example, after a thread issued a system call.
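A minimal sketch (assuming Linux/POSIX) that makes the split visible with getrusage(): the arithmetic loop below accumulates user CPU time, while the repeated write() system calls accumulate system CPU time.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void) {
        /* User-space work: pure computation, no kernel involvement. */
        volatile double x = 0.0;
        for (long i = 0; i < 200000000L; i++)
            x += i * 0.5;

        /* Kernel work: lots of small write() system calls to /dev/null. */
        int fd = open("/dev/null", O_WRONLY);
        if (fd != -1) {
            char buf[64] = {0};
            for (int i = 0; i < 1000000; i++)
                write(fd, buf, sizeof buf);
            close(fd);
        }

        /* getrusage() reports the two components separately, like time(1) does. */
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("user: %ld.%06ld s   sys: %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        return 0;
    }

Running the same program under time should show roughly the same user/sys split.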
