$time meaning when using parallel processing on a scientific cluster?

I'm running my finite-difference program on a scientific cluster at my school. The program uses Open MPI to parallelize the code.
When I time the program in serial I get:
real 78m40.592s
user 78m34.920s
sys 0m0.999s
When I run it with 8 MPI processes I get:
real 12m45.929s
user 101m9.271s
sys 0m29.735s
When I run it with 16 MPI processes I get:
real 4m46.936s
user 37m30.000s
sys 0m1.150s
So my question is: if the user time is the total CPU time, then why are the user times so different from each other for different numbers of processes?
Thanks,
Anthony G.

In serial, your code runs in 78m40s, and real and user are almost identical.
When you run with 8 processes, which I would assume are all running on the same machine (node), the total CPU time is 101m9s. That is much larger than the serial CPU time, so I would guess you have run into either overloading of the node or memory overconsumption. But since you are using 8 cores, the wall-clock time is roughly 101m9s / 8 = 12m45s. You could try rerunning that test and observing what happens.
When you run with 16 processes, which I would assume are dispatched across two nodes, the real time is 4m46s, which is approximately 78m40s / 16. But the user time reported is only the cumulated CPU time of the processes running on the same node as mpirun; the time command has no way of knowing about MPI processes running on other nodes, and 37m30s is approximately 78m40s / 2.
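If you want timings that do not depend on what the launch node's time command can see, each MPI rank can measure its own elapsed time. A minimal sketch, assuming a standard MPI installation (the finite-difference work is only a placeholder comment):

    /* Each rank times itself with MPI_Wtime; rank 0 also prints the sum of
       the per-rank elapsed times, a rough stand-in for total CPU time. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        /* ... the finite-difference work for this rank would go here ... */
        double elapsed = MPI_Wtime() - t0;

        double total = 0.0;  /* sum over all ranks, received only on rank 0 */
        MPI_Reduce(&elapsed, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("rank 0 elapsed %.2f s, summed over all ranks %.2f s\n",
                   elapsed, total);

        MPI_Finalize();
        return 0;
    }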

There are usually two different notions of time on a computer system.
Wall-clock time (let's call it T): this is the time that goes by on your watch while your program is executing.
CPU time (let's call it C): this is the cumulative time all the CPUs working on your program have spent executing your code.
For an ideal parallel code running on P CPUs, T = C/P. That means if you run the code on eight CPUs, the code is eight times faster, but the work has been distributed over eight CPUs, which all need to execute for C/P seconds/minutes.
In reality there is often overhead in the execution. With MPI, you have communication overhead. This usually causes a situation where T > C/P, and the higher T becomes, the less efficient the parallel code is.
An operating system like Linux can tell you more than just the wall-clock time. It usually reports user and sys time. User time is the CPU time (not exactly, but reasonably close for now) that the application spends in your code. Sys time is the time spent in the Linux kernel.
Cheers,
-michael
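To make the distinction between T and C concrete, here is a minimal C sketch, assuming a POSIX system, that measures both around the same piece of work (a sleep, which consumes wall-clock time but almost no CPU time):

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        struct timespec wall_start, wall_end;
        clock_t cpu_start, cpu_end;

        clock_gettime(CLOCK_MONOTONIC, &wall_start);  /* wall-clock time T */
        cpu_start = clock();                          /* this process's CPU time C */

        /* ... the real work would go here ... */
        sleep(1);  /* burns wall-clock time, almost no CPU time */

        cpu_end = clock();
        clock_gettime(CLOCK_MONOTONIC, &wall_end);

        double T = (wall_end.tv_sec - wall_start.tv_sec)
                 + (wall_end.tv_nsec - wall_start.tv_nsec) / 1e9;
        double C = (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC;

        printf("wall-clock time T = %.3f s, CPU time C = %.3f s\n", T, C);
        return 0;
    }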

Related

Why do programs never execute in exactly the same time?

This is more of a generic, technical question. I'm just curious: what are the main factors that determine how fast or slow a computer program runs?
For example, when I time Python code, the runtime always varies by at least +/- 0.02 seconds.
There are many reasons for execution-time variance. A variation of ~200 ms looks plausible for a Python script that runs for seconds. The main contributors here are the OS/scheduler and memory/caches. The OS will serve interrupts on the core your script is running on, and on blocking system calls it will run the scheduler, which may run background tasks on that core. While those tasks run, they pollute the L1, L2 and L3 caches, so some of your Python script's data and code gets evicted to RAM. Memory references will therefore take a different amount of time on each run, because you can never reproduce the memory footprint of the background tasks that interrupted your script.
If you are running on Linux, you can try scheduling your script onto a CPU that has been isolated from the scheduler with the isolcpus= kernel boot option, so that you get less noise from other processes. You will then see orders of magnitude less variation, but some will remain, coming from shared resources: memory controllers, I/O buses, and the shared last-level cache.
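Pinning the process to one of the isolated cores then has to be done explicitly, e.g. with taskset or from code. A minimal sketch, assuming GNU/Linux; the choice of CPU 3 is an arbitrary assumption and should be one of the cores listed in isolcpus=:

    #define _GNU_SOURCE          /* for CPU_SET / sched_setaffinity on glibc */
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);        /* CPU 3 is assumed to be an isolated core */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the timing-sensitive work here ... */
        printf("pinned to CPU 3\n");
        return 0;
    }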

Embarrassingly parallel execution, no speedup (MEEP, openMPI)

I've been trying to exploit parallelization to run some simulations with the MEEP simulation software a bit faster. By default the software only uses one CPU, and FDTD simulations are easily sped up by parallelization. In the end I found there was no difference between running on 1 or 4 cores; the simulation times were the same.
I then figured I would instead run individual simulations on each core to increase my total simulation throughput (for example running 4 different simulations at the same time).
What I found surprising is that whenever I start a new simulation, the already started simulations would slow down, even though they run on separate cores. For example, if I run only 1 simulation on 1 core, each time step of the FDTD simulation takes around 0.01 seconds. If I start another process on another core, each simulation now spends 0.02 seconds per time step, and so on, meaning that even when I run different simulations that have nothing to do with each other on separate cores, they all slow down giving me no net increase in speed.
I'm not necessarily looking for help to solve this problem as much as I'm looking for help understanding it, because it piqued my curiosity. Each instance of the simulation requires less than 1% of my total memory, so it's not a memory issue. The only thing I can think of is the cores sharing the cache memory, or the memory bandwidth being saturated. Is there any way to check whether this is the case?
The simulations are fairly simple, and I've run programs which are much more memory-hungry than this one and had great speedup with parallelization.
Any tips to help me understand this phenomenon?
I think it is better to look at bigger simulations, because the well-known issue with Turbo Boost-like technologies (single-core performance changing with the number of active threads) cannot explain your result; it would only explain it if you had a single-core processor.
So I think this can be explained by the memory cache levels. Maybe try simulations much bigger than the L3 cache (> 8 MB for an i7).
My test on an Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz, dual core (4 threads). All simulations were run with 1 MPI process (-np 1).
10 MB simulation:
Four simulations: 0.0255 s/step
Two simulations: 0.0145 s/step
One simulation: 0.0129 s/step
100 MB simulation:
Four simulations: 1.13 s/step
Two simulations: 0.61 s/step
One simulation: 0.53 s/step
A curious thing is that two simulations with 2 threads each run at almost the same speed as two simulations with 1 thread each.
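One way to probe whether shared memory bandwidth or the last-level cache is the limiting factor is to run several copies of a deliberately memory-bound loop and compare the per-copy times, much like the s/step comparison above. A rough sketch, not MEEP-specific; the 512 MB array size is an arbitrary choice, well beyond any L3 cache:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 Mi doubles = 512 MB, far bigger than L3 */

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = 1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        double sum = 0.0;
        for (size_t i = 0; i < N; i++) sum += a[i];   /* streams through memory */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("sum = %.0f, sweep took %.3f s\n", sum, secs);
        free(a);
        return 0;
    }

If one copy alone finishes a sweep noticeably faster than each of four copies running simultaneously, the cores are competing for memory bandwidth and shared cache rather than for CPU time.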

Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) = 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly-language operations, it is still not possible to calculate the computation time based on the number of instructions and the CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory accesses, you will likely have cache misses that go out to main memory, and possibly page faults that go all the way to disk, which is hard to predict.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
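As a rough illustration of that last point, here is a sketch, assuming a POSIX system, of splitting a loop's iterations across four forked processes. Combining the partial results (via pipes, shared memory, files, ...) is deliberately left out; each child just prints its share:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC 4
    #define TOTAL 100000000L        /* arbitrary amount of loop work */

    int main(void) {
        for (int p = 0; p < NPROC; p++) {
            pid_t pid = fork();
            if (pid == 0) {                       /* child p works on its slice */
                long start = p * (TOTAL / NPROC);
                long end   = (p + 1) * (TOTAL / NPROC);
                double sum = 0.0;
                for (long i = start; i < end; i++) sum += i * 0.5;
                printf("child %d: partial sum %.0f\n", p, sum);
                _exit(0);
            } else if (pid < 0) {
                perror("fork");
                return 1;
            }
        }
        for (int p = 0; p < NPROC; p++) wait(NULL);  /* parent waits for all children */
        return 0;
    }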

System performance analysis

I am trying to understand how performance can be measured using the time command in Unix systems. Let's say I run the time command for three different machines and get the following results:
A: 282u(user cpu time) 3S(system cpu time) 4:45(elapsed time) 99%
B: 238u 5S 4:13 98%
C: 302u 9S 5:11 97%
Which system will have the highest performance?
man time says that user time is how long your program spent on the CPU, and system time is the time spent in the kernel performing privileged operations, such as read and write I/O calls, on behalf of your program. User + system time is smallest for machine B (238u + 5s = 243s, versus 285s for A and 311s for C), so machine B gives the best performance of the three.
Elapsed time is the time measured by the wall clock, i.e. the time from when the process is spawned until it terminates; it does not by itself tell you anything about CPU usage.
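For reference, a program can query the same user/sys split that time reports, via getrusage. A small sketch, assuming a POSIX system; the busy loop is just a placeholder workload:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        volatile double x = 0.0;
        for (long i = 0; i < 50000000L; i++) x += i;   /* placeholder user-time work */

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        printf("user %ld.%06ld s   sys %ld.%06ld s\n",
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        return 0;
    }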

how would the number of parallel processes affect the performance of CPU?

I am writing a parallel merge sort program. I use fork() to do the parallel processing. I tried running 2 parallel processes, 4 processes, 8 processes, and so on. I found that the run with 2 processes required the least time to finish, i.e. it had the highest performance, which seems reasonable since my CPU is a Core 2 Duo. For 4, 8, 16 and 32 processes there seems to be a steady decline in performance, but after that the performance fluctuates (there doesn't seem to be a pattern). Can someone explain that?
Also, going by that pattern, I have a feeling that when the number of processes used by the program equals the number of cores my CPU has, the program has the highest performance, but I am not 100% sure. Can someone confirm this, or tell me what actually affects the performance of a parallel program?
Thanks in advance!!
With 2 cores, any number of processes greater than 2 will have to share the processor time. You will incur overhead from process switching, and you will never have more than two processes executing at any one time. It is better to have just two processes run uninterrupted on your two cores.
As to why you see a fluctuation in performance once you reach a large number of processes, my guess is that your OS is spending more time switching between the processes than actually doing the sorting work. The time it takes to switch tasks is an artifact of your OS's scheduler, the amount of memory used by the individual tasks, caching, potential use of swap space, and so on.
If you want to maximize the performance of parallel processes, the number of processes running concurrently should equal the number of processors times the number of cores per processor; in your case, two. With fewer, you have cores sitting idle doing nothing; with more, you have processes sitting idle waiting for time on a processor core.
3 processes should never be faster than 2 processes on a Core 2 Duo.
Also, forking only makes sense if you're doing CPU-expensive tasks:
Forking to print the message Hello world! twice is nonsense. The forking itself will consume more CPU-time than it could possibly save.
Forking to sort an array with 1,000,000 elements (if you use the proper sorting algorithm) will cut execution time roughly in half.
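To connect this back to the merge-sort question, here is a sketch, assuming a POSIX system with fork and an anonymous shared mmap, of a two-process sort suited to a Core 2 Duo: each child sorts one half of a shared array and the parent merges the two sorted halves. Using qsort for the halves is a simplification, not the asker's merge sort:

    #define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N 1000000

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        /* anonymous shared mapping so the children and the parent see the same array */
        int *a = mmap(NULL, N * sizeof *a, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED) return 1;
        for (int i = 0; i < N; i++) a[i] = rand();

        size_t half = N / 2;
        for (int p = 0; p < 2; p++) {
            if (fork() == 0) {                     /* each child sorts one half */
                qsort(a + p * half, p == 0 ? half : N - half, sizeof *a, cmp_int);
                _exit(0);
            }
        }
        wait(NULL);
        wait(NULL);

        /* the parent merges the two sorted halves into a temporary buffer */
        int *out = malloc(N * sizeof *out);
        if (!out) return 1;
        size_t i = 0, j = half, k = 0;
        while (i < half && j < N)
            out[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < half) out[k++] = a[i++];
        while (j < N)    out[k++] = a[j++];
        memcpy(a, out, N * sizeof *a);

        printf("first %d, last %d\n", a[0], a[N - 1]);
        free(out);
        munmap(a, N * sizeof *a);
        return 0;
    }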

Resources