Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.

No, it is not possible to calculate the computing time based just on the number of operations. First of all, since you mention a for loop, it sounds like you are counting operations in some higher-level programming language. Depending on your compiler and its optimization level, the same loop can be translated into very different machine code, and the computation time will vary accordingly.
But even if you are counting assembly-language instructions, it is still not possible to calculate the computation time from the instruction count and CPU speed alone. Some instructions take multiple CPU cycles. If your code accesses a lot of memory, you will likely have cache misses and have to fetch data from main memory (or even from disk, if paging occurs), which makes the timing unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As for running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program; a for loop on its own is a sequential operation. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can get is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes cooperate and produce a correct result.
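As a rough illustration of that idea, here is a minimal sketch that splits a loop's iterations across four processes with fork() and collects partial results through a shared anonymous mapping. The work() function, the range N, and the reduction are placeholders invented for this example, not taken from the question:

```c
/* Hypothetical sketch: splitting a loop across 4 processes with fork().
   work() and N are stand-ins for whatever the real loop does. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 4
#define N 1000000000L

static double work(long i) { return (double)i * 1e-9; } /* placeholder loop body */

int main(void) {
    /* Shared array, visible to parent and children after fork(). */
    double *partial = mmap(NULL, NPROC * sizeof(double),
                           PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (partial == MAP_FAILED) { perror("mmap"); return 1; }

    for (int p = 0; p < NPROC; p++) {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: handle one quarter of the range */
            long begin = p * (N / NPROC);
            long end = (p == NPROC - 1) ? N : begin + N / NPROC;
            double sum = 0.0;
            for (long i = begin; i < end; i++)
                sum += work(i);
            partial[p] = sum;
            _exit(0);
        }
    }
    for (int p = 0; p < NPROC; p++)
        wait(NULL);                        /* parent: wait for all children */

    double total = 0.0;
    for (int p = 0; p < NPROC; p++)
        total += partial[p];
    printf("total = %f\n", total);
    return 0;
}
```

Even with a perfectly even split like this, the speedup is limited by the cost of creating the processes and by shared resources such as memory bandwidth, which is why the 4x figure is only an upper bound.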

Related

Calculating execution time

I am trying to find out how long it takes to execute 10,000 RISC instructions of 4 bytes each on a 2 GHz processor and on a 4 GHz processor. I only need the very basics of a formula.
I have tried 10,000 x 4 = 40,000, then 40,000 / (2x10^9) and 40,000 / (4x10^9).
There isn't a correct way to calculate this. There are a number of dependencies and complexities:
What type of instructions are included? Instruction cycle counts can vary from 1 cycle to 20-30 cycles per instruction. How many of these instructions can be dispatched at once?
What is the memory access pattern and how is the CPU's memory access designed? How effective will caching/pre-fetching be (and does the CPU support them)?
Are there many branches? How predictable are those branches, and how many are within the critical portion of the code? What is the cost of a mispredict?
and more.
Fundamentally, the question you are asking isn't easily solvable and depends entirely on the code to be run.
Generally speaking, code execution does not scale linearly, so it is unlikely that, for anything non-trivial, a 4 GHz processor will be twice as fast as a 2 GHz processor.
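For what it's worth, the first-order textbook estimate, which ignores everything listed above, is time = instruction count x CPI / clock rate. A sketch with a purely assumed CPI of 1:

```c
/* Naive first-order execution-time estimate: time = instructions * CPI / clock_hz.
   The CPI value of 1.0 is an assumption; real values depend on the points above. */
#include <stdio.h>

int main(void) {
    double instructions = 10000.0;
    double cpi = 1.0;                 /* assumed cycles per instruction */
    double clocks[] = { 2e9, 4e9 };   /* 2 GHz and 4 GHz */

    for (int i = 0; i < 2; i++) {
        double seconds = instructions * cpi / clocks[i];
        printf("%.1f GHz: %.2e s\n", clocks[i] / 1e9, seconds);
    }
    return 0;
}
```

Note that the 4-byte instruction size does not appear in this estimate; it affects instruction-fetch traffic, not the cycle count directly.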

What's the difference between parallel cost and parallel work?

I read a paper in which parallel cost for (parallel) algorithms is defined as C_P(n) = p * T_P(n), where p is the number of processors, T_P(n) the parallel running time and n the input size. An algorithm is cost-optimal if C_P(n) is approximately constant in p, i.e. if the algorithm uses two processors instead of one on the same input, it takes only half the time.
There is another concept called parallel work, which I don't fully grasp.
The paper says it measures the number of executed parallel OPs.
An algorithm is work-optimal if it performs (asymptotically) as many OPs as its sequential counterpart. A cost-optimal algorithm is always work-optimal, but not vice versa.
Can someone illustrate the concept of parallel work and show the similarities and differences to parallel cost?
It sounds like parallel work is simply a measure of the total number of instructions run by all processes in parallel, but counting the ones executed in parallel only once. If that's the case, then it's more closely related to the time term in your parallel cost equation. Think of it this way: if the parallel version of the algorithm runs more instructions than the sequential version (meaning it is not work-optimal), it will necessarily take more time, assuming all instructions are equal in duration. Typically these extra instructions appear at the beginning or end of the parallel algorithm and are viewed as overhead of the parallel algorithm. They can correspond to extra bookkeeping, communication, or final aggregation of the result.
Thus an algorithm that is not work-optimal cannot be cost-optimal.
Another way to think of the "parallel cost" is as the cost of context switching, although it can also arise from interdependencies between the different threads.
Consider sorting.
If you implement Bubble Sort in parallel, where each thread just picks up the next comparison, you will have a huge cost to run it in "parallel", to the point where it is essentially a scrambled sequential version of the algorithm, and your parallel work will be essentially zero because most threads just wait most of the time.
Now compare that to Quick Sort, with a thread for each split of the original array: the threads don't need data from other threads, and asymptotically, for larger starting arrays, the cost of spinning up these threads is paid for by the parallel nature of the work done... if the system has infinite memory bandwidth. In reality it wouldn't be worth spinning up more threads than there are memory access channels, because the threads still have an invisible (from the code's perspective) dependency between them through their shared access to memory.
Short
I think parallel cost and parallel work are two sides of the same coin. They're both measures of speed-up,
whereby the latter is the theoretical concept enabling the former.
Long
Let's consider n-dimensional vector addition as a problem that is easy to parallelize, since it can be broken down into n independent tasks.
The problem is inherently work-optimal, because the parallel work doesn't change when the algorithm runs in parallel: there are always n vector components that need to be added.
Considering the parallel cost cannot be done without executing the algorithm on a (virtual) machine, where practical limitations like limited memory bandwidth arise. Thus a work-optimal algorithm can only be cost-optimal if the hardware (or the hardware access pattern) allows the problem, and the running time, to be divided and executed perfectly.
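To make that concrete, here is a minimal sketch (my own illustration, not taken from any paper) that splits an n-dimensional vector addition across a fixed number of threads; the threads together perform exactly the same n additions as the sequential loop:

```c
/* Illustrative sketch: work-optimal parallel vector addition c = a + b.
   Thread count and vector length are arbitrary choices for the example. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double a[N], b[N], c[N];

struct range { long begin, end; };

static void *add_range(void *arg) {
    struct range *r = arg;
    for (long i = r->begin; i < r->end; i++)
        c[i] = a[i] + b[i];          /* each component is an independent task */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    pthread_t threads[NTHREADS];
    struct range ranges[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].begin = t * (N / NTHREADS);
        ranges[t].end = (t == NTHREADS - 1) ? N : ranges[t].begin + N / NTHREADS;
        pthread_create(&threads[t], NULL, add_range, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3*(N-1) */
    return 0;
}
```

The algorithm is work-optimal because in total the threads execute the same n additions as the sequential loop; whether it is also cost-optimal depends on whether each thread really achieves the n/p running time on the actual hardware, e.g. whether memory bandwidth keeps up with p threads.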
Cost-optimality is a stronger demand and, as I'm realizing now, just another illustration of efficiency.
Under normal circumstances a cost-optimal algorithm will also be work-optimal,
but if the speed-up gained by caching, memory access patterns etc. is super-linear,
i.e. the execution time with two processors is one-tenth instead of the expected half, it is possible for an algorithm that performs more work, and thus is not work-optimal, to still be cost-optimal.

Estimate parallel efficiency using unicore processor

We know that the parallel efficiency of a program running on a multicore system can be calculated as speedup / N, where N is the number of cores. So in order to use this formula, we first need to execute the code on a multicore system and measure the speedup.
I would like to know: if I don't have a multicore system, is it possible to estimate the speedup of the given code on a multicore system just by executing it on a unicore processor?
I have access to performance counters (instructions per cycle, number of cache misses, number of instructions, etc.) and I only have binaries of the code.
[Note: I estimated parallel_running_time (T_P) = serial_running_time / N, but this estimate has unacceptable error.]
Thanks
Read up on Amdahl's Law, especially the bit about parallelization.
For you to determine how much you can speed up your program, you have to know what parts of the program can benefit from parallelization and what parts must be executed sequentially. If you know that, and if you know how long the serial and the parallel parts take (individually) on a single processor, then you can estimate how fast the program will be on multiple processors.
From your description, it seems that you don't know which parts can make use of parallel processing and which parts have to be executed sequentially. So it won't be possible to estimate the parallel running time.
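For completeness, if you did know the parallelizable fraction of the single-core running time (something the counters you list will not tell you directly), the Amdahl-style estimate would look roughly like this; the timing and the fraction below are made-up values:

```c
/* Amdahl-style estimate from a single-core measurement.
   T1 is the measured single-core time; f is the fraction of that time
   spent in parallelizable code (assumed here; in practice you would
   have to determine it by profiling the program). */
#include <stdio.h>

int main(void) {
    double T1 = 120.0;   /* measured single-core running time, seconds (assumed) */
    double f  = 0.80;    /* parallelizable fraction of T1 (assumed) */

    for (int n = 1; n <= 8; n *= 2) {
        double tn = T1 * ((1.0 - f) + f / n);   /* estimated time on n cores */
        double speedup = T1 / tn;
        double efficiency = speedup / n;
        printf("N=%d: time=%6.1f s  speedup=%4.2f  efficiency=%4.2f\n",
               n, tn, speedup, efficiency);
    }
    return 0;
}
```

The efficiency column is the speedup / N figure from your question; note how it degrades as N grows even in this idealized model.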

How to measure interference between processes

In parallel systems every process has an impact on the other processes, because they all compete for scarce resources like CPU caches, memory, disk I/O, network, etc.
What method is best suited for measuring interference between processes? For example, processes A and B each access the disk heavily, so running them in parallel will probably be slower than running them sequentially (the sum of their individual runtimes), because the bottleneck is the hard drive.
If I don't know exactly the behaviour of a process (disk-, memory- or CPU-intensive), what method would be best to analyse that?
Should I measure individual runtimes and compare the relative share of each parallel process?
For example, process A runs 30 s on average when alone, 45 s when running 100% in parallel with B, 35 s when 20% in parallel, etc.?
Would it be better to compare several indicators like L1 & LLC cache misses, page faults, etc.?
What you need to do is first determine what the limiting factor is for each of the individual programs. If you run a CPU-bound and an IO-bound process at the same time, they will have very little impact on each other. If you run two IO-bound processes at the same time, there will be a lot of contention.
I wrote a rather detailed answer about how to interpret the output of "time [command]" to see what the limiting factor is. It's here: What caused my elapsed time much longer than user time?
Once you have the output from timing your programs, you can determine which are likely to step on one another and which are not.
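As a rough illustration of the same idea in code (my own sketch, not from the linked answer), you can compare wall-clock time against CPU time for a process: if the CPU time is close to the elapsed time, the process is CPU-bound; if it is much smaller, the process is mostly waiting on I/O or being preempted. The command and the 0.8 threshold below are arbitrary placeholders:

```c
/* Sketch: run a command and compare elapsed (wall-clock) time with the
   CPU time it consumed, similar to what `time` reports as real/user/sys. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    pid_t pid = fork();
    if (pid == 0) {
        execlp("sleep", "sleep", "1", (char *)NULL);  /* placeholder workload */
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);
    clock_gettime(CLOCK_MONOTONIC, &end);

    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru);   /* user + sys CPU time of the child */

    double real = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    double cpu  = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
                + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

    printf("real %.2f s, cpu %.2f s -> %s\n", real, cpu,
           cpu / real > 0.8 ? "CPU-bound" : "waiting (I/O or contention)");
    return 0;
}
```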

Cost/benefit of parallelization based on code size?

How do you figure out whether it's worth parallelizing a particular code block based on its code size? Is the following calculation correct?
Assume:
Thread pool consisting of one thread per CPU.
CPU-bound code block with execution time of X milliseconds.
Y = min(number of CPUs, number of concurrent requests)
Therefore:
Cost: code complexity, potential bugs
Benefit: (X * Y) milliseconds
My conclusion is that it isn't worth parallelizing for small values of X or Y, where "small" depends on how responsive your requests must be.
One thing that will help you figure that out is Amdahl's Law.
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x.
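Plugging those numbers into Amdahl's formula, speedup(N) = 1 / (0.05 + 0.95/N): with 4 processors the speedup is about 3.5x, with 100 processors about 16.8x, and even as N goes to infinity it only approaches 1/0.05 = 20x.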
Figure out what speed-up you want to achieve and how much parallelism you can actually exploit, then see if it's worth it.
It depends on many factors, such as the difficulty of parallelizing the code, the speedup obtained from it (there are overhead costs in dividing the problem and joining the results) and the amount of time that the code spends there (Amdahl's Law).
Well, the benefit is really more:
(X * (Y-1)) * Tc * Pf
where Tc is a factor for the cost of the threading framework you are using. No threading framework scales perfectly, so using 2x threads will likely give, at best, 1.9x the speed.
Pf is some factor for parallelization that depends entirely on the algorithm (i.e. whether or not you'll need to lock, which will slow the process down).
Also, it's Y-1, since single-threaded is basically the Y == 1 case.
As for deciding, it's also a matter of user frustration and expectation: if the user is annoyed at waiting for something, speeding it up has a greater benefit than speeding up a task the user doesn't really mind waiting for. That is not always just about wait times; it's partly about expectations.
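For example, with purely hypothetical numbers plugged into that expression (X = 200 ms of CPU-bound work, Y = 4, Tc = 0.95 for framework scaling loss, Pf = 0.9 for locking overhead), the estimated benefit is 200 * (4 - 1) * 0.95 * 0.9 ≈ 513 ms saved, rather than the ideal 600 ms.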
