Calculating execution time - processor

I am trying to find out how long it takes to execute 10,000 RISC instructions, each 4 bytes long, on one processor running at 2 GHz and another running at 4 GHz. I only need the very basics of a formula.
I have tried 10,000 x 4 = 40,000, then 40,000 / (2x10^9) and 40,000 / (4x10^9).

There isn't a correct way to calculate this. There are a number of dependencies and complexities:
What type of instructions are included? Instruction cycle counts can vary from 1 cycle to 20-30 cycles per instruction. How many of these instructions can be dispatched at once?
What is the memory access pattern, and how is the CPU's memory access designed? How effective will caching/prefetching be (and does the CPU support them)?
Are there many branches? How predictable are those branches, and how many are within the critical portion of the code? What is the cost of a mispredict?
and more.
Fundamentally the question you are asking isn't easily solvable and absolutely depends on the code to be run.
Generally speaking, code execution does not scale linearly, so for anything non-trivial it is unlikely that a 4 GHz processor will be twice as fast as a 2 GHz processor.
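That said, if all you want is the textbook first-order formula, it is time = instructions x CPI / clock rate; the instruction size in bytes does not enter into it. A minimal sketch in C, assuming an idealised CPI of 1 (a big assumption, per the caveats above):

    #include <stdio.h>

    int main(void) {
        double instructions = 10000.0; /* instruction count from the question */
        double cpi = 1.0;              /* ASSUMED cycles per instruction; real code varies widely */

        /* naive first-order estimate: time = instructions * CPI / clock frequency */
        printf("2 GHz: %g s\n", instructions * cpi / 2e9); /* prints 5e-06 */
        printf("4 GHz: %g s\n", instructions * cpi / 4e9); /* prints 2.5e-06 */
        return 0;
    }

Note that your 10,000 x 4 = 40,000 step divides a byte count by a clock frequency, which doesn't yield a meaningful time; the 4-byte instruction width matters for fetch bandwidth, not for this formula.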

Related

Does the runtime of an algorithm depend on the processor?

I have been given a project for finding all the primes below N. For this, the test case timeout condition is [TestMethod, Timeout(1000)], which means our algorithm should execute in under 1 second.
When I ran the same program on an i5 processor it passed, but when I ran it on an i3 the test cases failed because of a timeout error.
Does the runtime of an algorithm depend on the processor?
What are the factors that affect the run time of an algorithm?
Does the runtime of an algorithm depend on the processor?
Of course execution time depends on the processor. That's why companies like Intel spend vast amounts of money producing faster and faster processors.
For example: if an algorithm consists of 1 million operations and a CPU can perform 1 million operations per second, executing that algorithm takes 1 second. On a processor that performs 10 million operations per second, the same algorithm takes only 0.1 s. That's a very simplified example, since not every instruction takes the same time and there are more factors involved (cache, memory bandwidth, special instructions like SSE), but it illustrates the general point: a faster processor executes the same code in less time. That's the point of faster processors.
Note: this is completely unrelated to time complexity, which describes how runtime grows with input size, not how fast a given machine executes the code.
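To see this on your own machines, here is a minimal C sketch; the sieve of Eratosthenes and N = 10,000,000 are assumptions (your actual project code and N will differ). Run the same binary on the i5 and the i3: the prime count is identical, the elapsed time is not.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const int N = 10000000;            /* hypothetical N; the question doesn't give one */
        char *composite = calloc(N, 1);    /* zero-initialised: everything starts "prime" */
        if (!composite) return 1;

        clock_t start = clock();
        int count = 0;
        for (int i = 2; i < N; i++) {
            if (composite[i]) continue;
            count++;
            /* mark multiples of i, starting at i*i (long long avoids overflow) */
            for (long long j = (long long)i * i; j < N; j += i)
                composite[j] = 1;
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("%d primes below %d in %.3f s\n", count, N, secs);
        free(composite);
        return 0;
    }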

Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations that it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208,333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly-language operations, it is still not possible to calculate the computation time from the number of instructions and the CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory access, you will likely have cache misses and have to load data from main memory (or, if the system is paging, even from disk), which is unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program; a for loop is a sequential operation on its own. To run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can get is 4x, but in most cases it is impossible to achieve that theoretical maximum. How close you get depends on how well you balance the work between the four processes and how much overhead is needed to make sure the parallel processes work together to produce a correct result.
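For illustration, a minimal sketch of that fork-and-divide pattern, assuming a made-up workload (the sum-of-residues loop is a stand-in for a real loop body) and a pipe to collect the partial results:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define NCHILD 4
    #define N 400000000ULL          /* hypothetical total iteration count */

    int main(void) {
        int fds[2];
        if (pipe(fds) != 0) { perror("pipe"); return 1; }

        for (int i = 0; i < NCHILD; i++) {
            pid_t pid = fork();
            if (pid == 0) {          /* child: process one quarter of the range */
                unsigned long long lo = (N / NCHILD) * i;
                unsigned long long hi = (N / NCHILD) * (i + 1);
                unsigned long long sum = 0;
                for (unsigned long long k = lo; k < hi; k++)
                    sum += k % 7;    /* stand-in for the real per-iteration work */
                write(fds[1], &sum, sizeof sum);  /* 8-byte write: atomic on a pipe */
                _exit(0);
            }
        }

        unsigned long long total = 0, part;
        for (int i = 0; i < NCHILD; i++) {   /* parent: collect the partial sums */
            read(fds[0], &part, sizeof part);
            total += part;
        }
        while (wait(NULL) > 0) {}            /* reap all children */
        printf("total = %llu\n", total);
        return 0;
    }

Even in this embarrassingly parallel case, the fork/pipe/wait plumbing is overhead the single-threaded version doesn't pay, which is one reason the theoretical 4x is rarely reached.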

Count clock cycles from assembly source code?

I have the source code written and I want to measure efficiency as how many clock cycles it takes to complete a particular task. Where can I learn how many clock cycles different commands take? Does every command take the same amount of time on 8086?
RDTSC is the x86 instruction that fetches the high-resolution time-stamp counter.
Bear in mind that cache misses, context switches, instruction reordering and pipelining, and multicore contention can all interfere with the results.
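A minimal usage sketch, assuming GCC or Clang on x86, where RDTSC is exposed as the __rdtsc() intrinsic in x86intrin.h; per the caveats above, treat the result as a noisy estimate, and note that on modern CPUs the counter ticks at a fixed reference rate rather than the instantaneous core clock:

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 */

    int main(void) {
        unsigned long long start = __rdtsc();

        volatile unsigned long long sum = 0;  /* volatile: keep the loop from being optimised away */
        for (int i = 0; i < 1000000; i++)
            sum += i;

        unsigned long long end = __rdtsc();
        printf("elapsed: %llu reference cycles\n", end - start);
        return 0;
    }
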
Clock cycles and efficiency are not the same thing.
For efficiency of code you need to consider, in particular, how the memory is utilised, especially the differing levels of the cache. Also important is the branch-prediction behaviour of the code, etc. You want a profiler that tells you these things, ideally one that gives you chip-specific information; an example is CodeAnalyst for AMD chips.
To answer your question: particular base instructions do have a given (average) cycle count (AMD releases approximate numbers for the basic maths functions in its maths library). These numbers are a poor place to start optimising code, however.

Cycles/byte calculations

In crypto communities it is common to measure algorithm performance in cycles/byte. My question is: which parameters of the CPU architecture affect this number? Except the clock speed, of course :)
Two important factors are:
the ISA of the CPU, or more specifically how closely CPU instructions map to the operations that you need to perform. If you can perform a given operation in one instruction on one CPU but it requires 3 instructions on another CPU, then the first CPU will probably be faster. If the CPU has specific crypto instructions, or extensions such as SIMD which can be leveraged, then so much the better.
the instruction issue rate of the CPU, i.e. how many instructions can be issued per clock cycle
Here are some CPU features that can impact cycles/byte:
depth of the pipeline
number of integer units and/or FPUs able to work in parallel
size of the cache memories
algorithms for branch prediction
algorithms for handling cache misses
Moreover, you may be interested in the general problem of assessing WCET (worst-case execution time).
Mainly:
Memory bus bandwidth
CPU instructions per cycle
How much memory the CPU can access per second can be a limiting factor. That depends on the algorithm and on how large a part of the work is memory access. Which parts of memory are accessed also affects how well the memory cache works.
Nowadays instruction timings are not measured in how many cycles an instruction takes, but in how many instructions can be executed in the same cycle. The CPU's front end lines up several instructions to be executed in parallel, so it depends on how many parallel pipelines the CPU has and how well the code can be parallelised. Generally, a lot of conditional branching in an algorithm makes it harder to parallelise.
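To make the metric concrete, here is a minimal sketch that measures cycles/byte for a made-up one-pass transform over a 1 MiB buffer (a stand-in for a real cipher), using the __rdtsc() intrinsic and subject to the same measurement caveats discussed earlier:

    #include <stdio.h>
    #include <string.h>
    #include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 */

    #define BUF_SIZE (1 << 20)   /* 1 MiB test buffer */

    static unsigned char buf[BUF_SIZE];

    int main(void) {
        memset(buf, 0xAB, sizeof buf);

        unsigned long long start = __rdtsc();

        /* toy stand-in for a crypto transform: one sequential pass over the buffer */
        unsigned char x = 0;
        for (size_t i = 0; i < sizeof buf; i++) {
            x ^= buf[i];
            buf[i] = (unsigned char)(x + 0x5C);
        }

        unsigned long long cycles = __rdtsc() - start;
        printf("%.2f cycles/byte\n", (double)cycles / BUF_SIZE);
        return 0;
    }
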

Cost/benefit of parallelization based on code size?

How do you figure out whether it's worth parallelizing a particular code block based on its code size? Is the following calculation correct?
Assume:
Thread pool consisting of one thread per CPU.
CPU-bound code block with execution time of X milliseconds.
Y = min(number of CPUs, number of concurrent requests)
Therefore:
Cost: code complexity, potential bugs
Benefit: (X * Y) milliseconds
My conclusion is that it isn't worth parallelizing for small values of X or Y, where "small" depends on how responsive your requests must be.
One thing that will help you figure that out is Amdahl's Law
The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimal execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x.
Figure out what speedup you want to achieve and how much parallelism you can actually achieve, then see if it's worth it.
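As a concrete illustration of the quoted law, a minimal sketch that evaluates the Amdahl speedup bound for the 95%-parallel example at various processor counts:

    #include <stdio.h>

    /* Amdahl's Law: speedup(n) = 1 / ((1 - p) + p / n),
       where p is the parallelizable fraction and n the processor count. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        double p = 0.95;   /* the 95% figure from the example above */
        int counts[] = {2, 4, 16, 256, 65536};
        for (int i = 0; i < 5; i++)
            printf("n = %6d -> speedup %.2fx\n", counts[i], amdahl(p, counts[i]));
        /* as n grows, the speedup approaches but never reaches 1 / (1 - 0.95) = 20x */
        return 0;
    }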
It depends on many factors, such as the difficulty of parallelizing the code, the speedup obtained from it (there are overhead costs in dividing the problem up and joining the results), and the fraction of time the code spends there (Amdahl's Law).
Well, the benefit is really more:
(X * (Y-1)) * Tc * Pf
where Tc is the cost of the threading framework you are using. No threading framework scales perfectly, so using 2x threads will likely give, at best, about 1.9x the speed.
Pf is some factor for parallelization that depends completely on the algorithm (i.e. whether or not you'll need to lock, which will slow the process down).
Also, it's Y-1, since the single-threaded case is basically Y == 1.
As for deciding, it's also a matter of user frustration/expectation: if the user is annoyed at waiting for something, speeding it up has a greater benefit than speeding up a task the user doesn't really mind waiting for. That's not always just about wait times; it's partly about expectations.
