I have been given a project to find all the primes below N. The test case's timeout condition is [TestMethod, Timeout(1000)], which means the algorithm must finish in under 1 second.
When I ran the program on an i5 processor it passed, but when I ran it on an i3 the test cases failed with a timeout error.
Does the runtime of an algorithm depend on the processor?
What factors affect the runtime of an algorithm?
Does the runtime of an algorithm depend on the processor?
Of course execution time depends on the processor. That's why companies like Intel spend vast amounts of money on producing faster and faster processors.
For example: if an algorithm consists of 1 million operations and a CPU can perform 1 million operations per second, executing that algorithm takes 1 second. On a processor that performs 10 million operations per second, the same algorithm takes only 0.1 s. That's a very simplified example, since not every instruction takes the same time and there are more factors involved (cache, memory bandwidth, special instructions like SSE), but it illustrates the general point: a faster processor executes the same code in less time. That's the whole point of faster processors.
Note: that's totally unrelated to time complexity.
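As a rough illustration of that simplified model, here is a back-of-envelope sketch. The operation counts and per-second rates below are made up for illustration, not measured values from any real i3 or i5:

    # Hypothetical estimate: time = operations / operations-per-second.
    # Real CPUs don't execute a fixed number of "operations per second"
    # independent of the code, so treat this purely as a toy model.
    operations = 50_000_000                # assumed work to find all primes below N
    slower_cpu_ops_per_sec = 40_000_000    # assumed effective rate of the slower machine
    faster_cpu_ops_per_sec = 80_000_000    # assumed effective rate of the faster machine

    print(operations / slower_cpu_ops_per_sec)   # 1.25 s  -> exceeds the 1 s timeout
    print(operations / faster_cpu_ops_per_sec)   # 0.625 s -> passes

The same code and the same complexity, but one machine lands on the wrong side of the 1-second limit.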
I am trying to find out how long it takes to execute 10,000 RISC instructions, each 4 bytes long, on a 2 GHz processor versus a 4 GHz processor. I only need the very basics of a formula.
I have tried 10,000 x 4 = 40,000, then 40,000 / (2x10^9) and 40,000 / (4x10^9).
There isn't a correct way to calculate this. There are a number of dependencies and complexities:
What types of instructions are included? Cycle counts can vary from 1 cycle to 20-30 cycles per instruction. How many of these instructions can be dispatched at once?
What is the memory access pattern, and how is the CPU's memory access designed? How effective will caching/prefetching be (and does the CPU support them)?
Are there many branches? How predictable are those branches, and how many are within the critical portion of the code? What is the cost of a mispredict?
and more.
Fundamentally, the question you are asking isn't easily answerable and depends entirely on the code to be run.
Generally speaking, code execution does not scale linearly with clock speed, so for anything non-trivial it is unlikely that a 4 GHz processor will be exactly twice as fast as a 2 GHz processor.
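To make the first point concrete, here is a hedged sketch of the naive formula the question is reaching for, time = instructions x cycles-per-instruction / clock rate. The CPI values are illustrative assumptions, not numbers for any real core, and the 4-byte instruction size does not appear in this simple formula at all:

    # Naive model: time = instructions * cycles_per_instruction / clock_hz.
    # CPI values below are assumptions for illustration only.
    instructions = 10_000

    for clock_hz in (2e9, 4e9):
        for cpi in (1, 4, 20):   # e.g. simple ALU ops vs. loads that miss in cache
            seconds = instructions * cpi / clock_hz
            print(f"{clock_hz/1e9:.0f} GHz, CPI {cpi:2d}: {seconds*1e6:8.2f} microseconds")

The estimate swings by a factor of 20 depending on the assumed CPI, which is exactly why the instruction count and clock rate alone don't determine the answer.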
I read a paper in which the parallel cost of (parallel) algorithms is defined as C_p(n) = p * T_p(n), where p is the number of processors, T_p(n) the parallel running time, and n the input size. An algorithm is cost-optimal if C_p(n) is approximately constant, i.e. if the algorithm uses two processors instead of one on the same input, it takes only half the time.
There is another concept called parallel work, which I don't fully grasp.
The paper says it measures the number of executed parallel operations (OPs).
An algorithm is work-optimal if it performs (asymptotically) as many OPs as its sequential counterpart. A cost-optimal algorithm is always work-optimal, but not vice versa.
Can someone illustrate the concept of parallel work and show the similarities and differences to parallel cost?
It sounds like parallel work is simply a measure of the total number of instructions run by all processes in parallel, but counting the ones that run in parallel only once. If that's the case, then it's more closely related to the time term in your parallel cost equation. Think of it this way: if the parallel version of the algorithm runs more instructions than the sequential version (meaning it is not work-optimal), it will necessarily take more time, assuming all instructions are equal in duration. Typically these extra instructions appear at the beginning or end of the parallel algorithm and are viewed as parallelization overhead; they can correspond to extra bookkeeping, communication, or final aggregation of the result.
Thus an algorithm that is not work-optimal cannot be cost-optimal.
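As a concrete, textbook-style illustration of the gap in the other direction (work-optimal but not cost-optimal), using the definitions from your question: summing n numbers with a tree reduction on p = n/2 processors. The numbers here are symbolic choices for illustration, not taken from the paper:

    import math

    n = 1024                     # input size (assumed for illustration)
    p = n // 2                   # one processor per pair in the first round

    sequential_ops = n - 1       # a plain sequential sum
    parallel_time  = math.ceil(math.log2(n))    # T_p(n): rounds of pairwise additions
    parallel_work  = n - 1       # total additions across all processors
    parallel_cost  = p * parallel_time          # C_p(n) = p * T_p(n)

    print(parallel_work, sequential_ops)   # 1023 1023 -> work-optimal
    print(parallel_cost, sequential_ops)   # 5120 1023 -> cost grows with p: not cost-optimal

The total work equals the sequential operation count, so the algorithm is work-optimal, but the cost is on the order of n log n versus n, so it is not cost-optimal, matching the "not vice versa" remark in the question.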
Another way to call the "parallel cost" is "cost of context switching" although it can also arise from interdependencies between the different threads.
Consider sorting.
If you implement Bubble Sort in parallel, where each thread just picks up the next comparison, you will pay a huge cost to run it "in parallel", to the point where it becomes essentially a scrambled sequential version of the algorithm, and your parallel work will be essentially zero because most threads just wait most of the time.
Now compare that to Quick Sort, implementing a thread for each split of the original array: the threads don't need data from other threads, and asymptotically, for larger starting arrays, the cost of spinning up these threads is paid for by the parallel nature of the work done... if the system has infinite memory bandwidth. In reality it isn't worth spinning up more threads than there are memory access channels, because the threads still have an invisible (from the code's perspective) dependency between them: they share sequential access to memory.
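A minimal sketch of that Quick Sort idea, with one worker per split. Python's ProcessPoolExecutor is used here as a stand-in for threads so the halves actually run in parallel; the single-level split and the small input are simplifications, not part of the answer above:

    from concurrent.futures import ProcessPoolExecutor

    def quicksort(a):
        # ordinary sequential quicksort, run inside each worker
        if len(a) <= 1:
            return a
        pivot = a[len(a) // 2]
        left  = [x for x in a if x < pivot]
        mid   = [x for x in a if x == pivot]
        right = [x for x in a if x > pivot]
        return quicksort(left) + mid + quicksort(right)

    def parallel_sort(a):
        # one split of the original array; each half is sorted by its own worker
        if len(a) <= 1:
            return a
        pivot = a[len(a) // 2]
        left  = [x for x in a if x < pivot]
        mid   = [x for x in a if x == pivot]
        right = [x for x in a if x > pivot]
        with ProcessPoolExecutor(max_workers=2) as pool:
            left_f  = pool.submit(quicksort, left)
            right_f = pool.submit(quicksort, right)
            return left_f.result() + mid + right_f.result()

    if __name__ == "__main__":
        print(parallel_sort([5, 3, 8, 1, 9, 2, 7]))   # [1, 2, 3, 5, 7, 8, 9]

The worker start-up and the copying of the sublists are exactly the kind of overhead described above; the split only pays off for large arrays.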
Short
I think parallel cost and parallel work are two sides of the same coin. They're both measures of speed-up, whereby the latter is the theoretical concept enabling the former.
Long
Let's consider n-dimensional vector addition as a problem that is easy to parallelize, since it can be broken down into n independent tasks.
The problem is inherently work-optimal, because the parallel work doesn't change when the algorithm runs in parallel: there are always n vector components that need to be added.
The parallel cost, on the other hand, cannot be assessed without executing the algorithm on a (virtual) machine, where practical limitations like limited memory bandwidth arise. Thus a work-optimal algorithm can only be cost-optimal if the hardware (or the hardware access pattern) allows the problem, and the time, to be divided and executed perfectly.
Cost-optimality is a stronger demand and, as I'm realizing now, just another illustration of efficiency.
Under normal circumstances a cost-optimal algorithm will also be work-optimal, but if the speed-up gained by caching, memory access patterns etc. is super-linear, i.e. the execution time with two processors is one-tenth of the original instead of the expected half, it is possible for an algorithm that performs more work, and thus is not work-optimal, to still be cost-optimal.
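A hedged sketch of the vector-addition example: the n additions are independent, so the index range can simply be split across workers. The use of multiprocessing.Pool and the chunking scheme are implementation choices made here for illustration, not something prescribed by the definitions above:

    from multiprocessing import Pool

    def add_chunk(args):
        # add one contiguous slice of the two input vectors
        x_chunk, y_chunk = args
        return [a + b for a, b in zip(x_chunk, y_chunk)]

    def parallel_vector_add(x, y, workers=4):
        n = len(x)
        step = (n + workers - 1) // workers
        chunks = [(x[i:i + step], y[i:i + step]) for i in range(0, n, step)]
        with Pool(workers) as pool:
            parts = pool.map(add_chunk, chunks)
        return [v for part in parts for v in part]

    if __name__ == "__main__":
        x, y = list(range(8)), list(range(8, 16))
        print(parallel_vector_add(x, y))   # [8, 10, 12, ..., 22]

The work is exactly n additions regardless of the number of workers, so the parallel version stays work-optimal; whether it is cost-optimal depends on how much of the memory bandwidth and process start-up overhead the real machine lets you hide.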
Is it possible to calculate the computing time of a process based on the number of operations it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) ≈ 208,333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language, since you mention a for loop. Depending on your compiler's optimization level, you could therefore see varying computation times depending on what kinds of optimizations are done.
But even if you are talking about assembly-language instructions, it is still not possible to calculate the computation time from the number of instructions and the CPU speed alone. Some instructions take multiple CPU cycles. If you do a lot of memory access, you will likely have cache misses and have to fetch data from main memory (or even from disk, if it has been paged out), which is slow and unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
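As a hedged sketch of that last paragraph, using Python's multiprocessing module rather than a literal fork call and a made-up loop body: this divides a loop's range across four processes and measures the actual speedup, which will typically come out below the theoretical 4x:

    import time
    from multiprocessing import Pool

    def partial_sum(bounds):
        # stand-in workload: each process handles one slice of the loop range
        lo, hi = bounds
        total = 0
        for i in range(lo, hi):
            total += i * i
        return total

    if __name__ == "__main__":
        n = 10_000_000

        t0 = time.perf_counter()
        serial = partial_sum((0, n))
        t_serial = time.perf_counter() - t0

        ranges = [(k * n // 4, (k + 1) * n // 4) for k in range(4)]
        t0 = time.perf_counter()
        with Pool(4) as pool:
            parallel = sum(pool.map(partial_sum, ranges))
        t_parallel = time.perf_counter() - t0

        assert serial == parallel
        print(f"speedup: {t_serial / t_parallel:.2f}x (theoretical maximum 4x)")

The gap between the measured number and 4x is the process start-up, scheduling, and result-combining overhead mentioned above.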
I have obtained the running time of my program (CPU time) in seconds for different input sizes, and I want to build a chart showing the measured CPU time alongside the theoretical running time (for example O(n³)).
How should I scale CPU time?
CPU time and BigO notation are two different things, so they should not be compared.
The idea behind BigO notation is that it gives you a platform-independent way of gauging the efficiency of an algorithm. The same algorithm will run in different times on different CPUs depending on their speed, load, etc., so raw CPU time is not a good metric. The question with any algorithm is whether it's written as efficiently as possible.
This is where BigO notation comes in. It describes how the number of operations required to accomplish a task grows with the input size, and as such is platform-independent. If you reduce the number of operations required to complete a task, then, all else remaining equal, you will improve your efficiency, regardless of platform.
Therefore, count the number of operations the program performs as a function of input size and derive the BigO complexity from that. Then you can compare your measurements against the theoretical curve.
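One practical way to do that comparison, sketched here with made-up measurements: compute t(n) / n³ for each data point; if the ratio levels off to a roughly constant c, the measured times are growing like n³, and you can plot c * n³ next to them as the theoretical curve.

    # Hypothetical (n, CPU time) pairs; replace with your own measurements.
    measurements = [(100, 0.021), (200, 0.17), (400, 1.4), (800, 11.0)]

    ratios = [t / n**3 for n, t in measurements]
    print(ratios)          # roughly constant -> consistent with O(n^3) growth

    c = ratios[-1]         # estimate the constant from the largest input
    theoretical = [(n, c * n**3) for n, _ in measurements]
    print(theoretical)     # plot these points alongside the measured times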
A theoretical question, maybe it is obvious:
Is it possible that an algorithm, after being implemented in a parallel way with N threads, will execute more than N times faster than the original, single-threaded algorithm? In other words, can the gain be better than linear in the number of threads?
It's not common, but it most assuredly is possible.
Consider, for example, building a software pipeline where each step does a fairly small amount of calculation but requires enough static data to approximately fill the entire data cache, and where each step uses different static data.
In a case like this, serial calculation on a single processor will normally be limited primarily by the bandwidth to main memory. Assuming you have (at least) as many processors/cores (each with its own data cache) as pipeline steps, you can load each data cache once, and process one packet of data after another, retaining the same static data for all of them. Now your calculation can proceed at the processor's speed instead of being limited by the bandwidth to main memory, so the speed improvement could easily be 10 times greater than the number of threads.
Theoretically, you could accomplish the same with a single processor that just had a really huge cache. From a practical viewpoint, however, the selection of processors and cache sizes is fairly limited, so if you want to use more cache you need to use more processors -- and the way most systems provide to accomplish this is with multiple threads.
Yes.
I saw an algorithm for moving a robot arm through complicated maneuvers that was basically to divide into N threads, and have each thread move more or less randomly through the solution space. (It wasn't a practical algorithm.) The statistics clearly showed a superlinear speedup over one thread. Apparently the probability of hitting a solution over time rose fairly fast and then leveled out some, so the advantage was in having a lot of initial attempts.
Amdahl's law (parallelization) tells us this is not possible for the general case. At best we can perfectly divide the work by N. The reason for this is that given no serial portion, Amdahl's formula for speedup becomes:
Speedup = 1/(1/N)
where N is the number of processors. This of course reduces to just N.
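For reference, a small sketch of the full formula behind that reduction (the serial fractions used are arbitrary examples): Amdahl's speedup is 1 / (s + (1 - s)/N) for serial fraction s, which collapses to 1/(1/N) = N when s = 0 and stays below N otherwise.

    def amdahl_speedup(serial_fraction, n_processors):
        # Amdahl's law: the serial fraction s cannot be sped up;
        # the remaining (1 - s) divides across N processors.
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

    for s in (0.0, 0.05, 0.5):              # example serial fractions (assumed)
        print(s, amdahl_speedup(s, 8))      # 8.0, ~5.93, ~1.78 on 8 processors

Under this model the speedup can never exceed N; the superlinear effects described in the other answers come from factors, such as cache size and memory bandwidth, that the model deliberately ignores.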