Calculating with (M)IPS (Instructions Per Second) or other - performance

I have some very heavy code to develop and want to make some calculations beforehand.
Now I'm trying to make a very rough estimate with MIPS, but can not find anything about what MIPS actually stands for. Is an instruction a single bitwise operation/comparison in MIPS?

The best thing you can do is to run your algorithm on some (much) smaller set N. If you can estimate the complexity of your algorithms, you can then estimate how fast it will run for the full dataset.
MIPS is not a good way to go; in most algorithms, CPU spends more than half of time waiting for caches/RAM anyway; only small set of problems allows for very good analysis on how is memory going to be used (e.g. matrix operations) and can be tuned to use CPU efficiently.

Related

Why is algorithm time complexity often defined in terms of steps/operations?

I've been doing a lot of studying from many different resources on algorithm analysis lately, and one thing I'm currently confused about is why time complexity is often defined in terms of the number of steps/operations an algorithm performs.
For instance, in Introduction to Algorithms, 3rd Edition by Cormen, he states:
The running time of an algorithm on a particular input is the number of primitive operations or “steps” executed. It is convenient to define the notion of step so that it is as machine-independent as possible.
I've seen other resources define the time complexity as such as well. I have a problem with this because, for one, it's called TIME complexity, not "step complexity" or "operations complexity." Secondly, while it's not a definitive source, an answer to a post here on Stackoverflow states "Running time is how long it takes a program to run. Time complexity is a description of the asymptotic behavior of running time as input size tends to infinity." Further, on the Wikipedia page for time complexity it states "In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm." Again, these are definitive sources, things makes logical sense using these definitions.
When analyzing an algorithm and deriving its time complexity function, such as in Figure 1 below, you get an equation that is in units of time. It CAN represent the amount of operations the algorithm performs, but only if those constant factors (C_1, C_2, C_3, etc.) are each a value of 1.
Figure 1
So with all that said, I'm just wondering how it's possible for this to be defined as the number of steps when that's not really what it represents. I'm trying to clear things up and make the connection between time and number of operations. I feel like there is a lot of information that hasn't been explicitly stated in the resources I've studied. Hoping someone can help clear things up for me, and without going into discussion about Big-O because that shouldn't be needed and misses the point of the question, in my opinion.
Thank you everyone for your time and help.
why time complexity is often defined in terms of the number of steps/operations an algorithm performs?
TL;DR: because that is how the asymptotic analysis work; also, do not forget, that time is a relative thing.
Longer story:
Measuring the performance in time, as we, humans understand the time in a daily use, doesn't make much sense, as it is not always that trivial task to do.. furthermore - it even makes no sense in a broader perspective.
How would you measure what is the space and time your algorithm takes? what will be the conditional and predefined unit of the measurement you're going to apply to see the running time/space complexity of your algorithm?
You can measure it on your clock, or use some libraries/API to see exactly how many seconds/minutes/megabytes your algorithm took.. or etc.
However, this all will be VERY much variable! because, the time/space your algorithm took, will depend on:
Particular hardware you're using (architecture, CPU, RAM, etc.);
Particular programming language;
Operating System;
Compiler, you used to compile your high-level code into lower abstraction;
Other environment-specific details (sometimes, even on the temperature.. as CPUs might be scaling operating frequency dynamically)..
therefore, it is not the good thing to measure your complexity in the precise timing (again, as we understand the timing on this planet).
So, if you want to know the complexity (let's say time complexity) of your algorithm, why would it make sense to have a different time for different machines, OSes, and etc.? Algorithm Complexity Analysis is not about measuring runtime on a particular machine, but about having a clear and mathematically defined precise boundaries for the best, average and worst cases.
I hope this makes sense.
Fine, we finally get to the point, that algorithm analysis should be done as a standalone, mathematical complexity analysis.. which would not care what is the machine, OS, system architecture, or anything else (apart from algorithm itself), as we need to observe the logical abstraction, without caring about whether you're running it on Windows 10, Intel Core2Duo, or Arch Linux, Intel i7, or your mobile phone.
What's left?
Best (so far) way for the algorithm analysis, is to do the Asymptotic Analysis, which is an abstract analysis calculated on the basis of input.. and that is counting almost all the steps and operations performed in the algorithm, proportionally to your input.
This way you can speak about the Algorithm, per se, instead of being dependent on the surrounding circumstances.
Moreover; not only we shouldn't care about machine or peripheral factors, we also shouldn't care about Lower Order Terms and Constant Factors in the mathematical expression of the Asymptotic Analysis.
Constant Factors:
Constant Factors are instructions which are independent from the Input data. i.e. which are NOT dependent on the input argument data.
Few reasons why you should ignore them are:
Different programming language syntaxes, as well as their compiled files, will have different number of constant operations/factors;
Different Hardware will give different run-time for the same constant factors.
So, you should eliminate thinking about analyzing constant factors and overrule/ignore them. Only focus on only input-related important factors; therefore:
O(2n) == O(5n) and all these are O(n);
6n2 == 10n2 and all these are n2.
One more reason why we won't care about constant factors is that they we usually want to measure the complexity for sufficiently large inputs.. and when the input grows to the + infinity, it really makes no sense whether you have n or 2n.
Lower order terms:
Similar concept applies in this point:
Lower order terms, by definition, become increasingly irrelevant as you focus on large inputs.
When you have 5x4+24x2+5, you will never care much on exponent that is less than 4.
Time complexity is not about measuring how long an algorithm takes in terms of seconds. It's about comparing different algorithms, how they will perform with a certain amount if input data. And how this performance develops when the input data gets bigger.
In this context, the "number of steps" is an abstract concept for time, that can be compared independently from any hardware. Ie you can't tell how long it will take to execute 1000 steps, without exact specifications of your hardware (and how long one step will take). But you can always tell, that executing 2000 steps will take about twice as long as executing 1000 steps.
And you can't really discuss time complexity without going into Big-O, because that's what it is.
You should note that Algorithms are more abstract than programs. You check two algorithms on a paper or book and you want to analyze which works faster for an input data of size N. So you must analyze them with logic and statements. You can also run them on a computer and measure the time, but that's not proof.
Moreover, different computers execute programs at different speeds. It depends on CPU speed, RAM, and many other conditions. Even a program on a single computer may be run at different speeds depending on available resources at a time.
So, time for algorithms must be independent of how long a single atomic instruction takes to be executed on a specific computer. It's considered just one step or O(1). Also, we aren't interested in constants. For example, it doesn't matter if a program has two or 10 instructions. Both will be run on a fraction of microseconds. Usually, the number of instructions is limited and they are all run fast on computers. What is important are instructions or loops whose execution depends on a variable, which could be the size of the input to the program.

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize algorithm which needs to sort, calculate average, standard deviation and skewness for each vector.
In the next step, the algorithm has to delete the maximal element and repeated calculation of statistical moments until some criteria is not fulfilled for each vector independently.
I would like to ask someone more experienced what is the best approach to parallelize this algorithm.
Is it possible to sort more that one vector at once?
Maybe is it better to not parallelize sorting but the whole algorithm as one thread?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vector = 200,000 blocks; but you can't have 2000 block threads - that excessive
You might be able to use cub's block radix sort, as #talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own - although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
Maybe is it better to not parallelize sorting but the whole algorithm as one thread?
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.

why program running time is not a measure?

i have learned that a program is measured by it's complexity - i mean by Big O Notation.
why don't we measure it by it's absolute running time?
thanks :)
You use the complexity of an algorithm instead of absolute running times to reason about algorithms, because the absolute running time of a program does not only depend on the algorithm used and the size of the input. It also depends on the machine it's running on, various implementations detail and what other programs are currently using system resources. Even if you run the same application twice with the same input on the same machine, you won't get exactly the same time.
Consequently when given a program you can't just make a statement like "this program will take 20*n seconds when run with an input of size n" because the program's running time depends on a lot more factors than the input size. You can however make a statement like "this program's running time is in O(n)", so that's a lot more useful.
Absolute running time is not an indicator of how the algorithm grows with different input sets. It's possible for a O(n*log(n)) algorithm to be far slower than an O(n^2) algorithm for all practical datasets.
Running time does not measure complexity, it only measures performance, or the time required to perform the task. An MP3 player will run for the length of the time require to play the song. The elapsed CPU time may be more useful in this case.
One measure of complexity is how it scales to larger inputs. This is useful for planning the require hardware. All things being equal, something that scales relatively linearly is preferable to one which scales poorly. Things are rarely equal.
The other measure of complexity is a measure of how simple the code is. The code complexity is usually higher for programs with relatively linear performance complexity. Complex code can be costly maintain, and changes are more likely to introduce errors.
All three (or four) measures are useful, and none of them are highly useful by themselves. The three together can be quite useful.
The question could use a little more context.
In programming a real program, we are likely to measure the program's running time. There are multiple potential issues with this though
1. What hardware is the program running on? Comparing two programs running on different hardware really doesn't give a meaningful comparison.
2. What other software is running? If anything else running, it's going to steal CPU cycles (or whatever other resource your program is running on).
3. What is the input? As already said, for a small set, a solution might look very fast, but scalability goes out the door. Also, some inputs are easier than others. If as a person, you hand me a dictionary and ask me to sort, I'll hand it right back and say done. Giving me a set of 50 cards (much smaller than a dictionary) in random order will take me a lot longer to do.
4. What is the starting conditions? If your program runs for the first time, chances are, spinning it off the hard disk will take up the largest chunk of time on modern systems. Comparing two implementations with small inputs will likely have their differences masked by this.
Big O notation covers a lot of these issues.
1. Hardware doesn't matter, as everything is normalized by the speed of 1 operation O(1).
2. Big O talks about the algorithm free of other algorithms around it.
3. Big O talks about how the input will change the running time, not how long one input takes. It tells you the worse the algorithm will perform, not how it performs on an average or easy input.
4. Again, Big O handles algorithms, not programs running in a physical system.

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the degrees (accurate to 1 degree), would a look up table with 360 entries (for every angle) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) requires a series of these low-level calculations to be performed. These things are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
While the quick answer is that they are more expensive than the primitive math functions (addition/multiplication/subtraction etc...) they are not -expensive- in terms of human time. Typically the reason people optimize them with look-up tables and approximations is because they are calling them potentially tens of thousands of times per second and every microsecond could be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
Yes, (relative to other mathematical operations multiply, divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store them in a variable instead of calculating them every time. This also applies within your method/function call where your angle is not going to change. You can be smart by using some formulas (calculating sin(theta) from sin(theta/2), knowing how often the values repeat - sin(theta + 2*pi*n) = sin(theta)) and reducing computation. See this wikipedia article
yes it is. trig functions are computed by summing up a series. So in general terms, it would be a lot more costly then a simple mathematical operation. same goes for sqrt

Maximum Increase in Processing Speed via Parallelism

Are there any cases in which anything more than a linear speed increase comes from parallelising an algorithm ?
The maximum you can reach from a theory viewpoint is linear speedup.
In practice, it is possible super linear speedup. If you can distribute your problem in a away that you can leverage effects of processor caches, e.g. because it does not fit in the cache of a single core, your problem can scale better than linear.
In theory, no - but in practice this might be the case (depending on the underlying hardware and your specific problem). Its not trivial to compare parallel and sequential code (you have to compare the fastest sequential implementation with your parallel implementation, not just your parallel implementation running on a single processor/thread).
But still, when someone speaks about more-than-linear speed-up I would always be suspicious; they either didn't measure it correctly (see above), measured an artifact (hardware/OS dependent) and should document it accordingly, or this only works for a specific combination of problem/implementation/hardware.

Resources