Real running time calculation of matrix multiplication [closed] - performance

I would like to calculate approximately the running time of a matrix multiplication problem. Below are my assumptions:
No parallel programming
A 2 Ghz CPU
A square matrix of size n
An O(n^3) algorithm
For example, suppose that n = 1000. Roughly how much time should I expect squaring this matrix to take under the above assumptions?
Thanks.

This really terribly depends on the algorithm and the CPU. Even without parallelization, there's a lot of freedom in how the same steps could be represented on a CPU, and there are differences (in clock cycles needed for various operations) between different CPUs of the same family, too. Don't forget, either, that modern CPUs add some parallelization of instructions on their own. Optimization done by the compiler will make a difference by reordering memory accesses and branches, and will likely convert instructions to vectorized ones even if you didn't ask for that. Depending on further factors it may also matter whether your matrices are at a fixed location in memory or accessed through a pointer, and whether they are allocated with a fixed size or each row / column dynamically. Don't forget about memory caching, page invalidations, and operating system scheduling, as I did in previous versions of my answer.
If this is for your own rough estimate or for a "typical" case, you won't do much wrong by just writing the program, running it in your specific conditions (as discussed above) in many repetitions for n = 1000, and calculating the average.
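For a concrete starting point, here is a minimal benchmark sketch (in Python, which is an assumption on my part; the question doesn't name a language). It times the naive O(n^3) triple loop a few times and averages; n is kept small here only so the interpreted version finishes quickly, a compiled version could use n = 1000 directly:

    import time

    def matmul(a, b):
        n = len(a)
        c = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for k in range(n):          # i-k-j loop order for better locality
                aik = a[i][k]
                for j in range(n):
                    c[i][j] += aik * b[k][j]
        return c

    n, repeats = 128, 3                 # illustrative sizes, not from the question
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, a)                    # "square" the matrix
        timings.append(time.perf_counter() - start)
    print(sum(timings) / repeats, "seconds on average for n =", n)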
If you want a lot of hard work for a worse result, you can actually do what you probably meant to do in your original question yourself:
see what instructions your specific compiler produces for your specific algorithm under your specific conditions and with specific optimization settings (like here)
pick your specific processor and find its latency table for every instruction that's there,
add them up per iteration and multiply by 1000^3,
divide by the clock frequency.
Seriously, it's not worth the effort; a benchmark is faster, clearer, and more precise anyway (the latency-table approach does not account for what happens in the branch predictor, hyperthreading, memory caching, and other architectural details). If you want it as an exercise, I'll leave that to you.

Related

Is big-O notation relevant for concurrent world? [closed]

In most popular languages like C/C++/C#/Erlang/Java we have threads/processes, and the GPGPU computation market is growing. If an algorithm requires N data-independent steps, we don't get the same performance as when the algorithm requires all steps to follow one another. So I wonder whether big-O notation still makes sense in a concurrent world? And if it does not, what is relevant for analyzing algorithm performance?
You can have N or more processors in a distributed environment (GPGPU / cluster / FPGA of the future, where you can get as many cores as you need - a concurrent world, not limited to a fixed number of parallel cores).
Big-O notation is still relevant.
You have a constant number of processors (by assumption), thus only a constant number of things can happen concurrently, thus you can only speed up an algorithm by a constant factor, and big-O ignores constant factors.
So whether you look at the total number of steps, or only consider the number of steps taken by the processor processing the most steps, we end up with exactly the same time complexity, and this still ends up giving us a decent idea of the rate of growth of the time taken in relation to the input size.
... future where you can get as many cores as you need - concurrent world, not limited to the number of parallel cores.
If we even get to the stage where you can run an algorithm with exponential running time on very large input in seconds, then yes, big-O notation, and the study of algorithms, will probably become much less relevant.
But consider, for example, that an O(n!) algorithm with n = 1000 (which is pretty small, to be honest) will require in the region of 4x10^2567 steps, which is about 4x10^2480 times more than the mere 10^87 estimated number of atoms in the entire observable universe. In short, big-O notation is unlikely to ever become irrelevant.
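If you want to verify the order of magnitude yourself, a quick check (Python, my choice of language for the sketch):

    import math

    # log10(1000!) via the log-gamma function, to avoid building the huge integer
    log10_steps = math.lgamma(1001) / math.log(10)
    print(log10_steps)        # ~2567.6, so 1000! is roughly 4 x 10^2567
    print(log10_steps - 87)   # ~2480.6 orders of magnitude above the 10^87 figure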
Even on the assumption of an effectively unlimited number of processors, we can still use big-O notation to indicate the steps taken by the processor processing the most steps (which should indicate the running time).
We could also use it to indicate the number of processors used, if we'd like.
The bottom line is that big-O notation shows the rate of growth of a function - a function which could represent just about anything. Just because we typically use it to indicate the number of arithmetic computations, steps, comparisons or similar doesn't mean those are the only things we can use it for.
Big-O is a mathematical concept. It's always relevant, it would have been relevant before computers even existed, it's relevant now, it will always be relevant, it's even relevant to hypothetical aliens millions of light years away (if they know about it). Big-O is not just something we use to talk about how running time scales, it has a mathematical definition and it's about functions.
But there are many models of computation (unfortunately many people forget that, and even forget that what they're using is a model at all) and which ones make sense to use is not always the same.
For example, if you're talking about the properties of a parallel algorithm, assuming you have a constant number of processing elements essentially ignores the parallel nature of the algorithm. So in order to be able to express more, a commonly used model (but by no means the only one) is PRAM.
That you don't actually have an unlimited number of processing elements in reality is of no import. It's a model. The whole point is to abstract reality away. You don't have unlimited memory either, which is one of the assumptions of a Turing machine.
For models even further removed from reality, see hypercomputation.
Multithreading and using GPUs just use parallelization to speed up algorithms. But there are algorithms that cannot be sped up this way.
Even if an algorithm can be sped up by parallelization, an O(N log N) algorithm will still be much faster than an O(N²) algorithm.

A couple of CUDA-performance questions

This is the first time I'm asking a question here, so thanks very much in advance and please forgive my ignorance. Also, I've just started CUDA programming.
Basically, I have a bunch of points, and I want to calculate all the pairwise distances. Currently my kernel function just holds one point, iteratively reads in all the other points (from global memory), and does the calculation. Here are some of my confusions:
I'm using a Tesla M2050 with 448 cores. But my current parallel version (kernel<<<128,16,16>>>) achieves much higher parallelism than that (it is about 600x faster than kernel<<<1,1,1>>>). Is this due to multithreading or to pipelining, or do those actually mean the same thing?
I want to further improve the performance, so I figured I'd use shared memory to hold some of the input points for each block. But the new code is just as fast. What's the possible cause? Could it be related to the fact that I set too many threads?
Or is it because I have an if-statement in the code? The thing is, I only consider and count the short distances, so I have a statement like if (dist < 200). How much should I worry about this one?
A million thanks!
Bin
Mark Harris has a very good presentation about optimizing CUDA: Optimizing Parallel Reduction in CUDA.
Algorithmic optimizations: changes to addressing, algorithm cascading (11.84x speedup, combined!)
Code optimizations: loop unrolling (2.54x speedup, combined)
Having the extra conditional statement does indeed cause problems, although it should be the last thing you optimize, if only because you need to know the layout of your code before committing to assumptions about sizes!
The problem you are working on sounds like the famous n-body problem,
see Fast N-Body Simulation with CUDA.
An additional performance increase can be achieved if you can avoid doing a full pairwise computation, for example when elements are too far apart to have an effect on each other. This applies to any relationship that can be expressed geometrically, whether it be pairwise costs or a physics simulation with springs. My favorite method is to divide the grid into boxes; each element puts itself into a box via a division, and then you only evaluate pairwise relations between neighboring boxes. This can be called O(n*m).
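A minimal CPU-side sketch of that box idea (in Python rather than CUDA, purely to illustrate the data layout; the names and sizes are made up):

    import numpy as np
    from collections import defaultdict

    def short_pairs(points, cutoff):
        # hash each point into a square cell of side `cutoff`
        cells = defaultdict(list)
        for idx, p in enumerate(points):
            cells[tuple((p // cutoff).astype(int))].append(idx)

        pairs = []
        for (cx, cy), members in cells.items():
            for dx in (-1, 0, 1):              # this cell and its 8 neighbours
                for dy in (-1, 0, 1):
                    for j in cells.get((cx + dx, cy + dy), ()):
                        for i in members:
                            if i < j:          # count each pair exactly once
                                d = np.linalg.norm(points[i] - points[j])
                                if d < cutoff:
                                    pairs.append((i, j, d))
        return pairs

    pts = np.random.rand(500, 2) * 1000.0      # 500 random 2-D points
    print(len(short_pairs(pts, 200.0)))        # pairs closer than 200

Because the cell side equals the cutoff, any pair within the cutoff is guaranteed to sit in the same or an adjacent cell, so nothing is missed.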
(1) The GPU runs many more threads in parallel than there are cores. This is because each core is pipelined. Operations take around 20 cycles on compute capability 2.0 (Fermi) architectures. So for each clock cycle, the core starts work on a new operation, returns the finished result of one operation, and moves all the other (around 18) operations one more step towards completion. So, to saturate the GPU, you might need something like 448 * 20 threads.
(2) It's probably because your values are getting cached in the L1 and L2 caches.
(3) It depends on how much work you're doing inside the if conditional. The GPU must run all 32 threads in a warp through all the code inside the if even if the condition is true for only one of those threads. If there is a lot of code in the conditional compared to the rest of your kernel, and relatively few threads go through that code path, it is likely that you end up with low compute throughput.

Why are numbers rounded in standard computer algorithms? [closed]

I understand that this makes the algorithms faster and makes them use less storage space, and that these would have been critical features for software running on the hardware of previous decades, but is this still an important feature? If calculations were done with exact rational arithmetic then there would be no rounding errors at all, which would simplify many algorithms, as you would no longer have to worry about catastrophic cancellation or anything like that.
Floating point is much faster than arbitrary-precision and symbolic packages, and 12-16 significant figures is usually plenty for demanding science/engineering applications where non-integral computations are relevant.
The programming language ABC used rational numbers (x / y where x and y were integers) wherever possible.
Sometimes calculations would become very slow because the numerator and denominator had become very big.
So it turns out that it's a bad idea if you don't put some kind of limit on the numerator and denominator.
In the vast majority of computations, the size of the numbers required to compute answers exactly would quickly grow beyond the point where computation would be worth the effort, and in many calculations it would grow beyond the point where exact calculation would even be possible. Consider that even running something like a simple third-order IIR filter for a dozen iterations would require a fraction with thousands of bits in the denominator; running the algorithm for a few thousand iterations (hardly an unusual operation) could require more bits in the denominator than there exist atoms in the universe.
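You can watch this growth with Python's exact fractions. A minimal sketch (the recurrence and coefficients are made up for illustration, not taken from any real filter):

    from fractions import Fraction

    a = Fraction(355, 113)                  # arbitrary rational coefficients
    b = Fraction(52163, 16604)

    y = Fraction(0)
    for n in range(1, 25):
        x = Fraction(1, 2 * n + 1)          # inputs with varying denominators
        y = a * y + b * x                   # y[n] = a*y[n-1] + b*x[n], exactly
        print(n, y.denominator.bit_length(), "bits in the denominator")

The denominator's bit length grows steadily with every step, which is exactly the problem described above.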
Many numerical algorithms still require fixed-precision numbers in order to perform well enough. Such calculations can be implemented in hardware because the numbers fit entirely in registers, whereas arbitrary precision calculations must be implemented in software, and there is a massive performance difference between the two. Ask anybody who crunches numbers for a living whether they'd be ok with things running X amount slower, and they probably will say "no that's completely unworkable."
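As a rough illustration of that gap in software alone (hardware floats vs. Python's arbitrary-precision rationals; the sizes are arbitrary):

    import timeit
    from fractions import Fraction

    def harmonic_float(n):
        return sum(1.0 / k for k in range(1, n + 1))

    def harmonic_exact(n):
        return sum(Fraction(1, k) for k in range(1, n + 1))

    print("float   :", timeit.timeit(lambda: harmonic_float(2000), number=5))
    print("Fraction:", timeit.timeit(lambda: harmonic_exact(2000), number=5))

The exact version is orders of magnitude slower, and that is before the hardware-vs-software difference described above even comes into play.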
Also, I think you'll find that having arbitrary precision is impractical and even impossible. For example, the number of decimal places can grow fast enough that you'll want to drop some. And then you're back to square one: rounded number problems!
Finally, sometimes the digits beyond a certain precision do not matter anyway. For example, generally the number of significant digits should reflect the level of experimental uncertainty.
So, which algorithms do you have in mind?
Traditionally integer arithmetic is easier and cheaper to implement in hardware (uses less space on the die so you can fit more units on there). Especially when you go into the DSP segment this can make a lot of difference.

Robust algorithm for chromatic instrument tuner? [closed]

Who knows the most robust algorithm for a chromatic instrument tuner?
I am trying to write an instrument tuner. I have tried the following two algorithms:
FFT to create a welch periodogram and then detect the peak frequency
A simple autocorrelation (http://en.wikipedia.org/wiki/Autocorrelation)
I encountered the following basic problems:
Accuracy 1: in an FFT the relation between sample rate, recording length and bin size is fixed. This means that I need to record 1-2 seconds of data to get an accuracy of a few cents (see the rough numbers after this list). This is not exactly what I would call realtime.
Accuracy 2: autocorrelation works a bit better. To get the needed accuracy of a few cents I had to introduce linear interpolation of samples.
Robustness: in the case of a guitar I see a lot of overtones. Some overtones are actually stronger than the main tone produced by the string. I could not find a robust way to select the right string played.
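To put rough numbers on the first point (the figures here are my own illustrative assumptions, not the asker's):

    fs = 44100.0    # assumed sample rate, Hz
    f0 = 110.0      # a low guitar-string region, Hz

    one_cent = f0 * (2 ** (1 / 1200) - 1)        # width of one cent near f0, in Hz
    for cents in (1, 3, 10):
        bin_width = cents * one_cent             # raw FFT bin width needed
        n = fs / bin_width                       # samples needed for that bin width
        print(f"{cents:2d} cents -> bin width {bin_width:.4f} Hz, "
              f"~{n:,.0f} samples (~{n / fs:.1f} s of audio)")

Without interpolation, a few cents of raw FFT resolution around a low string really does mean seconds of audio.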
Still, any cheap electronic tuner works more robustly than my implementation.
How are those tuners implemented?
You can interpolate FFTs too, and you can often use the higher harmonics for increased precision. You need to know a little bit about the harmonics of the note the instrument produces, and it's easier if you can assume you're less than half an octave off target, but even in the absence of that, the fundamental frequency is usually much stronger than the first subharmonic, and is not that far below the primary harmonic. A simple heuristic should let you pick the fundamental frequency.
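For the FFT-interpolation part, one common trick is a parabolic fit through the peak bin and its two neighbours. A minimal sketch (the function name and parameters are mine, not the answerer's):

    import numpy as np

    def refined_peak_hz(signal, fs):
        windowed = np.asarray(signal) * np.hanning(len(signal))
        spectrum = np.abs(np.fft.rfft(windowed))
        k = int(np.argmax(spectrum[1:-1])) + 1              # peak bin, skipping DC/edge
        a, b, c = np.log(spectrum[k - 1 : k + 2] + 1e-12)   # peak and its neighbours
        delta = 0.5 * (a - c) / (a - 2 * b + c)             # sub-bin offset, in bins
        return (k + delta) * fs / len(signal)

    fs = 44100
    t = np.arange(8192) / fs
    print(refined_peak_hz(np.sin(2 * np.pi * 440.0 * t), fs))   # close to 440 Hz

This gets the frequency estimate well under one bin of error from a much shorter recording than the raw bin spacing would suggest.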
I doubt that the autocorrelation method will work all that robustly across instruments, but you should get a series of self-similarity scores that is highest when you're offset by one period of the fundamental. If you go two periods, you should get the same score again (to within noise and differential damping of the different harmonics).
There's a pretty cool algorithm called Bitstream Autocorrelation. It doesn't take too many CPU cycles, and it's very accurate. You basically find all the zero-crossing points and save them as a binary string. Then you run autocorrelation on that string. It's fast because you can use XOR instead of floating-point multiplication.
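A minimal sketch of that idea (my own toy version, not the reference implementation): threshold the signal at zero to get a bit stream, then count XOR mismatches at each candidate lag and take the lag with the lowest mismatch rate:

    import numpy as np

    def bitstream_pitch(signal, samplerate, min_lag=32, max_lag=1024):
        bits = np.asarray(signal) >= 0.0          # 1-bit version of the signal
        best_lag, best_score = None, None
        for lag in range(min_lag, max_lag):
            mismatches = np.count_nonzero(bits[:-lag] ^ bits[lag:])
            score = mismatches / (bits.size - lag)
            if best_score is None or score < best_score:
                best_lag, best_score = lag, score
        return samplerate / best_lag              # estimated fundamental, Hz

    fs = 44100
    t = np.arange(4096) / fs
    tone = np.sin(2 * np.pi * 110 * t) + 0.5 * np.sin(2 * np.pi * 220 * t)
    print(bitstream_pitch(tone, fs))              # ~110 Hz despite the strong overtone

A real tuner would also prefer the smallest lag among near-minimal scores, to avoid locking onto a multiple of the period.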

What is an efficient way to go beyond a greedy algorithm

The domain of this question is scheduling operations on constrained hardware. The resolution of the result is the number of clock cycles the schedule fits within. The search space grows very rapidly: early decisions constrain future decisions, and the total number of possible schedules grows exponentially. A lot of the possible schedules are equivalent, because just swapping the order of two instructions usually results in the same timing constraint.
Basically the question is: what is a good strategy for exploring this vast search space without spending too much time? I expect to search only a small fraction of it, but would like to explore different parts of the search space while doing so.
The current greedy algorithm sometimes makes stupid decisions early on, and my attempt at branch and bound was far too slow.
Edit:
I want to point out that the result is very binary: the greedy algorithm might end up using 8 cycles while a solution using only 7 cycles exists and can be found by branch and bound.
The second point is that there are significant restrictions in data routing between instructions, and dependencies between instructions, that limit the amount of commonality between solutions. Look at it as a knapsack problem with a lot of ordering constraints, where some solutions also fail completely because of routing congestion.
Clarification:
In each cycle there is a limit on how many operations of each type can run, and some operations have two possible types. There is a set of routing constraints which can be varied to be either fairly tight or pretty forgiving, and the limit depends on routing congestion.
Integer linear optimization for NP-hard problems
Depending on your side constraints, you may be able to use the critical path method or
(as suggested in a previous answer) dynamic programming. But many scheduling problems are NP-hard, just like the classical traveling salesman problem --- a precise solution has a worst case of exponential search time, just as you describe in your problem.
It's important to know that while NP-hard problems still have a very bad worst-case solution time, there is an approach that very often produces exact answers with very short computations (the average case is acceptable, and you often don't see the worst case).
This approach is to convert your problem to a linear optimization problem with integer variables. There are free-software packages (such as lp-solve) that can solve such problems efficiently.
The advantage of this approach is that it may give you exact answers to NP-hard problems in acceptable time. I used this approach in a few projects.
As your problem statement does not include more details about the side constraints, I cannot go into more detail how to apply the method.
Edit/addition: Sample implementation
Here are some details about how to implement this method in your case (of course, I make some assumptions that may not apply to your actual problem --- I only know the details from your question):
Let's assume that you have 50 instructions cmd(i) (i=1..50) to be scheduled in 10 or less cycles cycle(t) (t=1..10). We introduce 500 binary variables v(i,t) (i=1..50; t=1..10) which indicate whether instruction cmd(i) is executed at cycle(t) or not. This basic setup gives the following linear constraints:
v_it integer variables
0<=v_it; v_it<=1; # 1000 constraints: i=1..50; t=1..10
sum(v_it: t=1..10)==1 # 50 constraints: i=1..50
Now, we have to specify your side conditions. Let's assume that operations cmd(1)...cmd(5) are multiplication operations and that you have exactly two multipliers --- in any cycle, you may perform at most two of these operations in parallel:
sum(v_it: i=1..5)<=2 # 10 constraints: t=1..10
For each of your resources, you need to add the corresponding constraints.
Also, let's assume that operation cmd(7) depends on operation cmd(2) and needs to be executed after it. To make the equation a little bit more interesting, let's also require a two-cycle gap between them:
sum(t*v(2,t): t=1..10) + 3 <= sum(t*v(7,t): t=1..10) # one constraint
Note: sum(t*v(2,t): t=1..10) is the cycle t where v(2,t) is equal to one.
Finally, we want to minimize the number of cycles. This is somewhat tricky because you get quite big numbers in the way that I propose: we assign each v(i,t) a price that grows exponentially with time, so pushing operations off into the future is much more expensive than performing them early:
sum(6^t * v(i,t): i=1..50; t=1..10) --> minimum. # one target function
I choose 6 to be bigger than 5 to ensure that adding one cycle to the system makes it more expensive than squeezing everything into fewer cycles. A side effect is that the program will go out of its way to schedule operations as early as possible. You may avoid this by performing a two-step optimization: first, use this target function to find the minimal number of necessary cycles. Then, ask the same problem again with a different target function --- limiting the number of available cycles at the outset and imposing a more moderate price penalty for later operations. You have to play with this; I hope you get the idea.
Hopefully, you can express all your requirements as such linear constraints in your binary variables. Of course, there may be many opportunities to exploit your insight into your specific problem to get by with fewer constraints or fewer variables.
Then, hand your problem off to lp-solve or cplex and let them find the best solution!
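If you prefer to stay in Python, the same toy formulation can be sketched with the PuLP modelling library and its bundled CBC solver (my choice of tooling; the answer itself only mentions lp-solve and cplex):

    import pulp

    # 50 instructions, at most 10 cycles; v[i][t] == 1 iff cmd(i) runs in cycle(t)
    I, T = range(1, 51), range(1, 11)
    prob = pulp.LpProblem("instruction_schedule", pulp.LpMinimize)
    v = pulp.LpVariable.dicts("v", (I, T), cat="Binary")

    for i in I:                       # each instruction is scheduled exactly once
        prob += pulp.lpSum(v[i][t] for t in T) == 1

    for t in T:                       # at most two of the multiplications cmd(1)..cmd(5) per cycle
        prob += pulp.lpSum(v[i][t] for i in range(1, 6)) <= 2

    # cmd(7) at least three cycle numbers after cmd(2), i.e. a two-cycle gap
    prob += (pulp.lpSum(t * v[2][t] for t in T) + 3
             <= pulp.lpSum(t * v[7][t] for t in T))

    # exponential price on later cycles, as described above
    prob += pulp.lpSum((6 ** t) * v[i][t] for i in I for t in T)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    for i in I:
        cycle = next(t for t in T if pulp.value(v[i][t]) > 0.5)
        print(f"cmd({i}) -> cycle({cycle})")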
At first blush, it sounds like this problem might fit into a dynamic programming solution. Several operations may take the same amount of time so you might end up with overlapping subproblems.
If you can map your problem to the "travelling salesman" (like: Find the optimal sequence to run all operations in minimum time), then you have an NP-complete problem.
A very quick way to solve that is the ant algorithm (or ant colony optimization).
The idea is that you send an ant down every path. The ant spreads a smelly substance on the path which evaporates over time. Short parts mean that the path will stink more when the next ant comes along. Ants prefer smelly over clean paths. Run thousands of ants through the network. The most smelly path is the optimal one (or at least very close).
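A hedged, minimal sketch of that ant idea on a toy tour problem (everything here, from the city coordinates to the 0.9 evaporation rate, is an illustrative assumption; a real scheduler would score candidate schedules instead of tours):

    import math
    import random

    random.seed(0)
    n = 12
    cities = [(random.random(), random.random()) for _ in range(n)]
    dist = [[math.dist(a, b) or 1e-9 for b in cities] for a in cities]
    pher = [[1.0] * n for _ in range(n)]        # the "smelly substance"

    def tour_length(tour):
        return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

    def build_tour(alpha=1.0, beta=3.0):
        start = random.randrange(n)
        tour, unvisited = [start], set(range(n)) - {start}
        while unvisited:
            here = tour[-1]
            choices = list(unvisited)
            weights = [(pher[here][j] ** alpha) * ((1.0 / dist[here][j]) ** beta)
                       for j in choices]
            tour.append(random.choices(choices, weights=weights)[0])
            unvisited.remove(tour[-1])
        return tour

    best = None
    for _ in range(200):                        # 200 generations of 20 ants
        tours = [build_tour() for _ in range(20)]
        for row in pher:                        # evaporation
            for j in range(n):
                row[j] *= 0.9
        for tour in tours:                      # shorter tours deposit more scent
            deposit = 1.0 / tour_length(tour)
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                pher[a][b] += deposit
                pher[b][a] += deposit
        champion = min(tours, key=tour_length)
        if best is None or tour_length(champion) < tour_length(best):
            best = champion
    print(best, tour_length(best))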
Try simulated annealing, cfr. http://en.wikipedia.org/wiki/Simulated_annealing .
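A matching skeleton for simulated annealing (again purely illustrative: the toy cost function and swap move stand in for whatever your real schedule evaluation and neighbourhood are):

    import math
    import random

    def anneal(initial, cost, neighbour, t_start=10.0, t_end=1e-3, alpha=0.995):
        state = best = initial
        t = t_start
        while t > t_end:
            candidate = neighbour(state)
            delta = cost(candidate) - cost(state)
            if delta <= 0 or random.random() < math.exp(-delta / t):
                state = candidate
                if cost(state) < cost(best):
                    best = state
            t *= alpha                       # geometric cooling schedule
        return best

    # toy problem: order 20 jobs to minimise violated "a before b" wishes
    random.seed(1)
    wishes = [(random.randrange(20), random.randrange(20)) for _ in range(40)]

    def violations(order):
        pos = {job: p for p, job in enumerate(order)}
        return sum(1 for a, b in wishes if a != b and pos[a] > pos[b])

    def swap_two(order):
        i, j = random.sample(range(len(order)), 2)
        new = list(order)
        new[i], new[j] = new[j], new[i]
        return new

    start = list(range(20))
    result = anneal(start, violations, swap_two)
    print(violations(start), "->", violations(result))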
