Memory and time issues when dividing two matrices - performance

I have two sparse matrices in MATLAB:
M1 of size 9,000 x 1.8 million and M2 of size 1.8 million x 1.8 million.
Now I need to calculate the expression
M1/M2
and it took about an hour. Is that normal? Is there a more efficient way in MATLAB to overcome this time issue? An hour is a lot, and if I run a number of iterations, each one will take another hour. Any suggestions?

A quick back-of-the-envelope calculation based on assuming some iterative method like conjugate gradient or Kaczmarz method is used, and plugging in the sizes makes me believe that an hour isn't bad.
Because of the tridiagonality of the matrix that's being "inverted" (if not explicitly), both of those methods are going to take a number of instructions near "some near-unity scalar factor" times ~9000 times 1.8e6 times "the number of iterations required for convergence". The product of the two things in quotes is probably around 50 (minimum) to around 1000 (maximum). I didn't cherry-pick these to make your math work; these are about what I'd expect from having done these. If you assume about 1e9 instructions per second (which doesn't account much for memory access etc.) you get around 13 minutes to around 4.5 hours.
Thus, it seems in the right range for an algorithm that's exploiting sparsity.
You might be able to exploit it better yourself if you know the structure, but probably not by much.
Note, this isn't to say that 13 minutes is achievable.
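The arithmetic behind that range can be reproduced in a few lines. This is only a sketch of the estimate above; the names are made up for illustration, and the 1e9 instructions per second figure is the same rough assumption the answer makes.

```python
# Back-of-the-envelope estimate: rows * columns * combined factor / speed.
ROWS = 9_000                    # rows of M1
COLS = 1.8e6                    # columns of M1 (and size of M2)
INSTRUCTIONS_PER_SECOND = 1e9   # rough assumption, ignores memory access


def estimated_seconds(combined_factor):
    """combined_factor = (near-unity scalar) * (iterations to converge)."""
    return ROWS * COLS * combined_factor / INSTRUCTIONS_PER_SECOND


low = estimated_seconds(50)      # optimistic end of the range
high = estimated_seconds(1000)   # pessimistic end of the range
print(f"{low / 60:.1f} min to {high / 3600:.1f} h")  # 13.5 min to 4.5 h
```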
Edit: One side note, I'm not sure what's being used, but I assumed iterative methods. It's also possible that direct methods are used (like explained here). These methods can be very efficient for sparse systems if you exploit the sparsity right. It's very possible that Matlab is using these by default, but it's worth investigating what Matlab is doing in your case.
In my limited experience, iterative methods were usually preferred over direct methods as the systems get large (and yours is large). Our linear systems worked out to be block tridiagonal as well, as they often do in image processing.

CUDA Sorting Many Vectors / Arrays

I have many (200 000) vectors of integers (around 2000 elements in each vector) in GPU memory.
I am trying to parallelize an algorithm which needs to sort each vector and calculate its average, standard deviation and skewness.
In the next step, the algorithm has to delete the maximal element and repeat the calculation of the statistical moments until some criterion is no longer fulfilled, for each vector independently.
I would like to ask someone more experienced what the best approach to parallelizing this algorithm is.
Is it possible to sort more than one vector at once?
Maybe is it better to not parallelize sorting but the whole algorithm as one thread?
200 000 vectors of integers ... 2000 elements in each vector ... in GPU memory.
2,000 integers sounds like something a single GPU block could tackle handily. They would fit in its shared memory (or into its register file, but that would be less useful for various reasons), so you wouldn't need to sort them in global memory. 200,000 vectors = 200,000 blocks; but you can't have 2,000 threads in a block, that's excessive.
You might be able to use CUB's block radix sort, as @talonmies suggests, but I'm not too sure that's the right thing to do. You might be able to do it with Thrust, but there's also a good chance you'll have a lot of overhead and complex code (I may be wrong though). Give serious consideration to adapting an existing (bitonic) sort kernel, or even writing your own, although that's more challenging to get right.
Anyway, if you write your own kernel, you can code your "next step" after sorting the data.
Maybe is it better to not parallelize sorting but the whole algorithm as one thread?
This depends on how much time your application spends on these sorting efforts at the moment, relative to its entire running time. See also Amdahl's Law for a more formal statement of the above. Having said that - typically it should be worthwhile to parallelize the sorting when you already have data in GPU memory.
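Amdahl's Law can be written down in a couple of lines. The sketch below is plain Python rather than CUDA, and the fractions and speedups in the usage lines are made-up numbers for illustration, not measurements of this algorithm.

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Overall speedup when only `parallel_fraction` of the run time
    is accelerated by a factor of `parallel_speedup`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Hypothetical cases: sorting takes 10% vs. 90% of the total run time,
# and the GPU version of the sort is assumed to be 50x faster.
for fraction in (0.10, 0.90):
    print(fraction, round(amdahl_speedup(fraction, 50.0), 2))
```

If the sort is only a small share of the run time, even a very fast GPU sort barely changes the total, which is the point of the answer above.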

Multiple parameter optimization with a stochastic element

I am looking for a method to find the best parameters for a simulation. It's about break-shots in billiards / pool. A shot is defined by 7 parameters, I can simulate the shot and then rate the outcome and I would like to compute the best parameters.
I have found the following link here:
Multiple parameter optimization with lots of local minima
suggesting 4 kinds of algorithms. In the pool simulator I am using, each shot is altered by a small random value every time it is simulated. If I simulate the same shot twice, the outcome will be different. So I am looking for an algorithm like the ones in the link above, only with the addition of a stochastic element, optimizing for the 7 parameters that will on average yield the best outcome, i.e. a break shot that is most likely to be a success. My initial idea was to simulate the shot 100 or 1000 times and just take the average as the rating for the algorithms above, but I still feel like there is a better way. Does anyone have an idea?
The 7 parameters are continuous but within different ranges (one from 0 to 10, another from 0.0 to 0.028575 and so on).
Thank you
At least for some of the algorithms, simulating the same shot repeatedly might not be necessary. As long as your alternatives have some form of momentum, like in the swarm simulation approach, you can let that be affected by the outcome of each individual simulation. In that case, a single unlucky simulation would slow the movement in parameter space only slightly, whereas a serious loss of quality should be enough to stop and reverse the movement. Those algorithms which don't use momentum might be tweaked to have momentum. If not, then repeated simulation seems the best approach, unless you can get your hands on the internals of the simulator and rate the shot as a whole without having to simulate it over and over again.
You can use the algorithms you mentioned in your non-deterministic scenario with independent stochastic runs. Your idea of repeated simulations is good; you can read more about how many repeats you might have to consider for your simulations (unfortunately, there is no trivial answer). If you are not so much into maths, and the runs go fast, do 1,000 repeats, then 10,000 repeats, and see if the results differ largely. If they do, you have to collect more samples; if not, you are probably on the safe side (by the law of large numbers, the sample average converges as the number of repeats grows).
Further, do not just consider the average! Make sure to look into the standard deviation for each algorithm's results; you might want to use box plots to compare their quartiles. If you rely on the average only, you could pick an algorithm that produces very varying results, sometimes excellent, sometimes terrible in performance.
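The "repeat, then compare 1,000 against 10,000 runs, and look at the spread as well as the average" advice could look roughly like this. `simulate_shot` is a hypothetical stand-in for the real pool simulator (a simple score plus Gaussian noise), so only the structure carries over.

```python
import random
import statistics


def simulate_shot(params, rng):
    """Hypothetical simulator: a deterministic score plus random noise."""
    ideal_score = -sum((p - 0.5) ** 2 for p in params)
    return ideal_score + rng.gauss(0.0, 1.0)


def rate_shot(params, repeats, rng):
    """Return (mean, standard deviation) of the score over many runs."""
    scores = [simulate_shot(params, rng) for _ in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)


rng = random.Random(42)
params = [0.5] * 7  # one candidate break shot with 7 parameters
for repeats in (1_000, 10_000):
    mean, spread = rate_shot(params, repeats, rng)
    print(repeats, round(mean, 3), round(spread, 3))
```

If the two means differ by much more than spread / sqrt(repeats), more samples are probably needed before comparing candidate shots.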
I don't know what language you are using, but if you use Java, I am maintaining a tool that could simplify your "monte carlo" style experiments.

Smoothing values over time: moving average or something better?

I'm coding something at the moment where I'm taking a bunch of values over time from a hardware compass. This compass is very accurate and updates very often, with the result that if it jiggles slightly, I end up with the odd value that's wildly inconsistent with its neighbours. I want to smooth those values out.
Having done some reading around, it would appear that what I want is a high-pass filter, a low-pass filter or a moving average. Moving average I can get down with, just keep a history of the last 5 values or whatever, and use the average of those values downstream in my code where I was once just using the most recent value.
That should, I think, smooth out those jiggles nicely, but it strikes me that it's probably quite inefficient, and this is probably one of those Known Problems to Proper Programmers to which there's a really neat Clever Math solution.
I am, however, one of those awful self-taught programmers without a shred of formal education in anything even vaguely related to CompSci or Math. Reading around a bit suggests that this may be a high or low pass filter, but I can't find anything that explains in terms comprehensible to a hack like me what the effect of these algorithms would be on an array of values, let alone how the math works. The answer given here, for instance, technically does answer my question, but only in terms comprehensible to those who would probably already know how to solve the problem.
It would be a very lovely and clever person indeed who could explain the sort of problem this is, and how the solutions work, in terms understandable to an Arts graduate.
If you are trying to remove the occasional odd value, a low-pass filter is the best of the three options that you have identified. Low-pass filters allow low-speed changes such as the ones caused by rotating a compass by hand, while rejecting high-speed changes such as the ones caused by bumps on the road, for example.
A moving average will probably not be sufficient, since the effects of a single "blip" in your data will affect several subsequent values, depending on the size of your moving average window.
If the odd values are easily detected, you may even be better off with a glitch-removal algorithm that completely ignores them:
if (abs(thisValue - averageOfLast10Values) > someThreshold)
{
    thisValue = averageOfLast10Values;
}
Here is a quick graph to illustrate:
The first graph is the input signal, with one unpleasant glitch. The second graph shows the effect of a 10-sample moving average. The final graph is a combination of the 10-sample average and the simple glitch detection algorithm shown above. When the glitch is detected, the 10-sample average is used instead of the actual value.
If your moving average has to be long in order to achieve the required smoothing, and you don't really need any particular shape of kernel, then you're better off if you use an exponentially decaying moving average:
a(i+1) = tiny*data(i+1) + (1.0-tiny)*a(i)
where you choose tiny to be an appropriate constant (e.g. if you choose tiny = 1/N, it will have roughly the same amount of averaging as a window of size N, but distributed differently over older points).
Anyway, since the next value of the moving average depends only on the previous one and your data, you don't have to keep a queue or anything. And you can think of this as doing something like, "Well, I've got a new point, but I don't really trust it, so I'm going to keep 80% of my old estimate of the measurement, and only trust this new data point 20%". That's pretty much the same as saying, "Well, I only trust this new point 20%, and I'll use 4 other points that I trust the same amount", except that instead of explicitly taking the 4 other points, you're assuming that the averaging you did last time was sensible so you can use your previous work.
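A minimal sketch of that recurrence, assuming the average is seeded with the first sample (the answer does not say how to start it). `weight` plays the role of tiny in the formula, and 0.2 matches the "trust the new point 20%" example.

```python
def exponential_moving_average(samples, weight):
    """Yield a(i+1) = weight * data(i+1) + (1 - weight) * a(i).

    Assumption: the first sample is used as the initial average.
    """
    average = None
    for sample in samples:
        if average is None:
            average = float(sample)
        else:
            average = weight * sample + (1.0 - weight) * average
        yield average


readings = [100, 100, 100, 180, 100, 100]  # one glitch at index 3
print([round(a, 2) for a in exponential_moving_average(readings, 0.2)])
```

Only the previous average is kept between samples, so no queue of old values is needed.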
Moving average I can get down with ... but it strikes me that it's probably quite inefficient.
There's really no reason a moving average should be inefficient. You keep the number of data points you want in some buffer (like a circular queue). On each new data point, you pop the oldest value and subtract it from a sum, and push the newest and add it to the sum. So every new data point really only entails a pop/push, an addition and a subtraction. Your moving average is always this shifting sum divided by the number of values in your buffer.
It gets a little trickier if you're receiving data concurrently from multiple threads, but since your data is coming from a hardware device that seems highly doubtful to me.
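The running-sum idea could be sketched as follows. `MovingAverage` is an illustrative name, and a `deque` stands in for the circular queue; until the buffer is full it averages over however many values it holds.

```python
from collections import deque


class MovingAverage:
    """Moving average that keeps a running sum instead of re-adding the window."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, value):
        # Push the newest value and add it to the sum.
        self.window.append(value)
        self.total += value
        # Pop the oldest value and subtract it once the buffer is over size.
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total / len(self.window)


average = MovingAverage(3)
print([average.add(v) for v in [1, 2, 3, 4, 5]])  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

With floating-point data the running sum can drift slightly over very long runs; recomputing it from the buffer now and then is one way to deal with that.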
Oh and also: awful self-taught programmers unite! ;)
An exponentially decaying moving average can be calculated "by hand" with only the trend if you use the proper values. See http://www.fourmilab.ch/hackdiet/e4/ for an idea on how to do this quickly with a pen and paper if you are looking for “exponentially smoothed moving average with 10% smoothing”. But since you have a computer, you probably want to be doing binary shifting as opposed to decimal shifting ;)
This way, all you need is a variable for your current value and one for the average. The next average can then be calculated from that.
There's a technique called a range gate that works well with low-occurrence spurious samples. Assuming the use of one of the filter techniques mentioned above (moving average, exponential), once you have "sufficient" history (one time constant) you can test the new, incoming data sample for reasonableness before it is added to the computation.
Some knowledge of the maximum reasonable rate of change of the signal is required. The raw sample is compared to the most recent smoothed value, and if the absolute value of that difference is greater than the allowed range, that sample is thrown out (or replaced with some heuristic, e.g. a prediction based on slope, a differential, or the "trend" prediction value from double exponential smoothing).
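One way a range gate might be combined with exponential smoothing is sketched below. The function name, the `warmup` count (standing in for "sufficient history") and the choice to simply repeat the previous smoothed value for a rejected sample are all assumptions made for illustration.

```python
def range_gated_ema(samples, weight, max_step, warmup):
    """Exponentially smooth `samples`, discarding samples that differ from
    the current smoothed value by more than `max_step`.

    The gate only becomes active after `warmup` samples have been accepted.
    """
    smoothed = None
    accepted = 0
    for sample in samples:
        if smoothed is None:
            smoothed = float(sample)
        elif accepted >= warmup and abs(sample - smoothed) > max_step:
            # Out of range: throw the sample away, keep the old estimate.
            yield smoothed
            continue
        else:
            smoothed = weight * sample + (1.0 - weight) * smoothed
        accepted += 1
        yield smoothed


readings = [10, 10, 10, 10, 90, 10]
print(list(range_gated_ema(readings, weight=0.5, max_step=5, warmup=3)))
```

A gate this simple would also reject a genuine sudden change forever, so a real implementation would probably reopen the gate after several consecutive rejections.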

Is trigonometry computationally expensive?

I read in an article somewhere that trig calculations are generally expensive. Is this true? And if so, that's why they use trig-lookup tables right?
EDIT: Hmm, so if the only thing that changes is the degrees (accurate to 1 degree), would a look up table with 360 entries (for every angle) be faster?
Expensive is a relative term.
The mathematical operations that will perform fastest are those that can be performed directly by your processor. Certainly integer add and subtract will be among them. Depending upon the processor, there may be multiplication and division as well. Sometimes the processor (or a co-processor) can handle floating point operations natively.
More complicated things (e.g. square root) require a series of these low-level calculations to be performed. These things are usually accomplished using math libraries (written on top of the native operations your processor can perform).
All of this happens very very fast these days, so "expensive" depends on how much of it you need to do, and how quickly you need it to happen.
If you're writing real-time 3D rendering software, then you may need to use lots of clever math tricks and shortcuts to squeeze every bit of speed out of your environment.
If you're working on typical business applications, odds are that the mathematical calculations you're doing won't contribute significantly to the overall performance of your system.
On the Intel x86 processor, floating point addition or subtraction requires 6 clock cycles, multiplication requires 8 clock cycles, and division 30-44 clock cycles. But cosine requires between 180 and 280 clock cycles.
It's still very fast, since the x86 does these things in hardware, but it's much slower than the more basic math functions.
Since sin(), cos() and tan() are mathematical functions which are calculated by summing a series, developers will sometimes use lookup tables to avoid the expensive calculation.
The tradeoff is in accuracy and memory. The greater the need for accuracy, the greater the amount of memory required for the lookup table.
Take a look at the following table accurate to 1 degree.
http://www.analyzemath.com/trigonometry/trig_1.gif
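A 360-entry table of the kind asked about in the question could be built like this. It is a sketch only: the names are illustrative, angles are rounded to the nearest whole degree, and in an interpreted language such as Python the table is unlikely to beat the library call, so it shows the structure rather than a guaranteed speedup.

```python
import math

# One sine value per whole degree, computed once up front.
SINE_TABLE = [math.sin(math.radians(degree)) for degree in range(360)]


def sin_degrees(angle):
    """Sine of an angle in degrees, accurate to the nearest whole degree."""
    return SINE_TABLE[round(angle) % 360]


def cos_degrees(angle):
    """Cosine from the same table, using cos(x) = sin(x + 90 degrees)."""
    return SINE_TABLE[(round(angle) + 90) % 360]


print(sin_degrees(30), cos_degrees(60), sin_degrees(-90))
```

Finer resolution means a proportionally larger table, which is the accuracy-versus-memory tradeoff described above.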
While the quick answer is that they are more expensive than the primitive math functions (addition/multiplication/subtraction etc...) they are not -expensive- in terms of human time. Typically the reason people optimize them with look-up tables and approximations is because they are calling them potentially tens of thousands of times per second and every microsecond could be valuable.
If you're writing a program and just need to call it a couple times a second the built-in functions are fast enough by far.
I would recommend writing a test program and timing them for yourself. Yes, they're slow compared to plus and minus, but they're still single processor instructions. It's unlikely to be an issue unless you're doing a very tight loop with millions of iterations.
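Such a test program might look like the sketch below, using Python's `timeit` module. The statements and iteration count are arbitrary choices, and in Python the interpreter overhead dominates, so the gap between addition and sine will look smaller than it would in compiled code.

```python
import math
import timeit


def nanoseconds_per_call(statement, number=200_000):
    """Average time of one evaluation of `statement`, in nanoseconds."""
    seconds = timeit.timeit(
        statement, globals={"math": math, "x": 0.7}, number=number
    )
    return seconds / number * 1e9


for statement in ("x + x", "x * x", "math.sqrt(x)", "math.sin(x)", "math.cos(x)"):
    print(f"{statement:14s} {nanoseconds_per_call(statement):8.1f} ns")
```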
Yes (relative to other mathematical operations such as multiply and divide): if you're doing something realtime (matrix ops, video games, whatever), you can knock off lots of cycles by moving your trig calculations out of your inner loop.
If you're not doing something realtime, then no, they're not expensive (relative to operations such as reading a bunch of data from disk, generating a webpage, etc.). Trig ops are hopefully done in hardware by your CPU (which can do billions of floating point operations per second).
If you always know the angles you are computing, you can store them in a variable instead of calculating them every time. This also applies within your method/function call where your angle is not going to change. You can be smart by using some formulas (calculating sin(theta) from sin(theta/2), knowing how often the values repeat - sin(theta + 2*pi*n) = sin(theta)) and reducing computation. See this wikipedia article
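Storing the values instead of recomputing them can be as simple as this sketch, which rotates a batch of 2D points by one fixed angle. The function name is illustrative; the point is that the sine and cosine are computed once, outside the loop.

```python
import math


def rotate_points(points, angle_radians):
    """Rotate (x, y) points about the origin by a single fixed angle."""
    # The angle does not change inside the loop, so evaluate trig once.
    cos_a = math.cos(angle_radians)
    sin_a = math.sin(angle_radians)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


print(rotate_points([(1.0, 0.0), (0.0, 2.0)], math.pi / 2))
```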
Yes, it is. Trig functions are computed by summing up a series, so in general terms they are a lot more costly than a simple mathematical operation. The same goes for sqrt.

Efficiency/speed for trigonometric functions

In a game I'm making, I've got two points, pt1 and pt2, and I want to work out the angle between them. I've already worked out the distance, in an earlier calculation. The obvious way would be to arctan the horizontal distance over the vertical distance (tan(theta) = opp/adj).
I'm wondering though, as I've already calculated the distance, would it be quicker to use arcsine/arccosine with the distance and dx or dy?
Also, might I be better off pre-calculating in a table?
I suspect there's a risk of premature optimization here. Also, be careful about your geometry. Your opposite/adjacent approach is a property of right angle triangles, is that what you actually have?
I'm assuming your points are planar, and so for the general case you have them implicitly representing two vectors from the origin (call these v1 and v2), so your angle is
theta=arccos(dot(v1,v2)/(|v1||v2|)) where |.| is vector length.
Making this faster (assuming the need) will depend on a lot of things. Do you know the vector lengths, or have to compute them? How fast can you do a dot product in your architecture. How fast is acos? At some point tricks like table lookup (probably interpolated) might help but that will cost you accuracy.
It's all trade-offs though, there really isn't a general answer to your question.
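For reference, the arccos formula above translates to something like the following sketch for 2D vectors given as (x, y) tuples. Clamping the cosine to [-1, 1] is an added precaution against rounding error, not something the answer specifies, and zero-length vectors are not handled.

```python
import math


def angle_between(v1, v2):
    """Angle in radians between two 2D vectors: acos(dot / (|v1| |v2|))."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    lengths = math.hypot(*v1) * math.hypot(*v2)
    # Rounding can push the ratio slightly outside [-1, 1]; clamp it.
    cosine = max(-1.0, min(1.0, dot / lengths))
    return math.acos(cosine)


print(angle_between((1.0, 0.0), (0.0, 1.0)))  # a right angle, in radians
```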
[edit: added commentary]
I'd like to re-emphasize that often playing "x is fastest" is a bit of a mug's game with modern CPUs and compilers anyway. You won't know until you measure it and grovel through the generated code. When you hit the point that you really care about it at this level for a (hopefully small) piece of code, you can find out in detail what your system is doing. But it's painstaking. Maybe a table is good. But maybe you've got fast vector computations and a small cache. etc. etc. etc. It all amounts to "it depends". Sorry 'bout that. On the other hand, if you haven't reached the point that you really care so much about this bit of code... you probably shouldn't be thinking about it at this level at all. Make it right. Make it clean (which means abstraction as well as code). Then worry about the overhead.
Aside from all of the wise comments regarding premature optimization, let's just assume this is the hotspot and do a frigg'n benchmark:
Times are in nanoseconds, scaled to normalize 'acos' between the systems.
'acos' simply assumes unit radius i.e. acos(adj), whereas 'acos+div' means acos(adj/hyp).
System 1 is a 2.4GHz i5 running Mac OS X 10.6.4 (gcc 4.2.1)
System 2 is a 2.83GHz Core2 Quad running Red Hat 7 Linux 2.6.28 (gcc 4.1.2)
System 3 is a 1.66GHz Atom N280 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
System 4 is a 2.40GHz Pentium 4 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
Summary: Relative performance is all over the map. Sometimes atan2 is faster, sometimes it's slower. Very strangely, on some systems doing acos with a division is faster than doing it without. Test on your own system :-/
If you're going to be doing this many times, pre-calculate in a table. Performance will be much better this way.
Tons of good answers here.
By the way, if you use Math.atan2, you get a full 2π of angles out of it.
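Python's `math.atan2` behaves the same way as the `Math.atan2` mentioned above. A small sketch, with `pt1` and `pt2` as (x, y) tuples and an illustrative function name, shows that the signs of dx and dy select the correct quadrant.

```python
import math


def direction_angle(pt1, pt2):
    """Angle in radians, in (-pi, pi], of the direction from pt1 to pt2."""
    dx = pt2[0] - pt1[0]
    dy = pt2[1] - pt1[1]
    # atan2 takes dy and dx separately, so no quadrant information is lost
    # and dx == 0 needs no special case.
    return math.atan2(dy, dx)


print(math.degrees(direction_angle((0, 0), (-1, 1))))
print(math.degrees(direction_angle((0, 0), (0, -1))))
```

A plain arctan of dy/dx would map both (1, 1) and (-1, -1) to the same angle.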
I would just do it, then run it flat out. If you don't like the speed, and if samples show that you're actually in that code most of the time and not someplace else, try replacing it with table lookup. If you don't need precision closer than 1 degree, you could use a pretty small table and interpolation.
Also, you may want to memoize the function. Why recompute something you already did recently?
Added: If you use a table, it only has to cover angles from 0-45 degrees (and it can be hard-coded). You can get everything else by symmetry.
From a pure speed standpoint, a precalculated table and a closest-match lookup would be best. It involves some overhead, of course, depending on how fine-grained you need the angle to be, but it's more than worth it if you're doing this calculation a lot (or in a tight loop), as those are going to be expensive calculations.
Get it right first!
And then profile and optimize. Table lookup is a good candidate for sure, but be sure to have your calculation right before doing anything fancy.
If you're interested in big-O notation, all the methods you might use are O(1).
If you're interested in what works fastest, test it. Write a wrapper function, one that calls your preferred method but can be easily changed, and test with that. Make sure that your application spends a noticeable amount of time doing this, so you aren't wasting your own time. Try whatever ways occur to you. Ideally, run it on more than one different CPU.
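A wrapper of that kind might be sketched like this. All names are illustrative, both candidates take the already-known distance as an argument, and the timings only mean something on the machine that produced them.

```python
import math
import timeit


def angle_atan2(dx, dy, dist):
    return math.atan2(dy, dx)


def angle_acos(dx, dy, dist):
    # Reuses the precomputed distance; the sign of dy picks the half-plane.
    angle = math.acos(dx / dist)
    return angle if dy >= 0 else -angle


angle_impl = angle_atan2  # change this one line to switch methods


def angle(dx, dy, dist):
    """The only function the rest of the game would call."""
    return angle_impl(dx, dy, dist)


for impl in (angle_atan2, angle_acos):
    seconds = timeit.timeit(lambda: impl(3.0, 4.0, 5.0), number=100_000)
    print(f"{impl.__name__:12s} {seconds / 100_000 * 1e9:7.1f} ns per call")
```

Checking that the candidates agree on a handful of inputs before timing them keeps the "get it right first" advice intact.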
I've become very leery of predicting what will take more or less time on modern processors. Lookup tables used to be the answer if you needed speed, but you don't know a priori the effects on caching or how long it's going to take to normalize and look up versus how long it's going to take to do a trig function on a particular CPU.
Given that this is for a game, you probably care about speed. A lookup table is definitely the fastest but you trade accuracy for speed with this method. So how accurate must you be to meet requirements? Only you can answer that. Before you trade accuracy, determine first if you have a speed problem. All of the trigonometric functions are calculated using numerical methods (research numerical analysis to learn more). Some trig functions have more expensive methods than others because they rely on series that converge more slowly, and who knows, your computer may have different implementations for these functions than another computer. At any rate, you can find out for yourself how expensive these functions are by writing some small programs that loop through as many iterations as you desire, with increments of your choosing, all the while timing the outcomes. Then you can pick the fastest method.
While others are very right to mention that you are almost certainly falling into the pit of premature optimization, when they say that trigonometric functions are O(1) they're not telling the whole story.
Most trigonometric function implementations are actually O(N) in the magnitude of the input value. This is because the trig functions are most efficiently calculated on a small interval like [0, 2π) (or, for the best implementations, even smaller parts of this interval, but that one suffices to explain things). So the algorithm looks something like this, in pseudo-Python:
    import math

    TwoPi = 2 * math.pi

    def Cosine_0to2Pi(x):
        # A series approximation of some kind, or CORDIC, or perhaps a table.
        # This function requires 0 <= x < 2Pi.
        # math.cos is only a stand-in here so that the sketch runs.
        return math.cos(x)

    def MyCosine(x):
        if x < 0:
            x = -x  # cosine is even: cos(-x) == cos(x)
        while x >= TwoPi:
            x -= TwoPi  # argument reduction: one subtraction per period
        return Cosine_0to2Pi(x)
Even microcoded CPU instructions like the x87's FSINCOS end up doing something like this internally. So trig functions, because they are periodic, usually take O(N) time to do the argument reduction. There are two caveats, however:
If you have to calculate a ton of values off the principal domain of the trig functions, your math is probably not very well thought out.
Big-O notation hides a constant factor. Argument reduction has a very small constant factor, because it's simple to do. Thus the O(1) part is going to dominate the O(N) part for just about every input.
