Faster, very accurate approximation for tanh

I was playing with tanh, looking for something very close to it but not as expensive to compute. I came up with:
2/(1+exp(-2*x))-1
It is VERY close. The biggest delta I saw was on the order of 10^-15. It's still not as cheap as polynomial approximations, though.
Can somebody differentiate that for me?
:)

1-(2*(1/(1+exp(x*2)))) works well on limited hardware
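For what it's worth, here is a quick sketch (plain Python, arbitrary test points) that checks both expressions against math.tanh; the derivative asked for above works out to 4*exp(-2x)/(1+exp(-2x))^2, which simplifies to 1 - tanh(x)^2.

import math

def tanh_a(x):
    # 2/(1+exp(-2*x))-1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_b(x):
    # 1-(2*(1/(1+exp(x*2))))
    return 1.0 - 2.0 * (1.0 / (1.0 + math.exp(2.0 * x)))

def tanh_a_derivative(x):
    # d/dx [2/(1+exp(-2x)) - 1] = 4*exp(-2x)/(1+exp(-2x))^2 = 1 - tanh(x)^2
    e = math.exp(-2.0 * x)
    return 4.0 * e / (1.0 + e) ** 2

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_a(x) - math.tanh(x), tanh_b(x) - math.tanh(x), tanh_a_derivative(x))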

How do I know how long an algorithm (e.g. k-means) will take to run?

For example, I'm running the k-means algorithm on 1 million data points. Each point is 128-dimensional, and I want 1000 clusters. Wikipedia tells me that its complexity is n^(dk+1)log(n), where d is the number of dimensions, k the number of clusters and n the number of instances.
Knowing that, how can I get an estimate of how long it will run on my MacBook Pro (8 GB RAM, 2.6 GHz Intel Core i5)? What is the best way to calculate this estimate? Is there a way to calculate it theoretically, or should I do a few experiments on smaller sets and see how long they take?
I would really like to have rough estimate before I spend hours or days without knowing when it might stop.
Thanks so much for your help! I really appreciate it :).
P.S. I'm using Python's SciPy kmeans.
Do some experiments. k-means is popular only because it usually runs much, much faster than that asymptotic bound would suggest.
There are so many machine-related and algorithm-related specifics hidden in the big-O constant that it won't be possible to estimate it (for your machine and your SciPy) theoretically.
However, nothing prevents you from finding the constant experimentally - as you said, "do a few experiments on smaller sets".
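A minimal timing sketch along those lines, assuming SciPy's scipy.cluster.vq.kmeans and random data standing in for your real points (the subset sizes and single restart are arbitrary choices):

import time
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(0)
d, k = 128, 1000
for n in (10_000, 20_000, 40_000):          # small subsets; double n and watch how time grows
    data = whiten(rng.standard_normal((n, d)))
    start = time.time()
    kmeans(data, k, iter=1)                 # a single restart keeps the experiment short
    print(n, round(time.time() - start, 1), "seconds")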

Genetic algorithm: Rule of thumb for choosing parameters to solve large solution space

I have a genetic algorithm with individuals composed of 2000 bits, where I try to optimize 4 variables. Is there any (preferably relatively straightforward) rule of thumb for setting parameters such as population size, number of generations, and mutation probability?
Simply put: no, there is no simple way to choose these numbers. Everything depends on your domain and required outcome.
Population size can be determined with an experiment relatively quickly: try 100, 1,000, 10K, 100K and a million; whichever gives you a better result, go with that.
The number of generations is the hardest one to determine. Usually the best result improves rapidly at the start of processing, then slows almost to a halt. That is usually the time to either stop and take the best result or change some parameters, such as the mutation rate. So it is up to you to decide when the result is good enough: usually it's a balance between time spent and rate of improvement.
In my experiments, and confirmed by the literature, it is recommended to keep the mutation rate at a minimum at the start of processing (something like 0.01%). Once your rate of improvement slows down, introduce more mutation to explore a wider range of solutions. At one point I increased the mutation rate to something ridiculous, like 50%. This helped disturb the stable state of the system, but the system returned to a stable state fairly quickly and the final result was not much better than the one I had before the "nuclear bomb". I came to the conclusion that the highest mutation rate (in my domain) should be no more than 5%, and only when the rate of improvement is almost zero.
Hopefully this helps a bit, but what you ask is not trivial, and people write dissertations on each of these topics separately. I also recommend reading through a couple of articles on them - that will help you significantly.
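To make the "raise mutation only when improvement stalls" idea concrete, here is a stripped-down sketch; the fitness function, selection scheme, and all the numbers are placeholders, not a recipe for your problem:

import random

GENOME_BITS, POP_SIZE = 2000, 200          # placeholders; tune experimentally as above

def fitness(genome):                        # stand-in objective; substitute your 4-variable one
    return sum(genome)

def mutate(genome, rate):
    return [bit ^ (random.random() < rate) for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_BITS)] for _ in range(POP_SIZE)]
mutation_rate, best, stalled = 0.0001, float("-inf"), 0
for generation in range(200):
    population.sort(key=fitness, reverse=True)
    current = fitness(population[0])
    stalled = 0 if current > best else stalled + 1
    best = max(best, current)
    if stalled > 20:                                    # improvement has nearly stopped:
        mutation_rate = min(0.05, mutation_rate * 2)    # raise mutation, capped near 5%
    parents = population[:POP_SIZE // 2]
    population = parents + [mutate(random.choice(parents), mutation_rate) for _ in parents]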

Memory and time issues when dividing two matrices

I have two sparse matrices in MATLAB:
M1 of size 9,000 x 1.8 million and M2 of size 1.8 million x 1.8 million.
Now I need to calculate the expression
M1/M2
and it took about an hour. Is that normal? Is there a more efficient way in MATLAB to get around this time issue? It's a long time, and if I have to repeat it over many iterations it will keep taking an hour each time. Any suggestions?
A quick back-of-the-envelope calculation, assuming some iterative method like conjugate gradient or the Kaczmarz method is used and plugging in the sizes, makes me believe that an hour isn't bad.
Because of the tridiagonal structure of the matrix that's being "inverted" (even if nothing is inverted explicitly), both of those methods will take a number of instructions on the order of "some near-unity scalar factor" times ~9000 times 1.8e6 times "the number of iterations required for convergence". The product of the two things in quotes is probably around 50 (minimum) to around 1000 (maximum). I didn't cherry-pick these to make your math work; these are about what I'd expect from having done such calculations before. If you assume about 1e9 instructions per second (which doesn't account much for memory access etc.) you get around 13 minutes to around 4.5 hours.
Thus, it seems in the right range for an algorithm that's exploiting sparsity.
You might be able to exploit the structure better yourself if you know it, but probably not by much.
Note, this isn't to say that 13 minutes is achievable.
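Spelling out that arithmetic (rough numbers, same assumptions as above):

rows, cols = 9e3, 1.8e6
for factor in (50, 1000):                    # (near-unity scalar) * (iterations to converge)
    instructions = rows * cols * factor
    minutes = instructions / 1e9 / 60        # assume ~1e9 instructions per second
    print(factor, round(minutes, 1), "minutes")   # ~13.5 minutes up to ~270 minutes (4.5 hours)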
Edit: One side note: I'm not sure what method is being used, but I assumed iterative methods. It's also possible that direct methods are used (as explained here). Direct methods can be very efficient for sparse systems if you exploit the sparsity right. It's very possible that Matlab is using these by default, but it's worth investigating what Matlab is doing in your case.
In my limited experience, iterative methods were usually preferred over direct methods as the size of the system gets large (and yours is large). Our linear systems worked out to be block tridiagonal as well, as they often do in image processing.

Single-layer Perceptron

I'm building a single-layer perceptron that has a reasonably long feature vector (30-200k), all normalised.
Let's say I have 30k features which are somewhat useful at predicting a class but then add 100 more features which are excellent predictors. The accuracy of the predictions only goes up a negligible amount. However, if I manually increase the weights on the 100 excellent features (say by 5x), the accuracy goes up several percent.
It was my impression that the nature of the training process should give better features a higher weight naturally. However, it seems like the best features are being 'drowned out' by the worse ones.
I tried running it with a larger number of iterations, but that didn't help.
How can I adapt the algorithm to better weight features in a reasonably simple way? Also, a reasonably fast way; if I had fewer features it'd be easy to just run the algorithm leaving one out at a time but it's not really feasible with 30k.
My experience with implementing perceptron-based networks is that it takes a lot of iterations to learn something. I believe I used each sample about 1,000 times to learn the XOR function (with only 4 inputs). So if you have 200k inputs it will take a lot of samples and a lot of time to train your network.
I have a few suggestions for you:
Try to reduce the size of the input (aggregate several inputs into a single one, or remove redundant ones).
Try to use each sample more times. As I said, it may take a lot of iterations to learn even a simple function (see the sketch below).
Hope this helps.
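As an illustration of the second suggestion, a minimal multi-epoch perceptron loop with synthetic NumPy data; the sizes, learning rate, and epoch count are arbitrary placeholders, the point is simply that every sample gets reused many times:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 500, 10_000
X = rng.standard_normal((n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:100] = 5.0                                   # a few "excellent" features
y = np.sign(X @ true_w)                              # labels in {-1, +1}

w, lr = np.zeros(n_features), 0.01
for epoch in range(50):                              # reuse every sample many times
    mistakes = 0
    for i in range(n_samples):
        if y[i] * (X[i] @ w) <= 0:                   # misclassified: standard perceptron update
            w += lr * y[i] * X[i]
            mistakes += 1
    print(epoch, mistakes)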

Efficiency/speed for trigonometric functions

In a game I'm making, I've got two points, pt1 and pt2, and I want to work out the angle between them. I've already worked out the distance, in an earlier calculation. The obvious way would be to arctan the horizontal distance over the vertical distance (tan(theta) = opp/adj).
I'm wondering though, as I've already calculated the distance, would it be quicker to use arcsine/arccosine with the distance and dx or dy?
Also, might I be better off pre-calculating in a table?
I suspect there's a risk of premature optimization here. Also, be careful about your geometry. Your opposite/adjacent approach is a property of right angle triangles, is that what you actually have?
I'm assuming your points are planar, so for the general case you have them implicitly representing two vectors from the origin (call these v1 and v2), and your angle is
theta=arccos(dot(v1,v2)/(|v1||v2|)) where |.| is vector length.
Making this faster (assuming the need) will depend on a lot of things. Do you know the vector lengths, or do you have to compute them? How fast can you do a dot product on your architecture? How fast is acos? At some point tricks like table lookup (probably interpolated) might help, but that will cost you accuracy.
It's all trade-offs though, there really isn't a general answer to your question.
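A small sketch of both formulations (assuming planar points; the function names are just for illustration):

import math

def angle_between(v1, v2):
    # theta = arccos(dot(v1, v2) / (|v1| |v2|))
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.acos(dot / (math.hypot(*v1) * math.hypot(*v2)))

def segment_angle(pt1, pt2):
    # angle of the segment pt1 -> pt2 relative to the x-axis; atan2 covers all quadrants
    return math.atan2(pt2[1] - pt1[1], pt2[0] - pt1[0])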
[edit: added commentary]
I'd like to re-emphasize that playing "x is fastest" is often a bit of a mug's game with modern CPUs and compilers anyway. You won't know until you measure it and grovel through the generated code. When you hit the point that you really care about it at this level for a (hopefully small) piece of code, you can find out in detail what your system is doing. But it's painstaking. Maybe a table is good. But maybe you've got fast vector computations and a small cache. Etc., etc. It all amounts to "it depends". Sorry 'bout that. On the other hand, if you haven't reached the point where you really care this much about this bit of code, you probably shouldn't be thinking about it at this level at all. Make it right. Make it clean (which means abstraction as well as code). Then worry about the overhead.
Aside from all of the wise comments regarding premature optimization, let's just assume this is the hotspot and do a frigg'n benchmark:
Times are in nanoseconds, scaled to normalize 'acos' between the systems.
'acos' simply assumes unit radius i.e. acos(adj), whereas 'acos+div' means acos(adj/hyp).
System 1 is a 2.4GHz i5 running Mac OS X 10.6.4 (gcc 4.2.1)
System 2 is a 2.83GHz Core2 Quad running Red Hat 7 Linux 2.6.28 (gcc 4.1.2)
System 3 is a 1.66GHz Atom N280 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
System 4 is a 2.40GHz Pentium 4 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
Summary: relative performance is all over the map. Sometimes atan2 is faster, sometimes it's slower. Very strangely, on some systems doing acos with a division is faster than doing it without. Test on your own system :-/
If you're going to be doing this many times, pre-calculate in a table. Performance will be much better this way.
Tons of good answers here.
By the way, if you use Math.atan2, you get a full 2π of angles out of it.
I would just do it, then run it flat out. If you don't like the speed, and if samples show that you're actually in that code most of the time and not someplace else, try replacing it with table lookup. If you don't need precision closer than 1 degree, you could use a pretty small table and interpolation.
Also, you may want to memoize the function. Why recompute something you already did recently?
Added: If you use a table, it only has to cover angles from 0-45 degrees (and it can be hard-coded). You can get everything else by symmetry.
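A sketch of that idea for atan: tabulate only [0, 1] (i.e. 0-45 degrees), interpolate, and recover the rest of the circle by symmetry. The table size and the linear interpolation are arbitrary choices here, not a tuned implementation:

import math

TABLE_SIZE = 256
ATAN_TABLE = [math.atan(i / (TABLE_SIZE - 1)) for i in range(TABLE_SIZE)]

def atan_lookup(t):                       # t must be in [0, 1]
    pos = t * (TABLE_SIZE - 1)
    i = int(pos)
    if i >= TABLE_SIZE - 1:
        return ATAN_TABLE[-1]
    frac = pos - i
    return ATAN_TABLE[i] * (1 - frac) + ATAN_TABLE[i + 1] * frac

def atan2_lookup(y, x):                   # full-circle angle via symmetry
    ax, ay = abs(x), abs(y)
    if ax >= ay:
        angle = atan_lookup(ay / ax) if ax > 0 else 0.0
    else:
        angle = math.pi / 2 - atan_lookup(ax / ay)
    if x < 0:
        angle = math.pi - angle
    return -angle if y < 0 else angle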
From a pure speed standpoint, a precalculated table and a closest-match lookup would be best. It involves some overhead, of course, depending on how fine-grained you need the angle to be, but it's more than worth it if you're doing this calculation a lot (or in a tight loop), as those are going to be expensive calculations.
Get it right first!
And then profile and optimize. Table lookup is a good candidate for sure, but be sure to have your calculation right before doing anything fancy.
If you're interested in big-O notation, all the methods you might use are O(1).
If you're interested in what works fastest, test it. Write a wrapper function, one that calls your preferred method but can be easily changed, and test with that. Make sure that your application spends a noticeable amount of time doing this, so you aren't wasting your own time. Try whatever ways occur to you. Ideally, run it on more than one different CPU.
I've become very leery of predicting what will take more or less time on modern processors. Lookup tables used to be the answer if you needed speed, but you don't know a priori the effects on caching or how long it's going to take to normalize and look up versus how long it's going to take to do a trig function on a particular CPU.
Given that this is for a game, you probably care about speed. A lookup table is definitely the fastest, but you trade accuracy for speed with this method. So how accurate must you be to meet requirements? Only you can answer that. Before you trade away accuracy, determine first whether you actually have a speed problem. All of the trigonometric functions are calculated using numerical methods (research numerical analysis to learn more). Some trig functions have more expensive methods than others because they rely on series that converge more slowly, and your computer may implement these functions differently from another machine. At any rate, you can find out for yourself how expensive these functions are by writing a small program that loops through as many iterations as you like, with increments of your choosing, timing the outcomes. Then you can pick the fastest method.
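Along those lines, a rough timing sketch (the argument values and iteration count are arbitrary):

import math
import timeit

N = 1_000_000
for name, stmt in [("atan2", "math.atan2(0.3, 0.7)"),
                   ("acos",  "math.acos(0.3)"),
                   ("asin",  "math.asin(0.3)")]:
    seconds = timeit.timeit(stmt, globals={"math": math}, number=N)
    print(name, round(seconds / N * 1e9, 1), "ns per call")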
While others are very right to mention that you are almost certainly falling into the pit of premature optimization, when they say that trigonometric functions are O(1) they're not telling the whole story.
Most trigonometric function implementations are actually O(N) in the value of the input. This is because the trig functions are most efficiently calculated on a small interval like [0, 2π) (or, for the best implementations, even smaller parts of this interval, but that one suffices to explain things). So the algorithm looks something like this, in pseudo-Python:
import math
TwoPi = 2.0 * math.pi

def Cosine_0to2Pi(x):
    # a series approximation of some kind, or CORDIC, or perhaps a table;
    # this function requires 0 <= x < TwoPi (placeholder: defer to math.cos here)
    return math.cos(x)

def MyCosine(x):
    if x < 0:              # cosine is even, so fold negative arguments over
        x = -x
    while x >= TwoPi:      # argument reduction: cost grows with the magnitude of x
        x -= TwoPi
    return Cosine_0to2Pi(x)
Even microcoded CPU instructions like the x87's FSINCOS end up doing something like this internally. So trig functions, because they are periodic, usually take O(N) time to do the argument reduction. There are two caveats, however:
If you have to calculate a ton of values off the principal domain of the trig functions, your math is probably not very well thought out.
Big-O notation hides a constant factor. Argument reduction has a very small constant factor, because it's simple to do. Thus the O(1) part is going to dominate the O(N) part for just about every input.
