Floating point operations speed - performance

Is floating point multiplication by 0.0 faster than an average floating point multiplication? The same question for adding 0.0 and multiplying by 1.0.
To make the question precise: is it faster on recent Intel CPUs?

No, not on modern hardware. Modern CPUs execute ordinary double precision multiplications, additions, and subtractions at a throughput of one or two per cycle, regardless of the operand values. Possible exceptions are denormalized (subnormal) numbers and special values like +/-zero, +/-infinity, and NaNs; if there is any difference at all, it is those cases that take longer, not multiplication by 0.0 or 1.0.
However, as with all performance-related questions, the truth lies in measurement. If this matters to you, measure it; then you will know what to do.
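If you want to measure it yourself, a rough loop like the one below (my own sketch, not a rigorous benchmark) compares multiplication by an ordinary factor, by 1.0, by 0.0, and with a subnormal operand; on recent Intel cores the first three typically time the same, while the subnormal case may be much slower unless flush-to-zero/denormals-are-zero is enabled:
#include <chrono>
#include <cstdio>

// Times a chain of N dependent multiplications; the volatile reads and the
// final volatile store keep the compiler from folding the loop away.
static double time_muls(double start, double factor)
{
    const int N = 50000000;
    volatile double vstart = start, vfactor = factor, sink;
    double acc = vstart;
    double f = vfactor;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        acc *= f;
    auto t1 = std::chrono::steady_clock::now();
    sink = acc;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::printf("mul by 0.9999999        : %.3f s\n", time_muls(1.0, 0.9999999));
    std::printf("mul by 1.0              : %.3f s\n", time_muls(1.0, 1.0));
    std::printf("mul by 0.0              : %.3f s\n", time_muls(1.0, 0.0));
    std::printf("mul with subnormal value: %.3f s\n", time_muls(1e-310, 0.9999999));
}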

Related

What is integer division heavily used for?

An analysis at https://ridiculousfish.com/blog/posts/benchmarking-libdivide-m1-avx512.html finds that Apple has spent a lot of resources in its new CPU on making integer division massively faster.
This is a surprising thing to do. In my experience, integer division is rarely used, except for division by a compile-time constant, which the compiler can replace with a shift or a multiply (a sketch of that trick follows this question).
It was even more surprising to read, in the discussion at https://news.ycombinator.com/item?id=27133804, someone saying:
When I've been micro-optimizing performance-critical code, integer division shows up as a hot spot regularly.
Now I'm really curious: just what are people doing that makes integer division a bottleneck? I'm trying to think of where it could be used. Cases I've seen:
Floating-point emulation. But these days the only CPUs without hardware floating-point are tiny microcontrollers that also wouldn't have hardware integer division anyway.
Hash tables where the number of buckets is a prime number, for a little bit of extra randomness. But it has long been known that this is not the best way to do things; if you don't trust your hash function to provide enough randomness, get a better hash function.
Early 3D like the PlayStation 1 using fixed point coordinates. But everyone nowadays does floating-point 3D.
So what exactly is all this integer division being used for?
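For reference, here is a minimal sketch of my own (not from the linked articles) of the shift/multiply trick mentioned above, the same reciprocal-multiplication idea that compilers and libdivide apply: unsigned division by the constant 3 becomes a widening multiply by a precomputed constant plus a shift.
#include <cassert>
#include <cstdint>

// Reciprocal-multiplication replacement for x / 3 on 32-bit unsigned values:
// 0xAAAAAAAB is ceil(2^33 / 3), so (x * 0xAAAAAAAB) >> 33 equals x / 3
// for every uint32_t x.
static std::uint32_t div3(std::uint32_t x)
{
    return (std::uint32_t)(((std::uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main()
{
    for (std::uint32_t x = 0; x < 1000000; ++x)
        assert(div3(x) == x / 3);                  // spot-check small values
    assert(div3(0xFFFFFFFFu) == 0xFFFFFFFFu / 3);  // and the largest one
}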

Is fixed point math faster than floating point?

Years ago, in the early 1990s, I built graphics packages that optimized calculations using fixed point arithmetic, pre-computed tables for cos and sin, and scaled equations for sqrt and log approximation using Newton's method. These techniques seem to have migrated into graphics hardware and built-in math processors. About 5 years ago, I took a numerical analysis class that touched on some of the old techniques. I've been coding for almost 30 years and rarely see those old fixed point optimizations in use, even after working on GPGPU applications for world-class particle accelerator experiments. Are fixed point methods still useful anywhere in the software industry, or is that knowledge now gone for good?
Fixed point is marginally useful on platforms whose hardware does not support any kind of fractional type of its own; for example, I implemented a 24-bit fixed point type for the PIC16F series microcontrollers (more on why I chose fixed point later).
However, almost every modern CPU supports floating point at the microcode or hardware level, so there isn't much need for fixed point.
Fixed point numbers are limited in the range they can represent. Consider a 64-bit (32.32) fixed point value versus a 64-bit floating point value: the fixed point number has a fractional resolution of 1/2^32, while the floating point number has a relative resolution of about 1/2^53; the fixed point number can represent values only as high as 2^31, while the double can represent values up to nearly 2^1024. And if you need more, most modern x86 CPUs also support 80-bit floating point values.
Of course, the biggest downfall of floating point is limited precision in extreme cases - e.g. in fixed point, it would require fewer bits to represent 9000000000000000000000000000000.00000000000000000000000000000002 exactly. Of course, with floating point you get better precision for average uses of decimal arithmetic, and I have yet to see an application where the decimal arithmetic is as extreme as the above example yet does not also overflow the equivalent fixed-point size.
The reason I implemented a fixed-point library for the PIC16F rather than use an existing floating point library was code size, not speed: the 16F88 has 384 bytes of usable RAM and room for 4095 instructions in total. To add two fixed point numbers of predefined width, I inlined integer addition with carry-out (the radix point doesn't move, so addition is just integer addition); to multiply two fixed point numbers, I used a simple shift-and-add routine on an extended 32-bit fixed point value, even though that isn't the fastest multiplication approach, in order to save even more code.
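As a rough illustration (not the actual PIC code, and using a hypothetical 16.16 format rather than the 24-bit one described above), the add and the shift-and-add multiply look like this:
#include <cstdint>

// Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fractional bits.
typedef std::uint32_t fix16;

// Addition is plain integer addition - the radix point never moves.
static fix16 fix16_add(fix16 a, fix16 b) { return a + b; }

// Shift-and-add multiply, the kind of routine used when the MCU has no
// hardware multiplier. The full product is a 32.32 value, so accumulate in
// 64 bits and shift right by 16 to rescale back to 16.16.
static fix16 fix16_mul(fix16 a, fix16 b)
{
    std::uint64_t acc = 0;
    std::uint64_t addend = a;          // multiplicand, shifted left each step
    while (b != 0) {
        if (b & 1u)
            acc += addend;             // add for each set bit of the multiplier
        addend <<= 1;
        b >>= 1;
    }
    return (fix16)(acc >> 16);         // drop the extra 16 fraction bits
}

// Example: 1.5 * 2.5 -> fix16_mul(0x00018000, 0x00028000) == 0x0003C000 (3.75)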
So, when I needed only one or two basic arithmetic operations, I could add them without using up all of the program storage. For comparison, a freely available floating point library on that platform was about 60% of the total storage on the device. In contrast, software floating point libraries are mostly just wrappers around a few arithmetic operations, and in my experience they are all-or-nothing, so cutting the code size in half because you only need half of the functions doesn't work so well.
Fixed point generally doesn't provide much of an advantage in speed though, because of its limited representation range: how many bits would you need to represent 1.7E+/-308 with 15 digits of precision, the same as a 64-bit double? If my calculations are correct, you'd need somewhere around 2020 bits. I'd bet the performance of that wouldn't be so good.
Thirty years ago, when hardware floating point was relatively rare, very special-purpose fixed-point (or even scaled-integer) arithmetic could provide significant performance gains over software floating point, but only if the allowable range of values could be represented efficiently with scaled-integer arithmetic. The original Doom used this approach when no coprocessor was available, such as on my 486sx-25 in 1992. Typing this on an overclocked hyperthreaded Core i7 running at 4.0 GHz, with a GeForce card that has over 1000 independent floating point compute units, it just seems wrong somehow - although I'm not sure which seems wrong, the 486 or the i7...
Floating point is more general purpose due to the range of values it can represent, and with it implemented in hardware on both CPUs and GPUs, it beats fixed point in every way, unless you really need more than 80-bit floating point precision at the expense of huge fixed-point sizes and very slow code.
Well, I have been coding for two decades, and in my experience there are three main reasons to use fixed point:
No FPU available
Fixed point is still relevant for DSPs, MCUs, FPGAs and chip design in general. Also, no floating point unit can work without a fixed point (integer) core, so all bigdecimal libraries must use fixed point too... Graphics cards also use fixed point a lot (normalized device coordinates).
Insufficient FPU precision
If you go into astronomical computations, you will soon hit the extremes and the need to handle them. For example, simple Newtonian/D'Alembert integration or atmospheric ray-tracing hits the precision barrier pretty fast at large scales and low granularity. I usually use an array of floating point doubles to remedy that. For situations where the input/output range is known, fixed point is usually the better choice. See some examples of hitting the FPU barrier:
Is it possible to make realistic n-body solar system simulation in matter of size and mass?
ray and ellipsoid intersection accuracy improvement
Speed
Back in the old days the FPU was really slow (especially on the x86 architecture) because of the coprocessor interface it used: on machines without an FPU every floating point instruction trapped to a software emulator, and even with an FPU there was the cost of transferring operands and results. So a few bit-shift operations in the CPU's ALU were usually faster.
Nowadays this is no longer true, and ALU and FPU speeds are comparable. For example, here are my measurements of CPU/FPU operations (from a small Win32 C++ app):
fcpu(0) = 3.194877 GHz // tested on first core of AMD-A8-5500 APU 3.2GHz Win7 x64 bit
CPU 32bit integer arithmetic:
add = 387.465 MIPS
sub = 376.333 MIPS
mul = 386.926 MIPS
div = 245.571 MIPS
mod = 243.869 MIPS
FPU 32bit float arithmetic:
add = 377.332 MFLOPS
sub = 385.444 MFLOPS
mul = 383.854 MFLOPS
div = 367.520 MFLOPS
FPU 64bit double arithmetic:
add = 385.038 MFLOPS
sub = 261.488 MFLOPS
mul = 353.601 MFLOPS
div = 309.282 MFLOPS
The values vary from run to run, but compared across data types they are almost identical. Just a few years back, doubles were slower due to the 2x larger data transfers, but there are other platforms where the speed difference may still hold.
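For reference, the loop behind numbers like these can be as simple as the following sketch (my own code, not the original Win32 app). Compile with optimizations; the volatile accumulator keeps the optimizer from collapsing the loop, at the cost of some load/store overhead, so treat the absolute numbers with care:
#include <chrono>
#include <cstdint>
#include <cstdio>

// Measure roughly how many additions per second a given type sustains.
// The volatile accumulator stops the compiler from folding the whole loop
// into a single expression.
template <typename T>
static double add_mops(T start, T step)
{
    const int N = 100000000;
    volatile T x = start;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        x = x + step;                  // swap in -, *, / to test other operations
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return N / seconds / 1e6;
}

int main()
{
    std::printf("int32  add = %8.3f MIPS\n",   add_mops<std::int32_t>(1, 3));
    std::printf("float  add = %8.3f MFLOPS\n", add_mops<float>(1.0f, 0.5f));
    std::printf("double add = %8.3f MFLOPS\n", add_mops<double>(1.0, 0.5));
}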

Extremely fast method for modular exponentiation with modulus and exponent of several million digits

As a hobby project I'm taking a crack at finding really large prime numbers. The primality tests for this contain modular exponentiation calculations, i.e. a^e mod n. Let's call this the modpow operation to keep the explanation simple. I want to speed up this particular calculation.
Currently I am using GMP's mpz_powm function, but it is kind of slow. The reason I think it's too slow is that a single call to GMP's modpow is slower than a full-blown primality test run by the software PFGW on the same large number. (To be clear, I am comparing just GMP's modpow part, not my whole custom primality testing routine.) PFGW is considered the fastest in its field, and for my use case it runs a Brillhart-Lehmer-Selfridge primality test - which also uses the modpow procedure - so it's not mathematical cleverness that makes PFGW faster in that respect (please correct me if I'm wrong here). It looks like the bottleneck in GMP is the modpow operation. An example runtime for numbers which have a little over 20,000 digits: GMP's modpow operation takes about 45 seconds, while PFGW finishes the whole primality test (involving a modpow) in 9 seconds flat. The difference gets even more impressive with bigger numbers. GMP uses FFT multiplication and Montgomery reduction for this test comparison, see comments on this post below.
I did some research. So far I understand that the modpow algorithm uses exponentiation by squaring, integer multiplication and modular reduction - these all sound very familiar to me (a minimal square-and-multiply sketch follows the lists below). Several helper methods can improve the running time of the integer multiplication:
Montgomery multiplication
FFT multiplication
To improve the running time of the exponentiation-by-squaring part, one may use a signed digit representation to reduce the number of multiplications (i.e. digits are 0, 1 or -1, and the digit string is chosen so that it contains many more zeros than the original base-2 representation - this reduces the number of multiply steps in exponentiation by squaring).
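For illustration, here is a small sketch of my own (for word-sized exponents only) of the usual non-adjacent-form recoding; bignum implementations recode the same way, just limb by limb:
#include <cstdint>
#include <vector>

// Recode an exponent into non-adjacent form: digits are -1, 0 or +1 and no
// two adjacent digits are nonzero, so about two thirds of the digits are
// zero on average - fewer multiply steps during exponentiation by squaring.
// (Assumes e < 2^64 - 1 so the e + 1 below cannot wrap.)
static std::vector<int> to_naf(std::uint64_t e)
{
    std::vector<int> digits;               // least significant digit first
    while (e != 0) {
        int d = 0;
        if (e & 1) {
            d = (e & 2) ? -1 : 1;          // choose the digit from the low two bits
            e = (d == 1) ? e - 1 : e + 1;  // make the low bit zero before shifting
        }
        digits.push_back(d);
        e >>= 1;
    }
    return digits;
}

// Example: to_naf(7) yields {-1, 0, 0, 1}, i.e. 7 = 2^3 - 2^0.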
For optimizing the modulo part of the operation, I know of these methods:
Montgomery reduction
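To make the overall structure concrete, here is a minimal square-and-multiply modpow for machine-word operands (my own sketch, using the GCC/Clang unsigned __int128 extension for the intermediate products); a bignum implementation keeps the same loop structure and swaps in FFT multiplication and Montgomery reduction for the plain * and % below:
#include <cstdint>

// Right-to-left binary exponentiation: square the base each step and
// multiply it into the result whenever the corresponding exponent bit is set.
static std::uint64_t modpow(std::uint64_t base, std::uint64_t exp, std::uint64_t mod)
{
    unsigned __int128 result = 1;
    unsigned __int128 b = base % mod;
    while (exp != 0) {
        if (exp & 1)
            result = (result * b) % mod;   // multiply step
        b = (b * b) % mod;                 // squaring step
        exp >>= 1;
    }
    return (std::uint64_t)result;
}

// Example: modpow(5, 117, 19) == 1.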
So here is the 150,000 dollar question: is there a software library available that does the modpow operation efficiently for a very large base, exponent and modulus? (I'm aiming for several million digits.) If you suggest an option, please try to explain the inner workings of the algorithm for the case of millions of digits in the base, modulus and exponent, as some libraries use different algorithms depending on the number of digits. Basically, I am looking for a library which supports the techniques mentioned above (or possibly more clever ones) and performs well while running the algorithm (better than GMP, at least). So far I've searched for, found and tried GMP and PFGW, but found neither satisfying (PFGW is fast, but I'm only interested in the modpow operation and there is no direct programming interface to it). I'm hoping an expert in the field can suggest a library with these capabilities, as there seem to be very few that can handle these requirements.
Edit: made the question more concise, as it was marked too broad.
First off, re. the Answer 1 writer's comment "I do not use GMP but I suspect when they wrote they use FFT they really mean the NTT" - no, when GMP says "FFT" it means a floating-point FFT. IIRC they also have some NTT-based routines, but for bignum mul those are uncompetitive with FFT.
The reason a well-tuned FFT-mul beats any NTT is that the slight loss of per-word precision due to roundoff error accumulation is more than made up for by the vastly superior floating-point capabilities of modern CPU offerings, especially when one considers high-performance implementations which make use of the vector-math capabilities of CPUs such as the x86_64 family, the current iterations of which - Intel Haswell, Broadwell and Skylake - have massive vector floating-point capability. (I don't cite AMD in this regard because their AVX offerings have lagged far behind Intel's; their high-water mark was circa 2002 and since then Intel has been beating the pants off them in progressively-worse fashion each year.) The reason GMP disappoints in this area is that GMP's FFT is, relatively speaking, crap. I have great respect for the GMP coders overall, but FFT timings are FFT timings, you don't get points for effort or e.g. having a really good bignum add. Here is a paper detailing a raft of GMP FFT-mul improvements:
Pierrick Gaudry, Alexander Kruppa, Paul Zimmermann: "A GMP-based Implementation of Schönhage-Strassen's Large Integer Multiplication Algorithm" [http://www.loria.fr/~gaudry/publis/issac07.pdf]
This is from 2007, but AFAIK the performance gap noted in the snippet below has not been narrowed; if anything it has widened. The paper is excellent for detailing various mathematical and algorithmic improvements which can be deployed, but let's cut to the money quote:
"A program that implements a complex floating-point FFT for integer multiplication is George Woltman’s Prime95. It is written mainly for testing large Mersenne numbers 2^p − 1 for primality in the in the Great Internet Mersenne Prime Search [24]. It uses a DWT for multiplication mod a*2^n ± c, with a and c not too large, see [17]. We compared multiplication modulo 2^2wn − 1 in Prime95 version 24.14.2 with multiplication of n-word integers using our SSA implementation on a Pentium 4 at 3.2 GHz, and on an Opteron 250 at 2.4 GHz, see Figure 4. It is plain that Prime95 beats our im- plementation by a wide margin, in fact usually by more than a factor of 10 on a Pentium 4, and by a factor between 2.5 and 3 on the Opteron."
The next few paragraphs are a raft of face-saving spin. (And again, I am personally acquainted with 2 of the 3 authors, and they are all top guys in the field of computational number theory.)
Note that the aforementioned George Woltman, whose Prime95 code has discovered all of the world-record primes since shortly after its debut 20 years ago, has made his core bignum routines available in a general API-ized form called the GWNUM library. You mentioned how much faster PFGW is than GMP for FFT-mul - that's because PFGW uses GWNUM for the core 'heavy lifting' arithmetic, that's where the 'GW' in PFGW comes from.
My own FFT implementation, which has generic-C build support but like George's uses reams of x86 vector-math assembler for high performance on that CPU family, is roughly 60-70% slower than George's on current Intel processor families. I believe that makes it the world's 2nd-fastest bignum-mul code on x86. By way of example, my code is currently running a primality test on a number with roughly 2^29 bits using a 30-Mdouble-length FFT (30*2^20 doubles); thus a little more than 17 bits per input word. Using all four of my 3.3 GHz Haswell 4670 quad's cores it takes ~90 ms per modmul.
BTW, many (if not most) of the world's top bignum-math coders hang out at mersenneforum.org; I encourage you to check it out and put your questions to the broader (at least in this particular area) expert audience there. I appear under the same handle there as here; George Woltman appears as "Prime95", and PFGW's Mark Rodenkirch goes by "rogue".
I do not use GMP at all, so take this with that in mind.
I use NTT instead of FFT for multiplication:
it removes the rounding errors, and in comparison to my FFT implementations optimized to the same degree, it is faster
C++ NTT implementation
C++ NTT sqr(mul) implementation
As I mentioned, I do not use GMP, but I suspect that when they write FFT they really mean an NTT (finite field Fourier transform).
The speed difference between your test and the GMP primality test can be caused by how the modpow is called.
If there are too many calls to it, heap/stack thrashing slows things down considerably, especially for bignums. Try to avoid heap thrashing: eliminate as much data as possible from the operands and return values of frequently called functions. It also sometimes helps to eliminate a bottleneck call by copying the function's source code directly into your code instead of calling it (or by using a macro), using only local variables.
GMP's source code is published, so finding their implementation of modpow should not be too hard; you just have to use it correctly.
Just to be clear:
You are using big numbers with 20,000+ decimal digits, which means ~8.4 KB per number. Any return value or non-pointer operand means copying that amount of data to/from the heap/stack. This not only takes time but also usually invalidates the CPU cache, which kills performance further.
Now multiply this by the number of iterations of the algorithm and you get the idea. I had similar problems while tweaking many of my bignum functions, and the speedup was often more than 10000% (100 times) even with no change in the algorithm used, just from limiting/eliminating heap/stack thrashing.
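As a concrete illustration of that kind of reuse (my own sketch, not the asker's code; the exponent sequence is made up for illustration), GMP's C interface already lets you keep all operands and the destination preallocated across calls:
#include <gmp.h>

// Repeated modpow calls with every mpz_t initialized once, outside the loop,
// so each call writes into the same preallocated limb buffers instead of
// allocating and freeing bignum storage on every iteration.
void many_modpows(const mpz_t base, const mpz_t mod, unsigned long rounds)
{
    mpz_t result, exp;
    mpz_init(result);
    mpz_init(exp);
    for (unsigned long i = 0; i < rounds; ++i) {
        mpz_set_ui(exp, 2 * i + 1);        // made-up exponent sequence
        mpz_powm(result, base, exp, mod);  // destination buffer is reused every time
        /* ... consume result here ... */
    }
    mpz_clear(exp);
    mpz_clear(result);
}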
So I do not think you need a better modpow implementation, just better usage of it. Of course, I may be wrong, but without the code you are using it is hard to deduce more.

Performance of different CG/GLSL/HLSL functions

There are standard libraries of shader functions, such as for Cg. But are there resources that tell you how long each one takes? I'm thinking of something similar to how you used to be able to look up how many cycles each ASM op would take.
There are no reliable resources that will tell you how long various standard shader functions take. Not even for a particular piece of hardware.
The reason for this has to do with instruction scheduling and the way modern shader architectures work. Take a simple sin function. Let's say the hardware has special circuitry to compute the sine of a value, so it isn't manually evaluating a Taylor series or something. Let's also say that it takes a sequence of 4 opcodes to actually compute it. Therefore, sin would take "4 cycles".
However, all of those opcodes are scalar operations. While they're going on, you could also have some 3-vector dot-products - or, on some hardware, 4-vector dot-products - going on at the same time, on the same processor. Therefore, if the hardware can issue a 4-vector dot-product alongside each scalar operation, the number of cycles it takes to execute a sin and a matrix-vector multiply is... still 4.
So how much did the sin operation cost? If you take out the matrix multiply, nothing gets faster. If you take out the sin, nothing still gets faster. How much does it cost? You can't say, because the cost of a single operation is irrelevant; the only measurable quantity is the cost of the shader itself.
Ultimately, all you can do is build your shader reasonably and see what the performance is. Unless you have low-level debugging tools to pick apart the underlying shader assembly (and no, DX assembly isn't good enough), that's really the best you can do.

fast matrix multiplication in Matlab

I need to perform a matrix/vector multiplication in Matlab with very large sizes: "A" is a 655360-by-5 real-valued matrix that is not necessarily sparse, and "B" is a 655360-by-1 real-valued vector. My question is how to compute B'*A efficiently.
I have noticed a slight time improvement by computing A'*B instead, which gives a column vector. But it is still quite slow (I need to perform this operation several times in the program).
With a bit of searching I found an interesting Matlab toolbox, MTIMESX by James Tursa, which I hoped would improve the performance of the above matrix multiplication. After several trials, I only get very marginal gains over Matlab's native matrix multiplication.
Any suggestions about how I should rewrite A'*B so that the operation is more efficient? Thanks.
Matlab's raison d'etre is doing matrix computations. I would be fairly surprised if you could significantly outperform its built-in matrix multiplication with hand-crafted tools. First of all, you should make sure your multiplication can actually be performed significantly faster. You could do this by implementing a similar multiplication in C++ with Eigen.
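A sketch of that check (my own, with the sizes taken from the question; compile with optimizations and Eigen on the include path) could look like this:
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Time B' * A for a 655360-by-5 A and a 655360-by-1 B to see how much
// headroom there really is over Matlab's built-in multiplication.
int main()
{
    const int n = 655360, k = 5;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, k);
    Eigen::VectorXd B = Eigen::VectorXd::Random(n);

    auto t0 = std::chrono::steady_clock::now();
    Eigen::RowVectorXd result = B.transpose() * A;   // 1-by-5 result
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "B'*A took "
              << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms, result(0) = " << result(0) << "\n";
}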
I have had good results with Matlab matrix multiplication using the GPU.
In order to avoid the transpose operation, you could try:
sum(bsxfun(@times, A, B), 1)
But I would be astonished if it was faster than the direct version. See @thiton's answer.
Also look at http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html to see why the column-vector-based version is faster than the row-vector-based version.
Matlab is built using fairly optimized libraries (BLAS, etc.), so you can't easily improve upon it from within Matlab. Where you can improve is to get a better BLAS, such as one optimized for your processor - this will enable better use of the caches by getting appropriately sized blocks of data from main memory. Take a look into creating your own compiled versions of ATLAS, ACML, MKL, and Goto BLAS.
I wouldn't try to solve this one particular multiplication unless it's really killing you. Changing up the BLAS is likely to lead to a happier solution, especially if you're not currently making use of multicore processors.
Your #1 option, if this is your bottleneck, is to re-examine your algorithm. See this question Optimizing MATLAB code for a great example of how choosing a different algorithm reduced runtime by three orders of magnitude.
