Convert to long long, pros and cons in C - performance

I want to accelerate my algorithm by using long long instead of the double data type. My algorithm finds the shortest path in a directed acyclic graph (DAG). Simply put, it adds the weight of an edge "E: a->b" to b, and if the new weight of b is lower than the previous one, b is updated along with its parent, which is set to a.
In other words, my algorithm consists only of addition and comparison operations. The edge weights are originally of type "double"; is it possible for me to multiply them by a large number and cast them to "long long"? Would this tweak make my program faster and be worth considering? How can I handle instability problems caused by rounding a big double to long long?
Thanks

On an i5 x64 even imul seems to be about 40% faster [than double multiplication]. Integer addition should also happen in fewer cycles / with better throughput. Regarding the "inexactness" problem, you should be aware that doubles can be more inexact than integers.
See also: Calculate which numbers cause problems when converting decimal to floating point?
If you have access to the original data (e.g. the decimal representation of the weights), multiplying it by a large power of ten should produce exact integers without any rounding artifacts. With long longs the only concern will be overflow.
How to address possible rounding instability depends on the dynamic range of your weights and on the maximum number of additions. E.g. if your weights are all less than 1.0 and are multiples of 2^-52, then multiplying by 2^52 gives exact integers with no rounding errors, since scaling by a power of two only changes the exponent. The "instability" is then determined by the possibility of an overflow: summing 2^12 or more such scaled weights can exceed the 64-bit range, because 2^12 * 2^52 >= 2^64.
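For illustration, here is a minimal sketch of that conversion (the helper name to_fixed and the 2^52 scale are my own, assuming all weights lie in [0, 1)):
#include <cmath>

// 2^52: with weights in [0, 1) every scaled weight fits in 52 bits, so a
// path of up to 2^11 edges can still be summed in a signed 64-bit value.
const double SCALE = 4503599627370496.0;

long long to_fixed(double w)      // hypothetical helper, assumes 0.0 <= w < 1.0
{
    // Multiplying by a power of two only changes the exponent, so this step
    // adds no rounding error of its own; the single rounding happens in llround.
    return std::llround(w * SCALE);
}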

Related

Precision of digital computers

I read that multiplying many values between 0 and 1 will significantly reduce the precision of digital computers. I want to know what basis this claim rests on, and whether it still holds for modern-day computers.
The typical IEEE-conformant representation of fractional numbers only supports a limited number of (binary) digits. So, very often, the result of some computation isn't an exact representation of the expected mathematical value, but something close to it (rounded to the next number representable within the digits limit), meaning that there is some amount of error in most calculations.
If you do multi-step calculations, you might be lucky that the error introduced by one step is compensated by some complementary error at a later step. But that's pure luck, and statistics teaches us that the expected error will indeed increase with every step.
If you e.g. do 1000 multiplications using the float datatype (typically achieving 6-7 significant decimal digits accuracy), I'd expect the result to be correct only up to about 5 digits, and in worst case only 3-4 digits.
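As a toy illustration of this claim (the factor 0.9999 is arbitrary; the double accumulator merely serves as a more precise reference, and the exact number of lost digits depends on the values involved):
#include <cmath>
#include <cstdio>

int main()
{
    const float factor = 0.9999f;   // some value between 0 and 1
    float  f = 1.0f;
    double d = 1.0;                 // reference carried in double precision
    for (int i = 0; i < 1000; ++i) {
        f *= factor;                // rounds to 24-bit precision on every step
        d *= (double)factor;        // same factor, but the per-step rounding stays tiny
    }
    std::printf("float  result: %.9g\n", f);
    std::printf("double result: %.17g\n", d);
    std::printf("relative error of float: %g\n", std::fabs(f - d) / d);
}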
There are ways to do precise calculations (at least for addition, subtraction, multiplication and division), e.g. using the ratio type in the LISP programming language, but in practice they are rarely used.
So yes, doing multi-step calculations in datatypes supporting fractional numbers quickly degrades precision, and it happens with all number ranges, not only with numbers between 0 and 1.
If this is a problem for some application, it's a special skill to transform mathematical formulas into equivalent ones that can be computed with better precision (e.g. formulas with fewer intermediate steps).

What is the fastest way to get the value of e?

What is the most optimised algorithm which finds the value of e with moderate accuracy?
I am looking for a comparison between optimised approaches giving more importance to speed than high precision.
Edit: By moderate accuracy I mean up to 6-7 decimal places.
But if there is a HUGE difference in speed, then I can settle for 4-5 places.
basic datatype
As mentioned in the comments, 6-7 decimal places is too little accuracy to justify an algorithm. Instead, use a constant, which is the fastest way anyway:
const double e=2.7182818284590452353602874713527;
If an FPU is involved, the constant is usually stored there too... Also, a single constant occupies much less space than a function that computes it ...
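As a small illustration (a hedged sketch; the 10-term cutoff is just an example), even the cheapest series still costs a loop of adds and divides, whereas the constant costs nothing at runtime:
#include <cstdio>

int main()
{
    const double e_const = 2.7182818284590452353602874713527;  // just use the constant

    // For comparison: the series e = sum of 1/n! needs about 10 terms to
    // reach ~7 correct decimals, i.e. ~10 adds and 10 divides at runtime.
    double e_sum = 1.0, term = 1.0;
    for (int n = 1; n <= 10; ++n) { term /= n; e_sum += term; }

    std::printf("constant: %.15f\n", e_const);
    std::printf("series  : %.15f\n", e_sum);
}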
finite accuracy
Only once bignums are involved does it have any merit to use an algorithm to compute e. The algorithm depends on the target accuracy. Again, for smaller accuracies predefined constants are used:
e=2.71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526059563073813232862794349076323382988075319525101901157383418793070215408914993488416750924476146066808226480016847741185374234544243710753907774499206955170189
but usually in hex format for faster and more precise manipulation:
e=2.B7E151628AED2A6ABF7158809CF4F3C762E7160F38B4DA56A784D9045190CFEF324E7738926CFBE5F4BF8D8D8C31D763DA06C80ABB1185EB4F7C7B5757F5958490CFD47D7C19BB42158D9554F7B46BCED55C4D79FD5F24D6613C31C3839A2DDF8A9A276BCFBFA1C877C56284DAB79CD4C2B3293D20E9E5EAF02AC60ACC93ECEBh
For limited/finite accuracy and best speed, the PSLQ algorithm is best. My understanding is that it is an algorithm for finding integer relations between real numbers.
Here is my favourite PSLQ example: up to 800 digits of Pi (PSLQ example).
arbitrary accuracy
For arbitrary or "fixed" precision you need an algorithm with variable precision. This is what I use in my arbnum class:
e=(1+1/x)^x where x -> +infinity
If you choose x as a power of 2, realize that x is just a single set bit of the number and 1/x has a predictable bit-width. So e is obtained with a single division and a pow. Here is an example:
arbnum arithmetics_e()      // e computation, ~min(_arbnum_max_a,_arbnum_max_b)*5 decimals
    {                       // e=(1+1/x)^x ... x -> +inf
    int i; arbnum c,x;
    i=_arbnum_bits_a; if (i>_arbnum_bits_b) i=_arbnum_bits_b; i>>=1;  // i = half of the smaller bit count (the exponent of x)
    c.zero(); c.bitset(_arbnum_bits_b-i);                             // c = 2^-i = 1/x (a single set fractional bit)
    x.one(); x/=c; c++;                                               // x = 2^i, c = 1 + 1/x
    for (;!x.bitget(_arbnum_bits_b);x>>=1) c*=c;                      // c = c^x by squaring i times ... =pow(c,x)
    return c;
    }
Where _arbnum_bits_a, _arbnum_bits_b are the numbers of bits before and after the decimal point, in binary. So it breaks down to some bit operations, one bignum division and a single power by squaring. Beware that multiplication and division are not that simple with bignums and usually involve Karatsuba or worse ...
There are also polynomial approaches out there that do not require bignum arithmetic, similar to those used to compute Pi. The idea is to compute a chunk of binary bits per iteration without affecting the previously computed bits (too much). They should be faster, but as usual for any optimization, that depends on the implementation and the HW it runs on.
For reference, see Brothers' formula here: https://www.intmath.com/exponential-logarithmic-functions/calculating-e.php

Nearest neighbor searches in non-metric spaces

I would like to know about nearest neighbor search algorithms when working in non-metric spaces. In particular, is there any variant of the kd-tree algorithm in this setting with provable time complexity, etc.?
Probably more of theoretical interest for you:
The PH-Tree is similar to a quadtree; however, it transforms floating-point coordinates into a non-metric system before storing them. The PH-Tree performs all queries (including kNN queries) on the non-metric data using a non-metric distance function (you can define your own distance functions on top of that).
In terms of kNN, the PH-Tree performs on par with trees like R+Trees and usually outperforms kd-trees.
The non-metric data storage appears to have little negative, possibly even positive, effect on performance, except maybe for the (almost negligible) execution time for the transformation and distance function.
The reason that the data is transformed comes from an inherent constraint of the tree: the tree is a bit-wise trie, which means it can only store bit sequences (which can be seen as integer numbers). In order to store floating point numbers in the tree, we simply use the IEEE bit representation of the floating point number and interpret it as an integer (this works fine for positive numbers; negative numbers are a bit more complex). Crucially, this preserves the ordering, i.e. if a floating point value f1 is larger than f2, then the integer formed from its bits, int(f1), is also always larger than int(f2). Trivially, this transformation allows storing floating point numbers as integers without any loss of precision(!).
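For illustration, one common form of such an order-preserving reinterpretation looks like this (a generic sketch; the PH-Tree's exact handling of negative values may differ):
#include <cstdint>
#include <cstring>

// Map a double to a 64-bit unsigned key so that value order == key order:
// positive values keep their bit pattern (placed above all negatives by
// setting the sign bit), negative values get all bits flipped.
uint64_t order_preserving_key(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);          // raw IEEE-754 bit pattern
    if (bits & 0x8000000000000000ULL)
        return ~bits;                             // negative: reverse the order
    return bits | 0x8000000000000000ULL;          // positive: place above negatives
}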
The transformation is non-metric because the leading bits (after the sign bit) of a floating point number are the exponent bits, followed by the fraction bits. Clearly, if two numbers differ in their exponent bits, their distance grows exponentially faster (or slower for negative exponents) compared to distances caused by differences in the fraction bits.
Why did we use a bit-wise trie? If we have d dimensions, it allows an easy transformation such that we can map the n'th bit of each of the d values of a coordinate into a bit string with d bits. For example, for d=60 we get a 60-bit string. Assuming a CPU register width of 64 bits, this means we can perform many operations related to queries in constant time, i.e. many operations cost just one CPU operation, independent of whether we have 3 dimensions or 60 dimensions. It's probably hard to understand what's going on from this short text; more details on this can be found here.
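A rough sketch of that bit-slicing step (illustrative only, not the PH-Tree's actual code):
#include <cstdint>
#include <vector>

// Collect bit position n from each of the d dimension values into one d-bit
// word. For d <= 64 that word fits in a single register, so comparisons and
// masks on it cost one CPU operation regardless of the dimensionality.
uint64_t slice_bits(const std::vector<uint64_t>& coord, int n)
{
    uint64_t s = 0;
    for (std::size_t dim = 0; dim < coord.size(); ++dim)
        s |= ((coord[dim] >> n) & 1ULL) << dim;   // bit n of dimension dim
    return s;
}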
NMSLIB provides a library for performing Nearest Neighbor Search in non-metric spaces. The GitHub page lists a dozen papers to read, but not all of them apply to non-metric spaces.
Unfortunately, there are few theoretical results regarding the complexity of Nearest Neighbor Search for non-metric spaces, and there are no comprehensive empirical evaluations.
I can only see some theoretical results in Effective Proximity Retrieval by Ordering Permutations, but I am not convinced. However, I suggest you take a look.
There seem to be few people, if any, who use k-d trees for non-metric spaces. They seem to use VP trees, etc. Densitrees are also used, as described in Near Neighbor Search in Nonmetric Spaces.
Intuitively, densitrees are a class of decorated trees that hold the points of the dataset in a way similar to the metric tree. The critical difference lies in the nature of the tree decoration; instead of having one or several real values reflecting some bounds on the triangular inequality attached to every tree node, each densitree node is associated with a particular classifier called here a density estimator.

math: scale coordinate system so that certain points get integer coordinates

This is more of a mathematical problem. Nonetheless, I am looking for an algorithm in pseudocode to solve it.
Given is a one-dimensional coordinate system with a number of points. The coordinates of the points may be floating point.
Now I am looking for a factor that scales this coordinate system so that all points land on whole numbers (i.e. integer coordinates).
If I am not mistaken, there should be a solution to this problem as long as the number of points is not infinite.
If I am wrong and there is no analytical solution to this problem, I am interested in an algorithm that approximates the solution as closely as possible (i.e. the coordinates will look like 15.0001).
If you are interested in the concrete problem:
I would like to overcome the well-known pixel-snapping problem in Adobe Flash, which cuts off half-pixels at the border of bitmaps if the whole stage is scaled. I would like to find an ideal scaling factor for the stage that places my bitmaps on whole (screen-)pixel coordinates.
Since I am placing two bitmaps on the stage, the number of points will be 4 in each direction (x, y).
Thanks!
As suggested, you have to convert your floating point numbers to rational ones. Fix a tolerance epsilon, and for each coordinate, find its best rational approximation within epsilon.
An algorithm and definitions are outlined in this section.
Once you have converted all the coordinates into rational numbers, the scaling is given by the least common multiple of the denominators.
Note that this latter number can become quite huge, so you may want to experiment with epsilon in order to control the denominators.
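A minimal sketch of that procedure (the brute-force denominator search below is only for illustration; the continued-fraction method from the linked section is the proper way):
#include <cmath>
#include <numeric>   // std::lcm (C++17)
#include <vector>

// Approximate each coordinate by p/den with |x - p/den| <= eps and den as
// small as possible (capped at max_den), then return the lcm of all
// denominators as the scaling factor.
long long scale_factor(const std::vector<double>& coords, double eps, long long max_den)
{
    long long scale = 1;
    for (double x : coords) {
        long long den = 1;
        while (den < max_den &&
               std::fabs(x * den - std::llround(x * den)) > eps * den)
            ++den;                       // smallest denominator within tolerance
        scale = std::lcm(scale, den);    // the scale must clear every denominator
    }
    return scale;
}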
My own inclination, if I were in your situation, would be to work with rational numbers rather than floating point.
And the algorithm you are looking for is finding the lowest common denominator.
A floating point number is an integer, multiplied by a power of two (the power might be negative).
So, find the largest necessary power of two among your inputs, and that gives you a scale factor that will work. The power of two isn't just -1 times the exponent of the float, it's a few more than that (according to where the least significant 1 bit is in the significand).
It's also optimal, because if x times a power of 2 is an odd integer, then x in its float representation was already in simplest rational form; there's no smaller integer that you can multiply x by to get an integer.
Obviously if you have a mixture of large and small values among your inputs, then the resulting integers will tend to be bigger than 64 bits. So there is an analytical solution, but perhaps not a very good one given what you want to do with the results.
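A sketch of finding that power of two for one input (assumes finite, non-NaN doubles; the overall scale factor is then 2 raised to the largest such exponent over all inputs):
#include <cmath>

// Smallest k >= 0 such that x * 2^k is an integer. Multiplying by 2 is exact
// for doubles (only the exponent changes), and any finite double with a
// fractional part reaches an integer within a bounded number of doublings.
int pow2_scale_exponent(double x)
{
    int k = 0;
    while (x != std::floor(x)) { x *= 2.0; ++k; }
    return k;
}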
Note that this approach treats floats as being precise representations, which they are not. You may get more sensible results by representing each float as a rational number with smaller denominator (within some defined tolerance), then taking the lowest common multiple of all the denominators.
The problem there though is the approximation process - if the input float is 0.334[*] then I can't in general be sure whether the person who gave it to me really meant 0.334, or whether it's 1/3 with some inaccuracy. I therefore don't know whether to use a scale factor of 3 and say the scaled result is 1, or use a scale factor of 500 and say the scaled result is 167. And that's just with 1 input, never mind a bunch of them.
With 4 inputs and an allowed final tolerance of 0.0001, you could perhaps find the 10 closest rationals to each input with a certain maximum denominator, then try 10^4 different possibilities and see whether the resulting scale factor gives you any values that are too far from an integer. Brute force seems nasty, but you might at least be able to bound the search a bit as you go. Also, "maximum denominator" might be expressed in terms of the primes present in the factorization, rather than just the number itself, since if you can find a lot of common factors among them then they'll have a smaller lcm and hence a smaller deviation from integers after scaling.
[*] Not that 0.334 is an exact float value, but that sort of thing. Decimal examples are easier.
If you are talking about single precision floating point numbers, then according to Wikipedia a (normal) number can be expressed as (-1)^sign * (1 + fraction/2^23) * 2^(e-127), where e is the 8-bit biased exponent and fraction is the 23-bit significand field.
From this formula you can deduce that you always get an integer if you multiply by 2^(127+23). (Actually, when e is 0 you have to use another formula for the special range of "subnormal" numbers, so 2^(126+23) is sufficient. See the linked Wikipedia article for details.)
To do this in code you will probably need to do some bit twiddling to extract the factors in the above formula from the bits in the floating point value. And then you will need some kind of support for unlimited size numbers to express the integer result of the scaling (e.g. BigInteger in .NET). Normal primitive types in most languages/platforms are typically limited to much smaller sizes.
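The bit twiddling part could look roughly like this (an illustrative sketch, not tied to any particular platform or library):
#include <cstdint>
#include <cstring>

// Split an IEEE-754 single precision value into its sign, biased exponent
// and 23-bit fraction; a big-integer type is then needed to hold the
// scaled result value * 2^(127+23).
void split_float(float f, int& sign, int& exponent, uint32_t& fraction)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // raw bit pattern of the float
    sign     = (bits >> 31) & 1;
    exponent = (bits >> 23) & 0xFF;        // 0 marks the subnormal range
    fraction =  bits        & 0x7FFFFF;
}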
It's really a problem in statistical inference combined with noise reduction. This is the method I'm going to try out soon. I'm assuming you're trying to get a regularly spaced 2-D grid but a similar method could work on a regularly spaced grid of 3 or more dimensions.
First tabulate all the differences and note that (dx,dy) and (-dx,-dy) denote the same displacement, so there's an equivalence relation. Group those differences that are within a pre-assigned threshold (epsilon) of one another. Epsilon should be large enough to capture measurement errors due to random noise or lack of image resolution, but small enough not to accidentally combine clusters.
Sort the clusters by their average size (dr = root(dx^2 + dy^2)).
If the original grid was, indeed, regularly spaced and generated by two independent basis vectors, then the two smallest linearly independent clusters will indicate so. The smallest cluster is the one centered on (0, 0). The next smallest cluster, (dx0, dy0), gives the first basis vector up to a +/- sign ((-dx0, -dy0) denotes the same displacement, recall).
The next smallest clusters may be linearly dependent on this (up to the threshold epsilon) by virtue of being multiples of (dx0, dy0). Find the smallest cluster which is NOT a multiple of (dx0, dy0). Call this (dx1, dy1).
Now you have enough to tag the original vectors. Sort the vectors in increasing lexicographic order: (x,y) > (x',y') if x > x', or x = x' and y > y'. Take the smallest (x0,y0) and assign the integer pair (0, 0) to it. Take all the others (x,y), find the decomposition (x,y) = (x0,y0) + M0(x,y) (dx0,dy0) + M1(x,y) (dx1,dy1), and assign it the integers (m0(x,y), m1(x,y)) = (round(M0), round(M1)).
Now do a least-squares fit of the integers to the vectors via the equations (x,y) = (ux,uy) + m0(x,y) (u0x,u0y) + m1(x,y) (u1x,u1y) to find (ux,uy), (u0x,u0y) and (u1x,u1y). This identifies the grid.
Test this match to determine whether or not all the points are within a given threshold of this fit (maybe using the same threshold epsilon for this purpose).
The 1-D version of this same routine should also work in 1 dimension on a spectrograph to identify the fundamental frequency in a voice print. Only in this case, the assumed value for ux (which replaces (ux,uy)) is just 0 and one is only looking for a fit to the homogeneous equation x = m0(x) u0x.

Why is float division slow?

What are the steps in the algorithm to do floating point division?
Why is the result slower than say, multiplication?
Is it done the same way we do division by hand? By repeatedly dividing by the divisor, subtracting the result to obtain a remainder, aligning the number again and continuing till the remainder is less than a particular value?
Also, why do we gain on performance if instead of doing
a = b / c
we do
d = 1 / c
a = b * d
?
Edit:
Basically I was asking because someone asked me to distribute a value among contenders based on the assignment of weights. I did all this in integers and was later asked to convert to float, which caused a slowdown in performance. I was just interested in knowing how C or C++ would do these operations and why that causes the slowness.
FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.
This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:
Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).
From a hardware point of view, division is an iterative algorithm, and the time it takes is proportional to the number of bits. The fastest division currently around uses the radix-4 algorithm, which generates 4 bits of result per iteration. For a 32-bit divide you need at least 8 steps.
Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.
You need log2(n) steps to combine the results computed in parallel, so a 32-bit multiply needs 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (they're just additions). That means in practice multiplies are even faster.
As described in the Wikipedia article Division algorithm, there are two main approaches to division in computers:
Slow Division
Uses the following recurrence and finds one digit per iteration:
partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator
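A radix-2 ("one bit per iteration") version of this recurrence, written out as a plain restoring-division sketch:
#include <cstdint>

// Classic restoring division: shift in one numerator bit at a time and
// subtract the denominator whenever the partial remainder allows it. One
// quotient bit per iteration is exactly why this family of algorithms is slow.
uint32_t slow_div(uint32_t num, uint32_t den, uint32_t& rem)
{
    uint32_t q = 0;
    uint64_t r = 0;                                // partial remainder
    for (int i = 31; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1u);          // bring down the next bit
        if (r >= den) { r -= den; q |= 1u << i; }  // quotient digit = 1
    }
    rem = (uint32_t)r;
    return q;
}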
Fast Division
Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.
Newton-Raphson division (very briefly):
Calculate estimate of the reciprocal.
Compute more accurate estimates of the reciprocal.
Compute quotient by multiplying the dividend by the reciprocal.
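As a toy illustration of those three steps (crude linear initial estimate; real hardware seeds the iteration from a small lookup table instead, and handles signs, zero and special values):
#include <cmath>

// b / c via Newton-Raphson: x <- x * (2 - c*x) converges quadratically to
// 1/c, so a handful of iterations suffice; the division itself is replaced
// by multiplications. Assumes c > 0 and finite.
double nr_divide(double b, double c)
{
    int e;
    double m = std::frexp(c, &e);                 // c = m * 2^e, m in [0.5, 1)
    double x = std::ldexp(2.9142 - 2.0 * m, -e);  // rough first estimate of 1/c
    for (int i = 0; i < 5; ++i)                   // each step roughly doubles the correct bits
        x = x * (2.0 - c * x);
    return b * x;                                 // quotient = dividend * reciprocal
}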
Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integral ALUs, the division circuit is a far busier place than a multiplication circuit. I would suspect this is only more painful in the world of floating point, as now the data isn't just least to most significant digit ordered, but is instead ordered by the IEEE 754 standard.
As for the round off, it's really about wherever the signals traveling between the gates get soldered to ground; where that happens, you lose digits. Not rounding, so much as truncation.
Or were you asking about simulating floating point arithmetic using just integers?
You won't gain performance by doing
d = 1 / c
a = b * d
You probably mean:
d = 1 / c
a1 = b1 * d
a2 = b2 * d
This way the division is done only once.
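In code, that pattern typically looks like this (names are illustrative):
// One division hoisted out of the loop, replaced by n multiplications.
void scale_all(const double* b, double* a, int n, double c)
{
    const double d = 1.0 / c;      // pay for the division once
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * d;           // multiplications are much cheaper
}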
Division is per se slower than multiplication; however, I don't know the details. The basic reason is that, similar to functions such as sin or sqrt, it's just mathematically more complex. IIRC, a multiplication takes about 10 cycles on an average CPU, while a division takes about 50 or more.
How it is actually done was nicely explained by John Mulder.
Float division is not much slower than integer division, but the compiler may be unable to do the same optimizations.
For example, the compiler can replace integer division by 3 with a multiplication and a binary shift.
Also, it can replace float division by 2.0 with a multiplication by 0.5, but it cannot replace division by 3.0 with a multiplication by 1/3.0, because 1/3.0 cannot be represented exactly in binary, so rounding errors may change the result of the division.
As the compiler doesn't know how sensitive your application is to rounding errors (say you were doing a weather simulation; see the butterfly effect), it cannot do that optimization.
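For the integer case mentioned above, the strength reduction the compiler performs looks roughly like this (sketch for unsigned 32-bit division by 3):
#include <cstdint>

// x / 3 for any 32-bit unsigned x, computed as a multiply by the "magic"
// constant ceil(2^33 / 3) = 0xAAAAAAAB followed by a right shift by 33.
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}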
