0.0 / 0.0 = -NAN? - format

When the exception masks are all enabled so that the processor does not raise floating point exceptions, the following calculation produces a somewhat strange result:
0.0 / 0.0 = -nan
(at least in Delphi).
For now I will assume this is the case in C/C++ as well, and by that I mean on x86/x64, which should be (and seems to be) following the IEEE 754 floating-point format.
I am a little surprised by this and want/need to know more. Where is it defined that 0.0 / 0.0 should be -NAN?
The problem shows up in code like this example:
T := 0;
D := 0.0 / 0.0;
P := T * D;
This screws up P: instead of P being zero, P is now also -NAN.
I find this very strange, but OK.
I guess a simple solution would be to set D to 0 explicitly for this case. Is there perhaps another solution, maybe some kind of mask or rounding mode, so that the additional branch is not necessary?
In Delphi it even depends on the platform:
Win32: 0.0 / 0.0 = -NAN
Win64: 0.0 / 0.0 = -1.#IND
Win32 sets the FPU exception mask.
Win64 sets the SSE exception mask.
(I believe I read somewhere that there are slight differences between FPU and SSE floating point operations, but I can't recall the details right now; something to look into further. For now my tests have shown that the code I am currently working on works on both platforms.)
Just to be clear, 0.0 for me simply means a zero floating point number, though there are also +0.0 and -0.0, as well as -Infinity and +Infinity.
I would like to find some kind of specification or definition that explains in depth why 0/0 produces these values.
These values also seem to be "infectious", like a virus: they propagate through operations. So far I have observed this with multiplication, addition and comparison. ;)
P.S.: Keep your floating point issues out of my computer programs! =D (Will Smith meme =D)
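(Not part of the original question, but a minimal C sketch of the behaviour being described, assuming an IEEE 754 platform with floating point exceptions masked, which is the default in C: 0.0 / 0.0 is the invalid operation that IEEE 754 defines to return a quiet NaN, and that NaN propagates through later arithmetic unless it is checked, e.g. with isnan.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;   /* volatile so the division happens at run time */
    double d = zero / zero;       /* IEEE 754 invalid operation -> quiet NaN */
    double t = 0.0;
    double p = t * d;             /* NaN propagates: p is NaN, not 0 */

    printf("d = %f, p = %f\n", d, p);

    if (isnan(d))                 /* one explicit check, so later code needs no branch */
        d = 0.0;

    printf("after the check: t * d = %f\n", t * d);
    return 0;
}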

Related

Why is gcc evaluating floating point fractional #defines as zero?

How is (float)_micron_to_meter equal to zero? Given:
//convert microns to meters via multiplication
#define _micron_to_meter (1 / 1000000)
And yet
printf("factor: %f\n", (float)_micron_to_meter);
prints out 0.000000.
Oh... duh... I can't believe I forgot this. Answer below. I hesitate to post this, but I did pretty extensive searches here and couldn't find any other post about it, and there must be some other programmer who will be as clueless as I was being here. If you find another, feel free to mark this as a duplicate, but check why that post wasn't found.
Note to any future C programmers (and I /knew/ this, I'm not sure how I missed it). If you have a fractional define, made up of integers, e.g.
//convert microns to meters via multiplication
#define _micron_to_meter (1 / 1000000)
then the value is zero. It's already zero before it makes it out of the parentheses, because 1 is an integer literal, and integer division of 1 by anything greater than 1 truncates toward zero. If you do this:
#define _micron_to_meter (1.0 / 1000000)
Then it works as expected, because 1.0 is a floating point constant, so the division is done in floating point.
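(A small C illustration of the point above, using the macro from the question; the second macro name is added here purely for contrast.)

#include <stdio.h>

/* integer division: 1 / 1000000 truncates to 0 before any cast */
#define _micron_to_meter       (1 / 1000000)
/* 1.0 is a double, so the whole expression is evaluated in floating point */
#define _micron_to_meter_float (1.0 / 1000000)

int main(void)
{
    printf("factor (int):   %f\n", (float)_micron_to_meter);       /* prints 0.000000 */
    printf("factor (float): %f\n", (float)_micron_to_meter_float); /* prints 0.000001 */
    return 0;
}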

NodeMCU Integer vs. Float Firmware what is different?

I was asking myself what are the differences between integer and float firmware and how deal with them. All I've been able to find so far is:
the integer version which supports only integer operations and the float version which contains support for floating point calculations
OK, so far so good, but what does this mean in real life?
What happens, when I calculate
a = 3/2
For the float version I'd expect a = 1.5
For the integer version I'd expect a = 1. Or will a equal 2, or does it throw an error, crash, or do something else? I know I could simply flash the integer version and give it a try, but I'd also like to have it discussed and answered here. :)
What other limitations/differences exist? The main reason I am asking: I tried to run some scripts on integer version without any float operations I am aware of and some functionality simply isn't there. With the float version it works as expected.
Update:
Here is the snippet that produces an unexpected result:
local duration = (now - eventStart)
duration is 0 with the integer firmware. I'd guess it is because now and eventStart are too large for integers:
now: 1477651622514913
eventStart: 1477651619238587
So I'd say another limitation is that the integer version only supports integer operations on 32-bit signed values (so up to 2^31 - 1), because when I convert
now = tonumber(now)
now = 2147483647 which is 2^31 - 1
so in integer firmware
1477651622514913 - 1477651619238587 = 0
is the same as
2147483647 - 2147483647
which is obviously 0
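(This is not NodeMCU/Lua code, just a C sketch of the arithmetic being described: both microsecond timestamps are far above the 32-bit signed maximum, so once they are forced into that range, clamped explicitly here to mimic what the questioner observed with tonumber on the integer build, they collapse to the same value and their difference becomes 0.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: clamp a wide value into the 32-bit signed range. */
static int32_t clamp_to_int32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

int main(void)
{
    int64_t now        = 1477651622514913LL;
    int64_t eventStart = 1477651619238587LL;

    int32_t now32   = clamp_to_int32(now);        /* 2147483647 */
    int32_t start32 = clamp_to_int32(eventStart); /* 2147483647 */

    printf("64-bit difference: %" PRId64 "\n", now - eventStart); /* 3276326 */
    printf("32-bit difference: %" PRId32 "\n", now32 - start32);  /* 0 */
    return 0;
}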
NodeMCU developer FAQ says: "integer builds have a smaller Flash footprint and execute faster, but working in integer also has a number of pitfalls"
You've found some of the pitfalls with the 32-bit signed int number size limits.
No idea what others there might be, individual software modules may have their own.
"smaller": usually around 13kB of difference on custom 1.5.4.1final builds totaling 369-478kB
"faster": here is a comparison of integer and floating point operations, with the benchmark source if you'd like to run your own. Overall: int operations are almost 8 times faster. The difference might be smaller when the floats are whole numbers.
You gave the answer to your question yourself. The integer version does not support floating point operations nor does it allow non-integer numbers.
In the integer version 3/2 is 1 rather than 1.5.
I could simply flash the integer version and give it a try, but I'd also like to discuss it. :)
Stack Overflow is a Q&A site and is thus not well suited for discussions. Use the NodeMCU forums on esp8266.com for that.

How to make my Haskell program faster? Comparison with C

I'm working on an implementation of one of the SHA3 candidates, JH. I'm at the point where the algorithm passes all KATs (Known Answer Tests) provided by NIST, and I have also made it an instance of the Crypto-API. Thus I have begun looking into its performance. But I'm quite new to Haskell and don't really know what to look for when profiling.
At the moment my code is consistently slower than the reference implementation written in C, by a factor of 10 for all input lengths (C code found here: http://www3.ntu.edu.sg/home/wuhj/research/jh/jh_bitslice_ref64.h).
My Haskell code is found here: https://github.com/hakoja/SHA3/blob/master/Data/Digest/JHInternal.hs.
Now I don't expect you to wade through all my code, rather I would just want some tips on a couple of functions. I have run some performance tests and this is (part of) the performance file generated by GHC:
Tue Oct 25 19:01 2011 Time and Allocation Profiling Report (Final)
main +RTS -sstderr -p -hc -RTS jh e False
total time = 6.56 secs (328 ticks @ 20 ms)
total alloc = 4,086,951,472 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
roundFunction Data.Digest.JHInternal 28.4 37.4
word128Shift Data.BigWord.Word128 14.9 19.7
blockMap Data.Digest.JHInternal 11.9 12.9
getBytes Data.Serialize.Get 6.7 2.4
unGet Data.Serialize.Get 5.5 1.3
sbox Data.Digest.JHInternal 4.0 7.4
getWord64be Data.Serialize.Get 3.7 1.6
e8 Data.Digest.JHInternal 3.7 0.0
swap4 Data.Digest.JHInternal 3.0 0.7
swap16 Data.Digest.JHInternal 3.0 0.7
swap8 Data.Digest.JHInternal 1.8 0.7
swap32 Data.Digest.JHInternal 1.8 0.7
parseBlock Data.Digest.JHInternal 1.8 1.2
swap2 Data.Digest.JHInternal 1.5 0.7
swap1 Data.Digest.JHInternal 1.5 0.7
linearTransform Data.Digest.JHInternal 1.5 8.6
shiftl_w64 Data.Serialize.Get 1.2 1.1
Detailed breakdown omitted ...
Now quickly about the JH algorithm:
It's a hash algorithm which consists of a compression function F8, which is repeated as long as there are input blocks (of length 512 bits), just as the SHA functions operate. The F8 function consists of the E8 function, which applies a round function 42 times. The round function itself consists of three parts:
a sbox, a linear transformation and a permutation (called swap in my code).
Thus it's reasonable that most of the time is spent in the round function. Still I would like to know how those parts could be improved. For instance: the blockMap function is just a utility function, mapping a function over the elements in a 4-tuple. So why is it performing so badly? Any suggestions would be welcome, and not just on single functions, i.e. are there structural changes you would have done in order to improve the performance?
I have tried looking at the Core output, but unfortunately that's way over my head.
I attach some of the heap profiles at the end as well in case that could be of interest.
EDIT :
I forgot to mention my setup and build. I run it on an x86_64 Arch Linux machine, GHC 7.0.3-2 (I think), with compile options:
ghc --make -O2 -funbox-strict-fields
Unfortunately there seems to be a bug on the Linux platform when compiling via C or LLVM, giving me the error:
Error: .size expression for XXXX does not evaluate to a constant
so I have not been able to see the effect of that.
Switch to unboxed Vectors (from Array, used for constants)
Use unsafeIndex instead of incurring the bounds check and data dependency from safe indexing (i.e. !)
Unpack Block1024 as you did with Block512 (or at least use UnboxedTuples)
Use unsafeShift{R,L} so you don't incur the check on the shift value (coming in GHC 7.4)
Unfold the roundFunction so you have one rather ugly and verbose e8 function. This was significant in pureMD5 (the rolled version was prettier but massively slower than the unrolled version). You might be able to use TH to do this and keep the code smallish. If you do this then you'll have no need for constants as these values will be explicit in the code and result in a more cache friendly binary.
Unpack your Word128 values.
Define your own addition for Word128, don't lift Integer. See LargeWord for an example of how this can be done.
rem not mod
Compile with optimization (-O2) and try llvm (-fllvm)
EDIT: And cabalize your git repo along with a benchmark so we can help you easier ;-). Good work on including a crypto-api instance.
The lower graph shows that a lot of memory is occupied by lists. Unless there are more lurking in other modules, they can only come from e8. Maybe you'll have to bite the bullet and make that a loop instead of a fold, but for starters, since Block1024 is a pair, the foldl' doesn't do much evaluation on the fly (unless the strictness analyser has become significantly better). Try making that stricter, data Block1024 = B1024 !Block512 !Block512, perhaps it also needs {-# UNPACK #-} pragmas. In roundFunction, use rem instead of mod (this will only have minor impact, but it's a bit faster) and make the let bindings strict. In the swapN functions, you might get better performance giving the constants in the form W x y rather than as 128-bit hex numbers.
I can't guarantee those changes will help, but that's what looks most promising after a short glance.
Ok, so I thought I would chime in with an update of what I have done and the results obtained thus far. Changes made:
Switched from Array to UnboxedArray (made Word128 an instance type)
Used UnboxedArray + fold in e8 instead of lists and (prelude) fold
Used unsafeIndex instead of !
Changed the type of Block1024 to a real datatype (similar to Block512), and unpacked its arguments
Updated GHC to version 7.2.1 on Arch Linux, thus fixing the problem with compiling via C or LLVM
Switched mod to rem in some places, but NOT in roundFunction. When I do it there, the compile time suddenly takes an awful lot of time, and the run time becomes 10 times slower! Does anyone know why that may be? It is only happening with GHC-7.2.1, not GHC-7.0.3
I compile with the following options:
ghc-7.2.1 --make -O2 -funbox-strict-fields main.hs ./Tests/testframe.hs -fvia-C -optc-O2
And the results? Roughly a 50 % reduction in time. On an input of ~107 MB, the code now takes 3 minutes as compared to the previous 6-7 minutes. The C version uses 42 seconds.
Things I tried, but which didn't result in better performance:
Unrolled the e8 function like this:
e8 !h = go h 0
  where
    go !x !n
      | n == 42   = x
      | otherwise = go h' (n + 1)
      where !h' = roundFunction x n
Tried breaking up the swapN functions to use the underlying Word64s directly:
swap1 (W xh xl) =
    shiftL (W (xh .&. 0x5555555555555555) (xl .&. 0x5555555555555555)) 1
    .|.
    shiftR (W (xh .&. 0xaaaaaaaaaaaaaaaa) (xl .&. 0xaaaaaaaaaaaaaaaa)) 1
Tried using the LLVM backend
All of these attempts gave worse performance than what I have currently. I don't know if that's because I'm doing it wrong (especially the unrolling of e8), or because they are just worse options.
Still I have some new questions with these new tweaks.
Suddenly I have gotten this peculiar bump in memory usage. Take a look at the following heap profiles:
Why has this happened? Is it because of the UnboxedArray? And what does SYSTEM mean?
When I compile via C I get the following warning:
Warning: The -fvia-C flag does nothing; it will be removed in a future GHC release
Is this true? Why then, do I see better performance using it, rather than not?
It looks like you did a fair amount of tweaking already; I'm curious what the performance is like without explicit strictness annotations (BangPatterns) and the various compiler pragmas (UNPACK, INLINE)... Also, a dumb question: what optimization flags are you using?
Anyway, two suggestions which may be completely awful:
Use unboxed primitive types where you can (e.g. replace Data.Word.Word64 with GHC.Word.Word64#, make sure word128Shift is using Int#, etc.) to avoid heap allocation. This is, of course, non-portable.
Try Data.Sequence instead of []
At any rate, rather than looking at the Core output, try looking at the intermediate C files (*.hc) instead. It can be hard to wade through, but sometimes makes it obvious where the compiler wasn't quite as sharp as you'd hoped.

Can someone explain to me NaN in Ruby?

I just found a bug in some number manipulations in my program and I'm getting a FloatDomainError (NaN)
So I started logging the number passed in with:
if(metric.is_a?(Numeric))
  self.metric = metric
else
  LOGGER.warn("metric #{metric} is not a number")
  self.metric = 0
end
But the number being passed in is NaN, which apparently is_a?(Numeric), as I don't get my log warning; it passes metric on to my metric= method, which is where I get the FloatDomainError.
Now, correct me if I'm wrong, but doesn't it seem semantically wrong to have a NaN (Not A Number) be of type Numeric? Can someone explain this to me?
BTW, I'm using JRuby 1.4.1.
I think that making NaN a number makes perfect sense...
Try 0.0 / 0.0 in irb -> the result is NaN.
Like infinity, NaN is mathematically a number-like concept, but you still can't express it as an ordinary numeric value; in math you use a special symbol for it too...
PS: You can use metric.nan? to check for it; then it should work as you expect...
IEEE 754 floating point defines -INFINITY, +INFINITY and NaN (Not a Number) to make it possible to react to, say, division by zero. You can also calculate with these, e.g. 2 + INF = INF.
NaN isn't a unique Ruby feature; these values are numeric in Java, C++, ... too.
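(The same special values can be seen in C, which may make the "NaN is still numeric" point clearer; this is just an illustration, not Ruby.)

#include <math.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;
    double nan_value = zero / zero;   /* NaN, but still a value of type double */
    double inf = INFINITY;

    printf("2 + INF    = %f\n", 2.0 + inf);              /* inf */
    printf("NaN == NaN = %d\n", nan_value == nan_value); /* 0: NaN compares unequal to everything */
    printf("isnan(NaN) = %d\n", isnan(nan_value));       /* 1 */
    return 0;
}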

gcc precision bug?

I can only assume this is a bug. The first assert passes while the second fails:
double sum_1 = 4.0 + 6.3;
assert(sum_1 == 4.0 + 6.3);
double t1 = 4.0, t2 = 6.3;
double sum_2 = t1 + t2;
assert(sum_2 == t1 + t2);
If not a bug, why?
This is something that has bitten me, too.
Yes, floating point numbers should never be compared for equality because of rounding error, and you probably knew that.
But in this case, you're computing t1+t2, then computing it again. Surely that has to produce an identical result?
Here's what's probably going on. I'll bet you're running this on an x86 CPU, correct? The x86 FPU uses 80 bits for its internal registers, but values in memory are stored as 64-bit doubles.
So t1+t2 is first computed with 80 bits of precision, then -- I presume -- stored out to memory in sum_2 with 64 bits of precision -- and some rounding occurs. For the assert, it's loaded back into a floating point register, and t1+t2 is computed again, again with 80 bits of precision. So now you're comparing sum_2, which was previously rounded to a 64-bit floating point value, with t1+t2, which was computed with higher precision (80 bits) -- and that's why the values aren't exactly identical.
Edit So why does the first test pass? In this case, the compiler probably evaluates 4.0+6.3 at compile time and stores it as a 64-bit quantity -- both for the assignment and for the assert. So identical values are being compared, and the assert passes.
Second Edit Here's the assembly code generated for the second part of the code (gcc, x86), with comments -- pretty much follows the scenario outlined above:
// t1 = 4.0
fldl LC3
fstpl -16(%ebp)
// t2 = 6.3
fldl LC4
fstpl -24(%ebp)
// sum_2 = t1+t2
fldl -16(%ebp)
faddl -24(%ebp)
fstpl -32(%ebp)
// Compute t1+t2 again
fldl -16(%ebp)
faddl -24(%ebp)
// Load sum_2 from memory and compare
fldl -32(%ebp)
fxch %st(1)
fucompp
Interesting side note: This was compiled without optimization. When it's compiled with -O3, the compiler optimizes all of the code away.
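(A hedged sketch of a workaround that follows from the explanation above; whether the plain assert actually fails depends on the compiler and on whether x87 or SSE code is generated. Forcing both sides of the comparison through a 64-bit memory store, here with volatile, makes both get rounded the same way.)

#include <assert.h>

int main(void)
{
    double t1 = 4.0, t2 = 6.3;

    /* The plain form, double sum_2 = t1 + t2; assert(sum_2 == t1 + t2);,
       may fail with x87 code generation, because the right-hand side is
       compared at 80-bit register precision while sum_2 was rounded to
       64 bits when stored. */

    volatile double sum_2 = t1 + t2;  /* volatile forces a 64-bit store */
    volatile double rhs   = t1 + t2;  /* the other side gets the same rounding */
    assert(sum_2 == rhs);

    return 0;
}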
You are comparing floating point numbers. Don't do that; floating point numbers have inherent precision error in some circumstances. Instead, take the absolute value of the difference of the two values and assert that it is less than some small number (epsilon).
#include <assert.h>
#include <math.h>

void CompareFloats( double d1, double d2, double epsilon )
{
    assert( fabs( d1 - d2 ) < epsilon );  /* fabs, not abs: abs() operates on ints */
}
This has nothing to do with the compiler and everything to do with the way floating point numbers are implemented. Here is the IEEE spec:
http://www.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
I've duplicated your problem on my Intel Core 2 Duo, and I looked at the assembly code. Here's what's happening: when your compiler evaluates t1 + t2, it does
load t1 into an 80-bit register
load t2 into an 80-bit register
compute the 80-bit sum
When it stores into sum_2 it does
round the 80-bit sum to a 64-bit number and store it
Then the == comparison compares the 80-bit sum to a 64-bit sum, and they're different, primarily because the fractional part 0.3 cannot be represented exactly using a binary floating-point number, so you are comparing a 'repeating decimal' (actually repeating binary) that has been truncated to two different lengths.
What's really irritating is that if you compile with gcc -O1 or gcc -O2, gcc does the wrong arithmetic at compile time, and the problem goes away. Maybe this is OK according to the standard, but it's just one more reason that gcc is not my favorite compiler.
P.S. When I say that == compares an 80-bit sum with a 64-bit sum, of course I really mean it compares the extended version of the 64-bit sum. You might do well to think
sum_2 == t1 + t2
resolves to
extend(sum_2) == extend(t1) + extend(t2)
and
sum_2 = t1 + t2
resolves to
sum_2 = round(extend(t1) + extend(t2))
Welcome to the wonderful world of floating point!
When comparing floating point numbers for closeness you usually want to measure their relative difference, which is defined as
if (abs(x) != 0 || abs(y) != 0)
    rel_diff(x, y) = abs((x - y) / max(abs(x), abs(y)))
else
    rel_diff(x, y) = max(abs(x), abs(y))
For example,
rel_diff(1.12345, 1.12367) = 0.000195787019
rel_diff(112345.0, 112367.0) = 0.000195787019
rel_diff(112345E100, 112367E100) = 0.000195787019
The idea is to measure the number of leading significant digits the numbers have in common; if you take the -log10 of 0.000195787019 you get 3.70821611, which is about the number of leading base 10 digits all the examples have in common.
If you need to determine if two floating point numbers are equal you should do something like
if (rel_diff(x,y) < error_factor * machine_epsilon()) then
print "equal\n";
where machine epsilon is the difference between 1.0 and the next larger value representable by the floating point hardware being used. Most computer languages have a function call or constant to get this value. error_factor should be based on the number of significant digits you think will be consumed by rounding errors (and others) in the calculations of the numbers x and y. For example, if I knew that x and y were the result of about 1000 summations and did not know any bounds on the numbers being summed, I would set error_factor to about 100.
Tried to add these as links but couldn't since this is my first post:
en.wikipedia.org/wiki/Relative_difference
en.wikipedia.org/wiki/Machine_epsilon
en.wikipedia.org/wiki/Significand (mantissa)
en.wikipedia.org/wiki/Rounding_error
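(One possible C rendering of the relative-difference test described above; error_factor = 100 is just the example value from the text, not a universal constant.)

#include <float.h>
#include <math.h>
#include <stdio.h>

/* Relative difference as defined above. */
static double rel_diff(double x, double y)
{
    double m = fmax(fabs(x), fabs(y));
    return (m != 0.0) ? fabs((x - y) / m) : 0.0;
}

/* Approximate equality: the tolerance scales with machine epsilon. */
static int nearly_equal(double x, double y, double error_factor)
{
    return rel_diff(x, y) < error_factor * DBL_EPSILON;
}

int main(void)
{
    printf("%.12f\n", rel_diff(1.12345, 1.12367));        /* ~0.000195787 */
    printf("%d\n", nearly_equal(4.0 + 6.3, 10.3, 100.0)); /* 1 on typical setups */
    return 0;
}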
It may be that in one of the cases, you end up comparing a 64-bit double to an 80-bit internal register. It may be enlightening to look at the assembly instructions GCC emits for the two cases...
Comparisons of double precision numbers are inherently inaccurate. For instance, two computations that should mathematically both give 0.0 can easily compare as unequal. This is due to the way the FPU stores and tracks numbers.
Wikipedia says:
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.
You will need to use a delta to give a tolerance for your comparisons, rather than an exact value.
This "problem" can be "fixed" by using these options:
-msse2 -mfpmath=sse
as explained on this page:
http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html
Once I used these options, both asserts passed.

Resources