According to the go doc, %b used with floating number means:
decimalless scientific notation with exponent a power of two,
in the manner of strconv.FormatFloat with the 'b' format,
e.g. -123456p-78
As the code shows below, the program output is
8444249301319680p-51
I'm a little confused about %b in floating number, can anybody tell me how this result is calculated? Also what does p- mean?
f := 3.75
fmt.Printf("%b\n", f)
fmt.Println(strconv.FormatFloat(f, 'b', -1, 64))
The decimalless scientific notation with exponent a power of two that means follows:
8444249301319680*(2^-51) = 3.75 or 8444249301319680/(2^51) = 3.75
p-51 means 2^-51 which can also be calculated as 1/(2^51)
Nice article on Floating-Point Arithmetic.
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
The five rules of scientific notation are given below:
The base is always 10
The exponent must be a non-zero integer, which means it can be either positive or negative
The absolute value of the coefficient is greater than or equal to 1 but it should be less than 10
The coefficient carries the sign (+) or (-)
he mantissa carries the rest of the significant digits
p
%b scientific notation with exponent a power of two (its p)
%e scientific notation
It is worth pointing out that the %b output is particularly easy for the runtime system to generate as well, due to the internal storage format for floating point numbers.
If we ignore "denormalized" floating point numbers (we can add them back later), a floating point number is stored, internally, as 1.bbbbbb...bbb x 2exp for some set of bits ("b" here), e.g., the value four is stored as 1.000...000 <exp> 2. The value six is stored as 1.100...000 <exp> 2, the value seven is stored as 1.110...000 <exp> 2, and eight is stored as 1.000...000 <exp> 3. The value seven-and-a-half is 1.111 <exp> 2, seven and three quarters is 1.1111 <exp> 2, and so on. Each bit here, in the 1.bbbb, represents the next power of two lower than the exponent.
To print out 1.111 <exp> 2 with the %b format, we simply note that we need four 1 bits in a row, i.e., the value 15 decimal or 0xf or 1111 binary, which causes the exponent to need to be decreased by 3, so that instead of multiplying by 22 or 4, we want to multiply by 2-1 or ½. So we can take the actual exponent (2), subtract 3 (because we moved the "point" three times to print 1111 binary or 15), and hence print out the string 15p-1.
That's not what Go's %b prints though: it prints 8444249301319680p-50. This is the same value (so either one would be correct output)—but why?
Well, 8444249301319680 is, in hexadecimal, 1E000000000000. Expanded into full binary, this is 1 1110 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000. That's 53 binary digits. Why 53 binary digits, when four would suffice?
The answer to that is found in the link in Nick's answer: IEEE 754 floating point format uses a 53-digit "mantissa" or "significand" (the latter is the better term and the one I usually try to use, but you'll see the former pop up very often). That is, the 1.bbb...bbb has 52 bs, plus that forced-in leading 1. So there are always exactly 53 binary digits (for IEEE "double precision").
If we just treat this 53-binary-digit number as a decimal number, we can always print it out without a decimal point. That means we just adjust the power-of-two exponent.
In IEEE754 format, the exponent itself is already stored in "excess form", with 1023 added (for double precision again). That means that 1.111000...000 <exp> 2 is actually stored with an exponent value of 2+1023 = 1025. What this means is that to get the actual power of two, the machine code formatting the number is already going to have to subtract 1023. We can just have it subtract 52 more at the same time.
Last, because the implied 1 is always there, the internal IEEE754 number doesn't actually store the 1 bit. So to read out the value and convert it, the code internally does:
decimalPart := machineDependentReinterpretation1(&doubleprec_value)
expPart := machineDependentReinterpretation2(&doubleprec_value)
where the machine-dependent-reinterpretation simply extracts the correct bits, puts in the implied 1 bit as needed in the decimal part, subtracts the offset (1023+52) for the exponent part, and then does:
fmt.Sprint("%dp%d", decimalPart, expPart)
When printing a floating-point number in decimal, the base conversion (from base 2 to base 10) is problematic, requiring a lot of code to get the rounding right. Printing it in binary like this is much easier.
Exercises for the reader, to help with understanding this:
Compute 1.102 x 22. Note: 1.12 is 1½ decimal.
Compute 11.02 x 21. (11.02 is 3.)
Based on the above, what happens as you "slide the binary point" left and right?
(more difficult) Why can we assume a leading 1? If necessary, read on.
Why we can assume a leading 1?
Let's first note that when we use scientific notation in decimal, we can't assume a leading 1. A number might be 1.7 x 103, or 5.1 x 105, or whatever. But when we use scientific notation "correctly", the first digit is never zero. That is, we do not write 0.3 x 100 but rather 3.0 x 10-1. In this kind of notation, the number of digits tells us about the precision, and the first digit never has to be zero and generally isn't supposed to be zero. If the first digit were zero, we just move the decimal point and adjust the exponent (see exercises 1 and 2 above).
The same rules apply with floating-point numbers. Instead of storing 0.01, for instance, we just slide the binary point two over two positions and get 1.00, and decrease the exponent by 2. If we might want to have stored 11.1, we slide the binary point one position the other way and increase the exponent. Whenever we do this, the first digit always winds up being a one.
There is one big exception here, which is: when we do this, we can't store zero! So we don't do this for the number 0.0. In IEEE754, we store 0.0 as all-zero-bits (except for the sign, which we can set to store -0.0). This has an all-zero exponent, which the computer hardware handles as a special case.
Denormalized numbers: when we can't assume a leading 1
This system has one notable flaw (which isn't entirely fixed by denorms, but nonetheless, IEEE has denorms). That is: the smallest number we can store "abruptly underflows" to zero. Kahan has a 15 page "brief tutorial" on gradual underflow, which I am not going to attempt to summarize, but when we hit the minimum allowed exponent (2-1023) and want to "get smaller", IEEE lets us stop using these "normalized" numbers with the leading 1 bit.
This doesn't affect the way that Go itself formats floating point numbers, because Go just takes the entire significand "as is". All we have to do is stop inserting the 253 "implied 1" when the input value is a denormalized number, and everything else Just Works. We can hide this magic inside the machine-dependent float64 reinterpretation code, or do it explicitly in Go, whichever is more convenient.
I was doing my exam prep and I have come across a problem that ive been having issues with mainly because of the lack of info provided. The question is
b.What integer does the 16 bit word F751 represent in the LC-3?
So do we convert the base 16 to base 10 or base 2, Im not really sure how to do this problem.
Take f751 and convert to binary
1111 0111 0101 0001
The most significant bit is 1 so we know the number is negative, so take the 2s complement
0000 1000 1010 1111
And Convert to decimal -2223
The High digit is greater or equal to 8 so the number is negative.
Take the complement to F (fifteen) of each digit: f751
f give 0
7 give 8
5 give A
1 give E
08AE is the 1 complement
08AF is the 2 complement which is in decimal -2223
This prevent to convert to binary
Let's say we want to represent a signed number with 5 bits where the first bit is used for the sign (+ or -) of the number. Then the zero can be represented by two bit representations (10000 and 00000).
How is this problem solved?
Okay. There are always two bit in binary 1 or 0
And then there could be any number of bits for example 1bit to 64bit
If the question is 5-bit string then it should be XXXXX where X can be any bit(1 or 0)
First bit(sign bit) we can have either +0 and -0. (Thanks #machinery)
So if it is positive, we put 0 at first position and if it is negative, we put 1 at first position.
Four Bits
Now, we got our first bit, we are left with another 4-bits 0XXXX or 1XXXX as the question asked for 0,
the rest bit will be zero.
therefore the answer is 00000 or 10000
Look how to convert decimal to binary and binary to decimal.
I am trying to understand why the double-dabble algorithm is working, but I am not getting it.
There are a lot of great descriptions of the steps and the purpose of the algorithm, like
http://www.classiccmp.org/cpmarchives/cpm/mirrors/cbfalconer.home.att.net/download/dubldabl.txt or
http://en.wikipedia.org/wiki/Double_dabble
There are also some attempts of explanation. The best one I found is this one:
http://www.minecraftforum.net/forums/minecraft-discussion/redstone-discussion-and/340153-why-does-the-double-dabble-algorithm-work#c6
But I still feel like I am missing the connecting parts. Here is what I get:
I get, that you can convert binary numbers to decimal numbers by reading them from left to right, starting with a decimal value of 0, iterating over the digits of the binary number, adding 1 to the decimal number for every 1 you reach in the binary number, and multiplying by 2, when you move on to the next digit (as explained in the last link).
But why does this lead to the double-dabble-algorithm? We don't want to convert in decimal, we want to convert in BCD. Why are we keeping the multiplying (shifting), but we drop the adding of 1? Where is the connection?
I get, why we have to add 3, when a number in a BCD-field exceeds 4 before shifting. Because if you want that the BCD number gets multiplied by 2 if you shift it, you have to do some fixups. If you shift the BCD number 0000 1000 (8) and you want to get the double 0001 0011 (16), you have to add 3 (the half of 6) before shifting, because by just shifting you end up with 0001 0000 (10) (you're missing 6).
But why do we want this and where is the adding of 1 happening?
I guess I am just missing a little part.
Converting N to its decimal representation involves repeatedly determining R (rest) after dividing by 10. In stead of dividing by 10 you might also divide by 2 (which is just a shift) followed by a division by 5. Hence the 5. Long division means trying to subtract 5 and if that succeeds actually doing the subtraction while keeping track of that success by setting a bit in Q. Adding 3 is the same as subtracting 5 while setting the bit that will subsequently be shifted into Q. Hence the 3.
16 in binary is 10 in bcd.
We want 16 in bcd, so we add 6.
but why we add 3 and not 6?
Because adding is done before shifting, so it’s all divided by two, that’s why we add 3 when higher than 5!
I think I got it while writing this question:
Suppose you want to convert 1111 1111 from binary into BCD.
We use the method to convert the binary number to a decimal number, explained in the question, but we alter it a little bit.
We don't start with a decimal number of 0 but with a BCD number of 0000 0000 (0).
BCD binary
0000 0000 1111 1111
First we have to add 1 to the BCD-number. This can be done by a simple shift
0000 0001 1111 1110
Now we move on and want to multiply the BCD-Number by 2. In the next Step we want to add the current binary digit to the BCD-number. Both can be accomplished in one step by (again) shifting.
0000 0011 1111 1100
This works over and over again. The only situation in which this doesn't work, is when a block of the BCD-numer exceeds 4. In this case you have to do the fixup explained in the question.
After iterating through the binary number, you get the BCD-representation of the number on the left side \o/
The whole idea is to use the shifting explained in your link, but then convert the number on the left into BCD. At each stage, you are getting closer to the actual binary number on the left, but making sure that the number remains in BCD rather than binary.
When you shift in a '1', you are essentially adding it.
Take a look at the link below to get the gist of the 'add 3' argument:
https://olduino.files.wordpress.com/2015/03/why-does-double-dabble-work.pdf
i am asked to show the calculation of 11+6 using 5 bit two's complement. i find some rules confusing but have come up with the answer as shown below. please let me know if it is correct or what needs to be done if it is wrong.
two's complement of 11 is
01011
two's complement of 6 is
00110
now adding them using carriers if required:
01011 -----11 in binary
00110 -----6 in binary
10001 ---total
which in decimal is 17. is this the correct method of working out??? because my result in binary shows 10001. isn't 10001 supposed to mean -1 because the first bit in two's complement is a sign bit. please help me solve this if it is wrong. thanks for your help.
Let's explain how this works. You know that 1 is represent as 00001 in 5-bit binary representation. To get -1, a known method for (2's complement) is to :
Invert all bits, which means 11110.
Add 1 to the previous result which lead to 11111 equals to 31 in decimal base (unsigned).
Thus, 10001 is not equal to -1.
Now, let's take 11 (base 10) as an example. We know that 11 equals to 01011.
Invert all bits => 10100
add 1 => 10101 which equals to 21 in unsigned mode or -11 in signed mode.
You can then deduce that 10001 is -15 in signed mode or 17 in unsigned mode.
Also be careful, a signed 5-bit integer is bounded from -16 to 15. An unsigned 5-bit Integer is bounded from 0 to 31. In your case, the answer is 17 which means it's an unsigned integer or -15 as a signed integer.
Hope that helps you.