I was going through the Go tutorial on golang.org and I came across an example that I only partially understand...
MaxInt uint64 = 1<<64 - 1
Now I understand this to be shifting the bit 64 places to the left, which would make it a 1 followed by 64 zeroes.
My question is: why is this the maximum integer that can be held in a 64-bit number? Wouldn't the max integer be 111111111... (until the 64th 1) instead of 100000... (a 1 followed by 64 zeroes)?
What happens here, step by step:
Take 1.
Shift it to the left 64 bits. This is tricky: the result actually needs 65 bits for its representation, namely a 1 followed by 64 zeroes. Since we are calculating a 64-bit value here, why does this even compile instead of overflowing to 0 or 1, or producing a compile error?
It works because the arithmetic used to calculate constants in Go is a bit magic (https://blog.golang.org/constants), in that it has nothing whatsoever to do with the type of the named constant being calculated. You can say foo uint8 = (1<<415) / (1<<414) and foo is now 2 (the parentheses matter, since << and / share the same precedence in Go and associate left to right).
Subtract 1. This brings us back into 64-bit numbers, as the result is actually 11....1 (64 ones), which is indeed the maximum value of uint64. Without this step, the compiler would complain about us trying to cram a 65-bit value into a uint64.
Name the constant MaxInt and give it type uint64. Success!
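Both rules are easy to check in a few lines; a minimal sketch (the foo example is written with explicit parentheses, as noted above):

```go
package main

import "fmt"

// Constant arithmetic in Go is arbitrary-precision, so the
// intermediate 1<<64 (a 65-bit value) is fine; only the final
// result has to fit the declared type.
const MaxInt uint64 = 1<<64 - 1

// Intermediate values far wider than uint8 are allowed too;
// only the final result (2) must fit in uint8.
const foo uint8 = (1 << 415) / (1 << 414)

func main() {
	fmt.Println(MaxInt) // 18446744073709551615
	fmt.Println(foo)    // 2
}
```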
The magic arithmetic used to calculate constants still has limitations (obviously). Shifts greater than 500 or so produce amusingly named stupid shift errors.
I am working with the MIPS assembly language but am confused about the overflow aspect of arithmetic here.
Say I am subtracting 25 from 20 and end up with -5. Would this result in an overflow?
I understand that with addition, if you add 2 positive numbers or 2 negative numbers and the output is the opposite sign, then there is overflow, but I am lost when it comes to subtraction.
Let's find examples at the extremes. We'll do it in 8-bit signed arithmetic to keep it simple, but the same principle holds in 32 bits.
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the largest possible number, 127
When we do 0 - 127, the answer is -127, and that indeed fits in 8-bit signed. There is a borrow. In any processor, the effect of the borrow is to propagate 1's throughout the upper bits, making it a negative number of the proper magnitude.
Different processors set flags differently based on this borrow: MIPS doesn't have flags, while x86 will set its flag to indicate a borrow, and other processors will set their flag to indicate a carry.
In 8 bit signed numbers:
minuend: the smallest possible positive (non-negative) number, 0 and
subtrahend: the most negative possible number, -128
When we do 0 - -128, the answer should be 128, but that cannot be represented in 8 bit signed format, so this is the example of overflow. 0 - -127 = 127 and that can be represented, so no overflow there.
If we do it in 8-bit unsigned arithmetic, your example of 20 - 25 = -5, but -5 cannot be represented in an unsigned format, so this is indeed overflow (or modular arithmetic, if you like that).
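The question is about MIPS, but the extreme cases above are easy to check in any language with fixed-width integers. Here is a sketch in Go, whose int8 wraps on overflow instead of trapping:

```go
package main

import "fmt"

func main() {
	var minuend, subtrahend int8 = 0, 127
	fmt.Println(minuend - subtrahend) // -127: fits in int8, no overflow

	// Mathematically 0 - (-128) = 128, but 128 does not fit in
	// int8 (max 127), so the result wraps: this is the overflow case.
	subtrahend = -128
	fmt.Println(minuend - subtrahend) // -128

	subtrahend = -127
	fmt.Println(minuend - subtrahend) // 127: fits, no overflow
}
```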
Short answer: Yes, as the 32-bit representation of -5 is FFFFFFFB.
Long answer: Depends on what you mean by "overflow".
There's signed overflow, which is when you cross the 7FFFFFFF-80000000 boundary.
And there's unsigned overflow where you cross the FFFFFFFF-00000000 boundary.
For signed arithmetic, signed overflow is undeniably a bad thing (and is considered undefined behavior in C and other languages). However, unsigned overflow is not necessarily a problem. Usually it is, but many procedures rely on it to work.
For example, imagine you have a "frame timer" variable, i.e. a 32-bit counter variable that increments by 1 during an interrupt service routine. This interrupt is tied to a real-time clock running at 60 hertz, so every 1/60th of a second the variable's value increases by 1.
Now, this variable will overflow eventually. But do we really care? No. It just wraps around back to zero again. For our purposes it's fine, since we really don't need to know exactly how long our program has been running since it began. We probably have events that occur every n ticks of the timer, but we can just use a bitmask for that. Effectively, in this case, we're using unsigned overflow to say "if this value equals FFFFFFFF and we're about to add 1 to it, reset it to zero instead", which, thanks to overflow, we can implement without any additional condition checking.
The reason I bring this up is so that you understand that overflow is not always a bad thing, if it's the unsigned variety. It depends entirely on what your data is intended to represent (which is something you can't explain even to a C compiler.)
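A sketch of that frame-timer idea in Go, with the 60 Hz interrupt replaced by a plain increment, since this is only about the wraparound (the mask value of 64 is a made-up example):

```go
package main

import "fmt"

func main() {
	// Pretend the ISR has been running a while and the counter
	// is one tick away from its maximum value.
	var ticks uint32 = 0xFFFFFFFF

	// The next "interrupt" increments it; unsigned arithmetic
	// wraps, so no reset-to-zero check is needed.
	ticks++
	fmt.Println(ticks) // 0

	// Events every n ticks via a bitmask (n a power of two):
	const mask = 64 - 1 // hypothetical: fire every 64 ticks
	if ticks&mask == 0 {
		fmt.Println("event fires on this tick")
	}
}
```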
I'm working on a problem out of Cracking The Coding Interview which requires that I swap odd and even bits in an integer with as few instructions as possible (e.g bit 0 and 1 are swapped, bits 2 and 3 are swapped, etc.)
The author's solution revolves around using a mask to grab, in one number, the odd bits, and in another number the even bits, and then shifting each group over by 1.
I get her solution, but I don't understand how she grabbed the even/odd bits. She creates two bit masks, both in hex, for a 32-bit integer: 0xaaaaaaaa and 0x55555555. I understand she's essentially creating the equivalent of 1010101010... for a 32-bit integer in hexadecimal and then ANDing it with the original number to grab the even/odd bits respectively.
What I don't understand is why she used hex? Why not just code in 10101010101010101010101010101010? Did she use hex to reduce verbosity? And when should you use one over the other?
It's to reduce verbosity. Binary 10101010101010101010101010101010, hexadecimal 0xaaaaaaaa, and decimal 2863311530 all represent exactly the same value; they just use different bases to do so. The only reason to use one or another is for perceived readability.
Most people would clearly not want to use decimal here; it looks like an arbitrary value.
The binary is clear: alternating 1s and 0s, but with so many, it's not obvious that this is a 32-bit value, or that there isn't an adjacent pair of 1s or 0s hiding in the middle somewhere.
The hexadecimal version takes advantage of chunking. Assuming you recognize that 0xa == 0b1010, you can mentally picture the 8 groups of 1010 in the value.
Another possibility would be octal 25252525252, since... well, maybe not. You can see that something is alternating, but unless you use octal a lot, it's not clear what that alternating pattern in binary is.
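The mask-and-shift swap described in the question can be sketched in Go (a translation, since the book's own code isn't reproduced here):

```go
package main

import "fmt"

// swapOddEven swaps each even-positioned bit with its odd
// neighbor: grab the odd bits with 0xaaaaaaaa, the even bits
// with 0x55555555, shift each group toward the other, and OR.
func swapOddEven(x uint32) uint32 {
	odd := x & 0xaaaaaaaa  // bits 1, 3, 5, ...
	even := x & 0x55555555 // bits 0, 2, 4, ...
	return odd>>1 | even<<1
}

func main() {
	fmt.Printf("%08b\n", swapOddEven(0b1010)) // 00000101
	fmt.Printf("%08b\n", swapOddEven(0b0111)) // 00001011
}
```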
I'm writing a Radix-2 DIT FFT algorithm in VHDL, which requires some fractional multiplication of the input data by a Twiddle Factor (TF). I use fixed-point arithmetic to achieve that, with every word being 16 bits long, where 1 bit is the sign bit and the rest is distributed between integer and fraction. Therefore my dilemma:
I have no idea in what range my input data will be, so if I just decide that 4 bits go to the integer part and the remaining 11 bits to the fraction, then in case I get integer values higher than 4 bits can hold (15 decimal), I'm screwed. The same applies if I split it roughly 50/50, like 7 bits to the integer part and the rest to the fraction. And if I get numbers which are very small, I'm screwed because of truncation or rounding, i.e.:
Let's assume I have the integer "3" (0000 0011) on the input and a TF of "0.7071" (0.10110101 in 8 bits), and let's assume, for simplicity, that my data is 8 bits long, therefore:
3x0.7071 = 2.1213
3x0.7071 = 0000 0010 . 0001 1111 = 2.12109375 (for 16 bits).
Here comes the trick: I need to round (up or down) or truncate the 16 bits down to 8 bits, and therefore I get 0000 0010, i.e. 2. The error is way too high.
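The worked example above can be reproduced in a few lines. This sketch uses Go rather than VHDL just to show the arithmetic, in an assumed Q8.8 format (8 integer bits, 8 fraction bits; sign handling ignored for brevity):

```go
package main

import "fmt"

const fracBits = 8 // Q8.8: value = raw / 2^8

// fxMul multiplies two Q8.8 values. The raw product is Q16.16,
// so shifting right by fracBits brings it back to Q8.8.
func fxMul(a, b int32) int32 {
	return (a * b) >> fracBits
}

func main() {
	a := int32(3 << fracBits) // 3.0 in Q8.8
	tf := int32(181)          // 0.10110101 = 181/256 = 0.70703125 ≈ 0.7071

	p := fxMul(a, tf)
	fmt.Println(float64(p) / 256) // 2.12109375: the full Q8.8 product

	// Truncating the result to its 8 integer bits, as in the
	// question, throws away the entire fraction:
	fmt.Println(p >> fracBits) // 2, the high relative error complained about
}
```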
My questions are:
How would you solve this problem of range vs precision if you don’t know the range of your input data AND you would have numbers represented in fixed point?
Should I make a process which decides after every multiplication where to put the binary point? Wouldn't that make the multiplication slower?
The Xilinx IP core has 3 different modes for fixed-point arithmetic: Unscaled (similar to what I want to do: just truncate in case overflow happens), Scaled fixed point (I would assume that in this case it decides after each multiplication where the binary point should be and what should be rounded), and Block Floating Point (no idea what it is or how it works; I would appreciate an explanation). So how does this IP core decide where to put the binary point? If the decision is made based on the highest value in my dataset, then in case I have just 1 high peak and the rest of the data is low, the error will be very high.
I will appreciate any ideas or information on any known methods.
You don't need to know the fixed-point format of your input. You can safely treat it as a normalized -1 to 1 range, or as full integer range.
The reason is that your output will have the same format as the input. Or, more likely for an FFT, a known relationship such as a 3-bit growth, meaning the output has 3 more integer bits than the input.
It is the core user's burden to know where the binary point will end up; you just have to document the change to the dynamic range, of course.
I've got a question which I have solved, along with many other similar ones successfully completed as part of my assignment. I am having a little confusion with one of the questions, which is...
"The decimal number -256 is held in a 9 bit memory location. represent this in sign and magnitude."
The answer that I got is: 11 0000 0000
how i got this is by doing the following:
We first show the binary form and then set the most significant bit (the bit on the far left; 0 represents a positive value and 1 represents a negative value). The sign bit determines whether it is a positive or negative value, and the remaining bits give the magnitude.
Notice that I've added an extra bit to show the representation of -256 in sign and magnitude. This is simply because 9 bits are not enough to show whether 256 is negative or positive, as the magnitude is maxed out. The lower 9 bits give the magnitude, which is 256, and the 10th bit (on the far left) shows the sign, which is '1', indicating that it is a negative 256.
What I find confusing is that the decimal -256 was held in a 9-bit memory location, and the magnitude I got was 1 0000 0000. I have added an extra bit to show that the number is negative, which makes the representation 10 bits. I am having difficulty with how to represent -256 in sign and magnitude using only 9 bits; it seems impossible, and I have therefore used 10 bits, but I am only allowed to use 9. Could someone help with how this could be achieved? Your help will be greatly appreciated. I am just a bit stuck with this. My tutors expect us to use the internet or our own knowledge and would not give us even a clue, so that's why I am here.
I know this is a bit late, but I wondered this too in my homework just now and looked it up.
The maximum magnitude, given w bits, in sign-magnitude representation is 2^(w-1) - 1.
The decimal equivalent of 100000000 is 256.
Given that the number of bits is 9, 2^(8)-1 = 255.
So it would be impossible to store 256 given 9 bits in sign magnitude.
It is impossible to represent -256 in sign-magnitude with 9 bits, simply because it is not possible to represent 256 in the remaining 8 bits (8 bits of precision affords 2^8 = 256 different possible values, so you can represent from 0 up to 255, but no further).
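The range argument above is easy to check with a couple of lines; a sketch in Go:

```go
package main

import "fmt"

func main() {
	const w = 9 // total bits: 1 sign bit + 8 magnitude bits

	// In sign-magnitude, w bits cover the range
	// -(2^(w-1) - 1) to +(2^(w-1) - 1).
	maxMag := 1<<(w-1) - 1
	fmt.Println(maxMag) // 255: so 256 (and hence -256) does not fit
}
```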
I am currently working on a framework which transforms C to VHDL, and I am getting stuck on the implementation of long long division. My framework can only work on 32-bit variables, so parsing a C long long variable results in 2 VHDL variables, one containing the most significant part and one containing the least significant part. So to sum up, from this:
long long a = 1LL;
The VHDL which will be generated will be something like:
var30 <= "00000000000000000000000000000000";
var31 <= "00000000000000000000000000000001";
Now my problem is: how can I divide 2 long long parameters (in VHDL), since they are split into 2 variables? I had no problem with addition/subtraction, since I can work on the most (resp. least) significant part independently (just a carry to propagate), but I really don't see how I could perform a division, since with this kind of operation the least and the most significant parts are really bound together... If someone has an idea, it would be much appreciated.
PS: I have the same problem for multiplication.
EDIT: I work on both signed and unsigned variables, and the result should be a 64-bit variable.
For both the multiplication and the division problem you can break the problem down like this: consider that each 64-bit value x can be expressed as k*x.hi + x.lo, where x.hi is the upper 32 bits, x.lo is the lower 32 bits, and k = 2^32. So for multiplication:
a*b = (a.hi*k+a.lo)*(b.hi*k+b.lo)
= a.hi*b.hi*k*k + (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
If you just want a 64 bit result then the first term disappears and you get:
a*b = (a.hi*b.lo + a.lo*b.hi)*k + a.lo*b.lo
Remember that, in general, multiplication doubles the number of bits, so each 32-bit × 32-bit multiply in the above expressions will generate a 64-bit term. For the cross terms a.hi*b.lo and a.lo*b.hi, only the low 32 bits survive the multiplication by k in a 64-bit result, but for the last term a.lo*b.lo you need both the low and high 32 bits.
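As a sanity check of the identity above, here is a sketch in Go that multiplies two 64-bit values using only their 32-bit halves and keeps the low 64 bits of the result (Go is used purely for convenience; the structure carries over to the VHDL generator):

```go
package main

import "fmt"

// mul64 computes the low 64 bits of a*b using only 32-bit halves,
// mirroring the k*x.hi + x.lo decomposition with k = 2^32.
func mul64(a, b uint64) uint64 {
	const lo32 = 0xffffffff
	aHi, aLo := a>>32, a&lo32
	bHi, bLo := b>>32, b&lo32

	// a*b = aHi*bHi*k*k + (aHi*bLo + aLo*bHi)*k + aLo*bLo.
	// The k*k term lies entirely above bit 63, so it vanishes from
	// a 64-bit result. Each product below is a 32 x 32 -> 64 bit
	// multiply, exactly what 32-bit hardware can provide.
	cross := aHi*bLo + aLo*bHi // only its low 32 bits survive the <<32
	return cross<<32 + aLo*bLo
}

func main() {
	a, b := uint64(0x123456789abcdef0), uint64(0xfedcba9876543210)
	fmt.Println(mul64(a, b) == a*b) // true
}
```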