Why does making one enum variant an `f64` increase the size of this enum?

I have created three enums that are nearly identical:
#[derive(Clone, Debug)]
pub enum Smoller {
    Int(u8),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}
#[derive(Clone, Debug)]
pub enum Smol {
    Float(f32),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}
#[derive(Clone, Debug)]
pub enum Big {
    Float(f64),
    Four([u8; 4]),
    Eight([u8; 8]),
    Twelve([u8; 12]),
    Sixteen([u8; 16]),
}
pub fn main() {
    println!("Smoller: {}", std::mem::size_of::<Smoller>()); // => Smoller: 17
    println!("Smol: {}", std::mem::size_of::<Smol>()); // => Smol: 20
    println!("Big: {}", std::mem::size_of::<Big>()); // => Big: 24
}
What I expect, given my understanding of computers and memory, is that these should be the same size. The biggest variant is the [u8; 16] with a size of 16. Therefore, while these enums do have a different size first variant, they have the same size of their biggest variants and the same number of variants total.
I know that Rust can do some optimizations to acknowledge when some types have gaps (e.g. pointers can collapse because we know that they won't be valid and 0), but this is really the opposite of that. I think if I were constructing this enum by hand, I could fit it into 17 bytes (only one byte being necessary for the discrimination), so both the 20 bytes and the 24 bytes are perplexing to me.
I suspect this might have something to do with alignment, but I don't know why and I don't know why it would be necessary.
Can someone explain this?
Thanks!

The size must be at least 17 bytes, because its biggest variant is 16 bytes big, and it needs an extra byte for the discriminant (the compiler can be smart in some cases, and put the discriminant in unused bits of the variants, but it can't do this here).
Also, the size of Big must be a multiple of 8 bytes to align f64 properly. The smallest multiple of 8 that is at least 17 is 24.
Similarly, Smol cannot be only 17 bytes, because its size must be a multiple of 4 bytes (the alignment of f32), so it is rounded up to 20. Smoller only contains u8, so it can be aligned to 1 byte and stay at 17 bytes.
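The rounding described in this answer can be sketched in a few lines of Rust. `round_up` and `size_is_multiple_of_align` are hypothetical helper names, not standard library functions, and the alignments 1, 4 and 8 are the common values assumed in the answer.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper: smallest multiple of `align` that is >= `size`.
/// `align` must be non-zero.
fn round_up(size: usize, align: usize) -> usize {
    (size + align - 1) / align * align
}

/// True if the size of `T` is a multiple of its alignment,
/// which Rust guarantees for every sized type.
fn size_is_multiple_of_align<T>() -> bool {
    size_of::<T>() % align_of::<T>() == 0
}
```

With 17 bytes of content (16 bytes of payload plus a one-byte discriminant), `round_up(17, 1)`, `round_up(17, 4)` and `round_up(17, 8)` give 17, 20 and 24, matching the three sizes printed in the question.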

As mcarton mentions, this is an effect of alignment of internal fields and alignment/size rules.
Alignment
Specifically, common alignments for built-in types are:
1: i8, u8.
2: i16, u16.
4: i32, u32, f32.
8: i64, u64, f64.
Note that I say common: in practice alignment is dictated by hardware, and on 32-bit architectures you could reasonably expect f64 to be 4-byte aligned. Further, the alignment of isize, usize and pointers varies between 32-bit and 64-bit architectures.
In general, for ease of use, the alignment of a compound type is the largest alignment of any of its fields, recursively.
Access to unaligned values is generally architecture specific; on some architecture it will crash (SIGBUS) or return erroneous data, on some it will be slower (x86/x64 not so long ago) and on others it may be just fine (newer x64, on some instructions).
Size and Alignment
In C, the size must always be a multiple of the alignment, because of the way arrays are laid out and iterated over:
Each element in the array must be at its correct alignment.
Iterating is done by incrementing the pointer by sizeof(T) bytes.
Thus the size must be a multiple of the alignment.
Rust has inherited this behavior^1 .
It's interesting to note that Swift decided to define a separate intrinsic, strideof, to represent the stride in an array, which allowed them to remove any tail-padding from the result of sizeof. It did cause some confusions, as people expected sizeof to behave like C, but allows compacting memory more efficiently.
Thus, in Swift, your enums could be represented as:
Smoller: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 17 bytes, alignof 1 byte.
Smol: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 20 bytes, alignof 4 bytes.
Big: [u8 x 16][discriminant] => sizeof 17 bytes, strideof 24 bytes, alignof 8 bytes.
Which clearly shows the difference between the size and the stride, which are conflated in C and Rust.
^1 I seem to remember some discussions over the possible switch to strideof, which did not come to fruition as we can see, but could not find a link to them.
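The difference between size and stride can be reproduced with `std::alloc::Layout`, which accepts a size that is not a multiple of the alignment and can pad it afterwards. This is only a sketch: `size_and_stride` is a hypothetical helper, and the 17-byte figure is the payload plus one discriminant byte from the discussion above.

```rust
use std::alloc::Layout;

/// Returns (size, stride) for a value of `size` bytes with alignment `align`.
/// The stride is the size rounded up to a multiple of the alignment.
/// Panics if `align` is not a power of two.
fn size_and_stride(size: usize, align: usize) -> (usize, usize) {
    let layout = Layout::from_size_align(size, align).expect("invalid layout");
    (layout.size(), layout.pad_to_align().size())
}
```

For 17 bytes the strides come out as 17, 20 and 24 for alignments 1, 4 and 8, which are the values Rust reports as `size_of` for the three enums.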

I think that it is because of the alignment requirements of the inner values.
u8 has an alignment of 1, so all works as you expect, and you get a whole size of 17 bytes.
But f32 has an alignment of 4 (technically it is arch-dependent, but that is the most likely value). So even if the discriminant is just 1 byte you get this layout for Smol::Float:
[discriminant x 1] [padding x 3] [f32 x 4] = 8 bytes
And then for Smol::Sixteen:
[discriminant x 1] [u8 x 16] [padding x 3] = 20 bytes
Why is this padding really necessary? Because it is a requirement that the size of a type must be a multiple of the alignment, or else arrays of this type will misalign.
Similarly, the alignment of f64 is 8, so you get a full size of 24, which is the smallest multiple of 8 that fits every variant together with the discriminant.
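The two layouts drawn above can be modelled with `#[repr(C)]` types, whose layout is specified, unlike the default enum representation. This is an illustrative model only, assuming f32 is 4-byte aligned; the type names `FloatVariant`, `SixteenVariant` and `SmolModel` are made up for this sketch and say nothing about how rustc actually lays out `Smol`.

```rust
/// Model of `Smol::Float`: a one-byte tag followed by the payload.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct FloatVariant {
    tag: u8,
    value: f32, // lands at offset 4 when f32 is 4-byte aligned
}

/// Model of `Smol::Sixteen`: a one-byte tag followed by 16 bytes.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy)]
struct SixteenVariant {
    tag: u8,
    bytes: [u8; 16], // alignment 1, so no padding after the tag
}

/// Overlay of both variants: it takes the largest size and the largest
/// alignment, and its size is padded to a multiple of that alignment.
#[allow(dead_code)]
#[repr(C)]
union SmolModel {
    float: FloatVariant,
    sixteen: SixteenVariant,
}
```

Under the stated assumption, `FloatVariant` occupies 8 bytes, `SixteenVariant` 17 bytes, and the union 20 bytes with alignment 4.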

Related

Gathering half-float values using AVX

Using AVX/AVX2 intrinsics, I can gather sets of 8 values, either 1,2 or 4 byte integers, or 4 byte floats using:
_mm256_i32gather_epi32()
_mm256_i32gather_ps()
But currently, I have a case where I am loading data that was generated on an nvidia GPU and stored as FP16 values. How can I do vectorized loads of these values?
So far, I found the _mm256_cvtph_ps() intrinsic.
However, input for that intrinsic is a __m128i value, not a __m256i value.
Looking at the Intel Intrinsics Guide, I see no gather operations that store 8 values into a __m128i register.
How can I gather FP16 values into the 8 lanes of a __m256 register? Is it possible to vector load them as 2-byte shorts into __m256i and then somehow reduce that to a __m128i value to be passed into the conversion intrinsic? If so, I haven't found intrinsics to do that.
UPDATE
I tried the cast as suggested by @peter-cordes, but I am getting bogus results from it. Also, I don't understand how that could work.
My 2-byte int values are stored in __m256i as:
0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX 0000XXXX
so how can I simply cast to __m128i where it needs to be tightly packed as
XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
Will the cast do that?
My current code:
__fp16* fielddensity = ...
__m256i indices = ...
__m256i msk = _mm256_set1_epi32(0xffff);
__m256i d = _mm256_and_si256(_mm256_i32gather_epi32(fielddensity,indices,2), msk);
__m256 v = _mm256_cvtph_ps(_mm256_castsi256_si128(d));
But the result doesn't seem to be 8 properly formed values. I think every 2nd one is currently bogus for me?
There is indeed no gather instruction for 16bit values so you need to gather 32 bit values and ignore one half of them (and make sure that you don't accidentally read from invalid memory). Also, _mm256_cvtph_ps() needs all input values in the lower 128 bit lane and unfortunately, there is no lane-crossing 16 bit shuffle (until AVX512).
However, assuming you have only finite input values, you could do some bit-twiddling (avoiding the _mm256_cvtph_ps()). If you load a half precision value into the upper half of a 32 bit register you can do the following operations:
SEEEEEMM MMMMMMMM XXXXXXXX XXXXXXXX // input Sign, Exponent, Mantissa, X=garbage
Shift arithmetically to the right by 3 (this keeps the sign bit where it needs to be):
SSSSEEEE EMMMMMMM MMMXXXXX XXXXXXXX
Mask away excessive sign bits and garbage at the bottom (with 0b1000'11111'1111111111'0000000000000, i.e. 0x8fffe000)
S000EEEE EMMMMMMM MMM00000 00000000
This will be a valid single precision float but the exponent will be off by 112=127-15 (the difference between the biases), i.e. you need to multiply these values by 2**112 (this may be combined with any subsequent operation, you intend to do anyway later). Note that this will also convert sub-normal float16 values to the corresponding sub-normal float32 value (which are also off by a factor of 2**112).
Untested intrinsic version:
#include <immintrin.h>
#include <stdint.h>

__m256 gather_fp16(__fp16 const* fielddensity, __m256i indices){
    // subtract 2 bytes from base address to load data into high parts:
    int32_t const* base = (int32_t const*) (fielddensity - 1);
    // Gather 32bit values.
    // Be aware that this reads two bytes before each desired value,
    // i.e., make sure that reading fielddensity[-1] is ok!
    __m256i d = _mm256_i32gather_epi32(base, indices, 2);
    // shift exponent bits to the right place and mask away excessive bits:
    d = _mm256_and_si256(_mm256_srai_epi32(d, 3), _mm256_set1_epi32(0x8fffe000));
    // scale values to compensate bias difference (could be combined with subsequent operations ...)
    __m256 two112 = _mm256_castsi256_ps(_mm256_set1_epi32(0x77800000)); // 2**112
    __m256 f = _mm256_mul_ps(_mm256_castsi256_ps(d), two112);
    return f;
}
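To check the constants without AVX hardware, the same shift, mask and multiply steps can be applied to a single value. The following Rust function is a scalar sketch of that idea, not a replacement for the intrinsic version; `half_bits_to_f32` is a hypothetical name, and like the description above it only covers finite inputs.

```rust
/// Converts the bit pattern of a finite IEEE 754 half-precision value to f32
/// using a shift, a mask and one multiplication.
/// Infinities and NaNs are not handled.
fn half_bits_to_f32(half: u16) -> f32 {
    // Place the 16 half-precision bits in the upper half of a 32-bit word.
    let word = (half as u32) << 16;
    // An arithmetic shift right by 3 moves exponent and mantissa into the
    // f32 field positions and replicates the sign bit.
    let shifted = ((word as i32) >> 3) as u32;
    // Keep the sign bit (31) and bits 27..=13; clear the replicated sign bits.
    let masked = shifted & 0x8fff_e000;
    // Compensate the bias difference 127 - 15 = 112 by multiplying with 2^112.
    let two_pow_112 = f32::from_bits(0x7780_0000);
    f32::from_bits(masked) * two_pow_112
}
```

For example, the half-precision pattern 0x3C00 (1.0) maps to 1.0, and the smallest sub-normal 0x0001 maps to 2^-24.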

Strange pointer arithmetic

I came across some very strange behaviour of pointer arithmetic. I am developing a program to access an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like this (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
While printing individual byte, it is printing perfectly.
char *ch = buffer; /* char buffer[512]; */
for(i=0; i<12; i++)
debug("%x ", *ch++);
Here debug function sending output on UART.
However, pointer arithmetic, specifically adding a number which is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried optimization options -0 and -s, but there was no change.
I could think of little/big endian, but here I am getting unexpected data (from previous bytes) and no order reversal.
Your CPU architecture must support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using STM32, which is an ARM-based cortex).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what's the address of your buffer, but at least one of the three uint32_t read attempts that you describe in your question, requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
UPDATE:
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t x = *(uint8_t*)y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
typedef struct
{
    uint16_t a;  // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b;  // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  b2; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t c;  // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t d;  // offset 8, divisible by sizeof(uint64_t), which is 8
} myStruct_t;
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
SUMMARY:
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Exceptions:
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
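Rule #2 can be checked mechanically. The sketch below declares a Rust struct with the same field order and C layout rules (`#[repr(C)]`) and measures where each field ends up. The names `Example`, `b2`, `field_offsets` and `example_size` are chosen for this illustration only.

```rust
use std::mem::size_of;

/// Field order mirrors the C struct above; `b2` is a name chosen here
/// for the second one-byte field.
#[repr(C)]
#[derive(Default)]
struct Example {
    a: u16,
    b: u8,
    b2: u8,
    c: u32,
    d: u64,
}

/// Byte offsets of `a`, `b`, `b2`, `c` and `d` within `Example`.
fn field_offsets() -> [usize; 5] {
    let s = Example::default();
    let base = &s as *const Example as usize;
    [
        &s.a as *const u16 as usize - base,
        &s.b as *const u8 as usize - base,
        &s.b2 as *const u8 as usize - base,
        &s.c as *const u32 as usize - base,
        &s.d as *const u64 as usize - base,
    ]
}

/// Total size of `Example`, including any padding the compiler adds.
fn example_size() -> usize {
    size_of::<Example>()
}
```

Because every field already sits at a multiple of its own size, no padding is inserted: the offsets are 0, 2, 3, 4 and 8, and the struct is 16 bytes.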
LDR is the ARM instruction to load data. You have lied to the compiler that the pointer is a 32bit value. It is not aligned properly. You pay the price. Here is the LDR documentation,
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically your code is like,
uint32_t v = *(uint32_t*)((char*)buffer + 0); // aligned load of the first word
v = (v >> 16) | (v << 16);                    // rotate right by 16 bits
debug("%x ", v);                              // prints the rotated word
but has encoded to one instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.
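The quoted rule (load the aligned word, then rotate right by 8 times the low two address bits) can be modelled in a few lines. This Rust sketch assumes little-endian memory and uses a hypothetical helper name, `rotated_load`; it imitates the documented behaviour in software and is not a description of any particular core.

```rust
/// Models a word load from `addr` on a core that ignores the low two address
/// bits and rotates the aligned word right by 8 * (addr & 3) bits.
/// `memory` is little-endian and must contain the whole aligned word.
fn rotated_load(memory: &[u8], addr: usize) -> u32 {
    let aligned = addr & !3;
    let word = u32::from_le_bytes([
        memory[aligned],
        memory[aligned + 1],
        memory[aligned + 2],
        memory[aligned + 3],
    ]);
    word.rotate_right(8 * (addr as u32 & 3))
}
```

With the bytes 11 22 33 44 in memory, address 0 yields 0x44332211, while address 2 yields 0x22114433: the same four bytes in a rotated order, and nothing from the following word.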
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined, which means pretty much anything can happen.
In this case, the reason two of your examples behaved as you expected is because you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
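The same byte-copy idea can be written in Rust, where assembling the value from individual bytes never performs an unaligned word access. The helper name `read_u32_le` is an assumption of this sketch, and it fixes the byte order to little-endian, which matches the outputs shown in the question.

```rust
use std::convert::TryInto;

/// Reads a little-endian u32 starting at `offset`.
/// Returns None if fewer than four bytes are available there.
fn read_u32_le(buffer: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let bytes: [u8; 4] = buffer.get(offset..end)?.try_into().ok()?;
    Some(u32::from_le_bytes(bytes))
}
```

Applied to the twelve bytes from the question, offsets 0 and 4 give 0x34012afe and 0x35aa4521 as before, and offset 2 gives 0x45213401, the value an unaligned read of bytes 01 34 21 45 would be expected to produce.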

Stacked MPI derived data types in Fortran

MPI2 allows us to create derived data types and send them by writing
call mpi_type_create_indexed_block(size,1,dspl_send,rtype,DerType,ierr)
call mpi_send(data,1,DerType,jRank,20,comm,ierr)
By doing this, the entries of data(N) at the positions given in dspl_send are sent by the MPI library.
Now, for a matrix data(M,N) we can send its position via the following code:
call mpi_type_create_indexed_block(size,M,dspl_send,rtype,DerTypeM,ierr)
call mpi_send(data,1,DerTypeM,jRank,20,comm,ierr)
That is the entries data(i, dspl_send(j)) are sent.
My question concerns the role of the 1 in the subsequent mpi_send. Does it always have to be 1? Is another size possible? MPI derived data types are explained nicely in many documents on the internet, but the size in send/recv is always 1, without any mention of whether another size is allowed and how it could be used.
If we want to work with matrices data(M,N) with a size M that varies between calls, do we need to always create a derived data type whenever we call it? Is it impossible to use DerType for sending a matrix data(M,N) or data(N,M)?
Each MPI datatype has two properties: size and extent. The size is the actual number of bytes that the datatype represent while the extent is the number of bytes that the datatype covers in memory. Some datatypes are not contiguous, which means that their size might be less than their extent, e.g. (shown here in pseudocode)
MPI_TYPE_VECTOR(count = 1,
blocklength = 10,
stride = 20,
oldtype = MPI_INTEGER,
newtype = newtype)
creates a datatype that takes the first 10 (blocklength) elements from a total of 20 (stride). This datatype has a size of 10 times the size of MPI_INTEGER, which amounts to 40 bytes on most systems. Its extent is two times larger, or 80 bytes on most systems. If count was 2, then it would take 10 elements, then skip the next 10, then take another 10 elements and once again skip the next 10. Consequently its size and its extent would be twice as large.
When you specify a certain element count in any MPI routine, e.g. MPI_SEND, MPI does something like this:
It initialises the internal data buffer with the address of the source buffer argument.
It consults the datatype type map to decide how many bytes and from where to take and appends them to the message being constructed. The number of bytes added equals the size of the datatype.
It increments the internal data pointer by the extent of the datatype.
It decrements the internal count and if it is still non-zero, repeats the previous two steps.
One nifty feature of MPI is that the extent of the datatype is not required to match its size (as shown in the vector example), and one can even give the datatype whatever extent one wants using MPI_TYPE_CREATE_RESIZED. This allows for very complex data access patterns to be created. For example, using MPI_SCATTERV to scatter a matrix by blocks that do not span entire rows (C) or columns (Fortran) requires the use of such resized types.
Back to the vector example. Whether you create a vector type with count = 1 and then call MPI_SEND with count = 2 or you create a vector type with count = 2 and then call MPI_SEND with count = 1, the end result is the same. Often one constructs a datatype that fully describes the object that one wants to send. In this case one gives count = 1 in the call to MPI_SEND. But there are cases when it might be more beneficial to create a datatype that describes only a portion of the object, for example a single part, and then call MPI_SEND with count set to the number of parts that one wants to send. Sometimes it is a matter of personal preferences, sometimes it is a matter of algorithmic requirements.
As to your last question, Fortran stores matrices in column-major order, which means that data(i,j) is next to data(i±1,j) in memory and not to data(i,j±1). Consequently, data(M,N) consists of N consecutive column-vectors of M elements each. The distance between two elements, for example data(1,1) and data(1,2) depends on M. That's why you supply M in the type constructor. Matrices with different number of rows (e.g. different M) would not "fit" the type map of the created type and the wrong elements would be used to construct the message.
The description about extent in https://stackoverflow.com/a/13802243/7784768 is not entirely correct, as the extent does not take into account the padding in the end of datatype. MPI datatypes are defined by typemap:
typemap = ((type_0, disp_0 ), ..., (type_n−1, disp_n−1 ))
Extent is then defined according to
lb = min(disp_j)
ub = max(disp_j + sizeof(type_j)) + e
extent = ub - lb,
where e can be non-zero due to alignment requirements.
This means that in the example
MPI_TYPE_VECTOR(count = 1,
blocklength = 10,
stride = 20,
oldtype = MPI_INTEGER,
newtype = newtype)
with count=1, typemap is
((int, 0), (int, 4), ... (int, 36))
and extent is in most systems 40 and not 80 (i.e. stride has no effect for the typemap in this case). For count=2, typemap would be
((int, 0), (int, 4), ... (int, 36), (int, 80), (int, 84), ... (int, 116))
and extent 120 (40 bytes for the first block of 10 integers, 40 bytes for the stride, and 40 bytes for the second block of 10 integers, but the remaining stride is neglected in the extent). One can easily find out the extent with the MPI_Type_get_extent function.
Extent is quite tricky concept, and it is easy to make mistakes when trying to communicate multiple elements of derived datatype.
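The typemap definition can be turned into a small calculation. The Rust sketch below is a model of the formulas in this answer, not a call into any MPI library: `vector_displacements` and `size_and_extent` are hypothetical helpers, the element size of 4 bytes is the common size of an integer, and the alignment padding `e` is taken as zero.

```rust
/// Byte displacements selected by a vector type with `count` blocks of
/// `blocklength` elements, `stride` elements apart, each `elem_size` bytes.
fn vector_displacements(
    count: usize,
    blocklength: usize,
    stride: usize,
    elem_size: usize,
) -> Vec<usize> {
    let mut disps = Vec::new();
    for block in 0..count {
        for i in 0..blocklength {
            disps.push((block * stride + i) * elem_size);
        }
    }
    disps
}

/// (size, extent) in bytes for a typemap of equally sized elements,
/// with extent = ub - lb and the alignment padding `e` assumed to be zero.
fn size_and_extent(disps: &[usize], elem_size: usize) -> (usize, usize) {
    let lb = disps.iter().copied().min().unwrap_or(0);
    let ub = disps.iter().map(|d| d + elem_size).max().unwrap_or(0);
    (disps.len() * elem_size, ub - lb)
}
```

For blocklength 10 and stride 20 this gives size 40 and extent 40 with count = 1, and size 80 and extent 120 with count = 2, the figures stated above.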

How do I detect overflow while multiplying two 2's complement integers?

I want to multiply two numbers, and detect if there was an overflow. What is the simplest way to do that?
Multiplying two 32-bit numbers results in a 64-bit answer, two 8s give a 16, etc. Binary multiplication is simply shifting and adding. So if you had, say, two 32-bit operands with bit 17 set in operand A and any of the bits above 15 or 16 set in operand B, you will overflow a 32-bit result: bit 17 shifted left 16 is bit 33, which does not fit in 32 bits.
So the question again is what the sizes of your inputs and of your result are. If the result is the same size as the inputs, find the most significant 1 of both operands and add those bit locations; if that sum is bigger than your result space, you will overflow.
EDIT
Yes multiplying two 3 bit numbers will result in either a 5 bit number or 6 bit number if there is a carry in the add. Likewise a 2 bit and 5 bit can result in 6 or 7 bits, etc. If the reason for this question posters question is to see if you have space in your result variable for an answer then this solution will work and is relatively fast for most languages on most processors. It can be significantly faster on some and significantly slower on others. It is generically fast (depending on how it is implemented of course) to just look at the number of bits in the operands. Doubling the size of the largest operand is a safe bet if you can do it within your language or processor. Divides are downright expensive (slow) and most processors dont have one much less at an arbitrary doubling of operand sizes. The fastest of course is to drop to assembler do the multiply and look at the overflow bit (or compare one of the result registers with zero). If your processor cant do the multiply in hardware then it is going to be slow no matter what you do. I am guessing that asm is not the right answer to this post despite being by far the fastest and has the most accurate overflow status.
binary makes multiplication trivial compared to decimal, for example take the binary numbers
0b100 *
0b100
Just like decimal math in school you (can) start with the least significant bit on the lower operand and multiply it against all the locations in the upper operand, except with binary there are only two choices you multiply by zero meaning you dont have to add to the result, or you multiply by one which means you just shift and add, no actual multiplication is necessary like you would have in decimal.
000 : 0 * 100
000 : 0 * 100
100 : 1 * 100
Add up the columns and the answer is 0b10000
Same as decimal math, a 1 in the hundreds column means copy the top number and add two zeros; it works the same in any other base as well. So 0b100 times 0b110 is 0b1000 (a one in the second column over, so copy and add a zero) + 0b10000 (a one in the third column over, so copy and add two zeros) = 0b11000.
This leads to looking at the most significant bits in both numbers. 0b1xx * 0b1xx guarantees a 1xxxx is added to the answer, and that is the largest bit location in the add; no other single input to the final add has that column or a more significant column populated. From there you need only one more bit in case the other bits being added up cause a carry.
Which happens with the worst case all ones times all ones, 0b111 * 0b111
0b00111 +
0b01110 +
0b11100
This causes a carry bit in the addition resulting in 0b110001. 6 bits. a 3 bit operand times a 3 bit operand 3+3=6 6 bits worst case.
So size of the operands using the most significant bit (not the size of the registers holding the values) determines the worst case storage requirement.
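For unsigned 32-bit operands, the bit-counting rule can be sketched as follows. This is one possible reading of the rule, with hypothetical function names: when the operand widths add up to exactly one more than the result width, the product may or may not fit, so the sketch falls back to a division in that single case.

```rust
/// Number of bits needed to hold `x` (position of the most significant 1, plus one).
fn significant_bits(x: u32) -> u32 {
    32 - x.leading_zeros()
}

/// Reports whether `a * b` does not fit in 32 unsigned bits.
fn mul_overflows_u32(a: u32, b: u32) -> bool {
    let bits = significant_bits(a) + significant_bits(b);
    if bits <= 32 {
        false // product < 2^bits <= 2^32
    } else if bits >= 34 {
        true // product >= 2^(bits - 2) >= 2^32
    } else {
        // bits == 33: the product needs 32 or 33 bits, so divide to decide.
        // Both operands are non-zero here, because a zero contributes 0 bits.
        a > u32::MAX / b
    }
}
```

Only the borderline case pays for a division; every other pair of operands is classified from the bit counts alone.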
Well, that is true assuming positive operands. If you consider some of these numbers to be negative it changes things but not by much.
Minus 4 times 5, 0b1111...111100 * 0b0000....000101 = -20 or 0b1111..11101100
it takes 4 bits to represent a minus 4 and 4 bits to represent a positive 5 (dont forget your sign bit). Our result required 6 bits if you stripped off all the sign bits.
Lets look at the 4 bit corner cases
-8 * 7 = -56
0b1000 * 0b0111 = 0b1001000
-1 * 7 = -7 = 0b1001
-8 * -8 = 64 = 0b01000000
-1 * -1 = 1 = 0b01
-1 * -8 = 8 = 0b01000
7 * 7 = 49 = 0b0110001
Lets say we count positive numbers as the most significant 1 plus one and negative the most significant 0 plus one.
-8 * 7 is 4+4=8 bits, actual 7 bits
-1 * 7 is 1+4=5 bits, actual 4 bits
-8 * -8 is 4+4=8 bits, actual 8 bits
-1 * -1 is 1+1=2 bits, actual 2 bits
-1 * -8 is 1+4=5 bits, actual 5 bits
7 * 7 is 4+4=8 bits, actual 7 bits.
So this rule works as an upper bound in every case. Note that I called a minus one one bit: it has no zero bit, so "the most significant 0 plus one" gives one. Anyway, I argue that if this were a 4 bit * 4 bit machine as defined, you would have 4 bits of result at least and I interpret the question as how many more than 4 bits do I need to safely store the answer. So this rule serves to answer that question for 2s complement math.
If your question was to accurately determine overflow and then speed is secondary, then, well it is going to be really really slow for some systems, for every multiply you do. If this is the question you are asking, to get some of the speed back you need to tune it a little better for the language and/or processor. Double up the biggest operand, if you can, and check for non-zero bits above the result size, or use a divide and compare. If you cant double the operand sizes, divide and compare. Check for zero before the divide.
Actually your question doesnt specify what size of overflow you are talking about either. Good old 8086 16 bit times 16 bit gives a 32 bit result (hardware), it can never overflow. What about some of the ARMs that have a multiply, 32 bit times 32 bit, 32 bit result, easy to overflow. What is the size of your operands for this question, are they the same size or are they double the input size? Are you willing to perform multiplies that the hardware cannot do (without overflowing)? Are you writing a compiler library and trying to determine if you can feed the operands to the hardware for speed or if you have to perform the math without a hardware multiply. Which is the kind of thing you get if you cast up the operands, the compiler library will try to cast the operands back down before doing the multiply, depending on the compiler and its library of course. And it will use the count the bit trick determine to use the hardware multiply or a software one.
My goal here was to show how binary multiply works in a digestible form so you can see how much maximum storage you need by finding the location of a single bit in each operand. Now how fast you can find that bit in each operand is the trick. If you were looking for minimum storage requirements not maximum that is a different story because involves every single one of the significant bits in both operands not just one bit per operand, you have to do the multiply to determine minimum storage. If you dont care about maximum or minimum storage you have to just do the multiply and look for non zeros above your defined overflow limit or use a divide if you have the time or hardware.
Your tags imply you are not interested in floating point, floating point is a completely different beast, you cannot apply any of these fixed point rules to floating point, they DO NOT work.
Check if one is less than a maximum value divided by the other. (All values are taken as absolute).
2's complementness hardly has anything to do with it, since the multiplication overflows if x*(2^n - x) > 2^M, which is equal to (x*2^n - x^2) > 2^M, or x^2 < (x*2^n - 2^M), so you'll have to compare overflowing numbers anyway (x^2 may overflow, while the result may not).
If your number are not from the largest integral data type, then you might just cast them up, multiply and compare with the maximum of the number's original type. E.g. in Java, when multiplying two int, you can cast them to long and compare the result to Integer.MAX_VALUE or Integer.MIN_VALUE (depending on sign combination), before casting the result down to int.
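The cast-up approach looks like this in Rust for 32-bit operands. It is a sketch under the assumption that a type twice as wide exists (here i64), and `mul_i32_widened` is a hypothetical name; for real code Rust also offers `i32::checked_mul`, which answers the same question directly.

```rust
use std::convert::TryFrom;

/// Multiplies in 64 bits, where the product of two i32 values always fits,
/// then narrows. Returns None if the product is outside the i32 range.
fn mul_i32_widened(a: i32, b: i32) -> Option<i32> {
    let wide = i64::from(a) * i64::from(b);
    i32::try_from(wide).ok()
}
```

The narrowing conversion performs the comparison against both the minimum and the maximum at once, so no case split on signs is needed.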
If the type already is the largest, then check if one is less than the maximum value divided by the other. But do not take the absolute value! Instead you need separate comparison logic for each of the sign combinations neg*neg, pos*pos and pos*neg (neg*pos can obviously be reduced to pos*neg, and pos*pos might be reduced to neg*neg). First test for 0 arguments to allow safe divisions.
For actual code, see the Java source of MathUtils class of the commons-math 2, or ArithmeticUtils of commons-math 3. Look for public static long mulAndCheck(long a, long b). The case for positive a and b is
// check for positive overflow with positive a, positive b
if (a <= Long.MAX_VALUE / b) {
    ret = a * b;
} else {
    throw new ArithmeticException(msg);
}
I want to multiply two (2's complement) numbers, and detect if there was an overflow. What is the simplest way to do that?
Various languages do not specify valid checking for overflow after it occurs and so prior tests are required.
With some types, a wider integer type may not exist, so a general solution should limit itself to a single type.
The below (Ref) only requires compares and known limits to the integer range. It returns 1 if a product overflow will occur, else 0.
int is_undefined_mult1(int a, int b) {
    if (a > 0) {
        if (b > 0) {
            return a > INT_MAX / b; // a positive, b positive
        }
        return b < INT_MIN / a; // a positive, b not positive
    }
    if (b > 0) {
        return a < INT_MIN / b; // a not positive, b positive
    }
    return a != 0 && b < INT_MAX / a; // a not positive, b not positive
}
Is this the simplest way?
Perhaps, yet it is complete and handle all cases known to me - including rare non-2's complement.
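The same four-way comparison translates directly to Rust, which makes it easy to cross-check against the standard library's `checked_mul`. The function name `mul_would_overflow` is chosen for this sketch; the logic is a port of the C function above for i32.

```rust
/// Returns true if `a * b` would overflow i32, using only comparisons and
/// divisions that cannot overflow themselves.
fn mul_would_overflow(a: i32, b: i32) -> bool {
    if a > 0 {
        if b > 0 {
            return a > i32::MAX / b; // a positive, b positive
        }
        return b < i32::MIN / a; // a positive, b not positive
    }
    if b > 0 {
        return a < i32::MIN / b; // a not positive, b positive
    }
    // a not positive, b not positive; the a != 0 test guards the division.
    a != 0 && b < i32::MAX / a
}
```

Each division has a divisor that is known to be non-zero, and none of them is `i32::MIN / -1`, so the check itself cannot trap.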
Alternatives to Pavel Shved's solution ...
If your language of choice is assembler, then you should be able to check the overflow flag. If not, you could write a custom assembler routine that sets a variable if the overflow flag was set.
If this is not acceptable, you can find the most signficant set bit of both values (absolutes). If the sum exceeds the number of bits in the integer (or unsigned) then you will have an overflow if they are multiplied together.
Hope this helps.
In C, here's some maturely optimized code that handles the full range of corner cases:
int
would_mul_exceed_int(int a, int b) {
    int product_bits;
    if (a == 0 || b == 0 || a == 1 || b == 1) return (0); /* always okay */
    if (a == INT_MIN || b == INT_MIN) return (1); /* always underflow */
    a = ABS(a);
    b = ABS(b);
    product_bits = significant_bits_uint((unsigned)a);
    product_bits += significant_bits_uint((unsigned)b);
    if (product_bits == BITS(int)) { /* cases where the more expensive test is required */
        return (a > INT_MAX / b); /* remember that IDIV and similar are very slow (dozens - hundreds of cycles) compared to bit shifts, adds */
    }
    return (product_bits > BITS(int));
}
The benefit of the above approach is it doesn't require casting up to a larger type, so the approach could work on larger integer types.

How does Google Protocol Buffers compare to ASN.1

What are the most noticeable differences between Google Protocol Buffers and ASN.1 (with PER encoding)? For my project the most important issue is the size of the serialized data. Has anyone done any data-size comparisons between the two?
If you use ASN.1 with Unaligned PER, and define your data types using the appropriate constraints (e.g., specifying lower/upper bounds for integers, upper bounds for the length of lists, etc.), your encodings will be very compact. There will be no bits wasted for things like alignment or padding between the fields, and each field will be encoded in the minimum number of bits necessary to hold its permitted range of values. For example, a field of type INTEGER (1..8) will be encoded in 3 bits (1='000', 2='001', ..., 8='111'); and a CHOICE with four alternatives will occupy 2 bits (indicating the chosen alternative) plus the bits occupied by the chosen alternative. ASN.1 has many other interesting features that have been successfully used in many published standards. An example is the extension marker ("..."), which when applied to SEQUENCE, CHOICE, ENUMERATED, and other types, enables backward- and forward compatibility between endpoints implementing different versions of the specification.
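The bit counts quoted above follow from the number of permitted values. The Rust helper below is a rough sketch of that arithmetic for a constrained integer in Unaligned PER; `constrained_bits` is a hypothetical name, and the sketch ignores extension markers and every other encoding rule.

```rust
/// Minimum number of bits needed to distinguish all values in `lower..=upper`,
/// i.e. to encode `value - lower` as an unsigned number.
fn constrained_bits(lower: i64, upper: i64) -> u32 {
    assert!(lower <= upper, "empty range");
    // Number of permitted values minus one; always fits in u64.
    let span = upper.wrapping_sub(lower) as u64;
    64 - span.leading_zeros()
}
```

`constrained_bits(1, 8)` is 3, as in the INTEGER (1..8) example, and a CHOICE index over four alternatives, `constrained_bits(0, 3)`, is 2. A range with a single permitted value needs 0 bits.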
It's a long time since I've done any ASN.1 work, but the size is very likely to depend on the details of your types and actual data.
I would strongly recommend that you prototype both and put some real data in to compare.
If your protocol buffer would contain repeated primitive types, you should look at the latest source in Subversion for Protocol Buffers - they can be represented in a "packed" format now which is much more space-efficient. (My C# port has just caught up with this feature, some time last week.)
When the size of the packed/encoded message is important, you should also note that protobuf is not able to pack repeated fields that are not of a primitive numeric type.
This is an issue, e.g., if you have messages of this type (the comments define the actual range of values):
message P{
    required sint32 x = 1; // -0x1ffff to 0x20000
    required sint32 y = 2; // -0x1ffff to 0x20000
    required sint32 z = 3; // -0x319c to 0x3200
}
message Array{
    repeated P ps = 1;
    optional uint32 somemoredata = 2;
}
With an array length of, e.g., 32, you would end up with a packed message size of approximately 250 to 450 bytes with protobuf, depending on what values the array actually contains. This can even increase to over 1000 bytes if you use the full 32-bit range, or if you use int32 instead of sint32 and have negative values.
The raw data blob (assuming that z can be defined as int16 value) would only consume 320 bytes and thus the ASN.1 message is always smaller than 320 bytes since the max values are actually not 32bit but 19bit (x,y) and 15bit (z).
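The protobuf figures can be estimated from the wire format: each field costs a tag byte plus a varint, and sint32 values are ZigZag-mapped before being written as varints. The following Rust sketch computes the encoded size of one `P` element inside `repeated P ps = 1`. The function names are hypothetical, and the sketch assumes one-byte tags (field numbers below 16) and that the three required fields are always written.

```rust
/// ZigZag mapping used by sint32: 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
fn zigzag32(n: i32) -> u32 {
    ((n << 1) ^ (n >> 31)) as u32
}

/// Number of bytes a base-128 varint needs for `v`.
fn varint_len(mut v: u64) -> usize {
    let mut len = 1;
    while v >= 0x80 {
        v >>= 7;
        len += 1;
    }
    len
}

/// Encoded size of one `P` element inside `repeated P ps = 1`.
fn p_element_len(x: i32, y: i32, z: i32) -> usize {
    // Each sint32 field: 1 tag byte plus the ZigZag varint.
    let payload: usize = [x, y, z]
        .iter()
        .map(|&v| 1 + varint_len(u64::from(zigzag32(v))))
        .sum();
    // The element itself: 1 tag byte, a length prefix, then the payload.
    1 + varint_len(payload as u64) + payload
}
```

An all-zero element takes 8 bytes and an element at the top of the stated ranges takes 14 bytes, so 32 elements come to 256 and 448 bytes, in line with the 250 to 450 bytes quoted above. A negative int32 is written as a 10-byte varint, which is where the figure of over 1000 bytes comes from.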
The protobuf message size can be optimized with this message definition:
message Ps{
    repeated sint32 xs = 1 [packed=true];
    repeated sint32 ys = 2 [packed=true];
    repeated sint32 zs = 3 [packed=true];
}
message Array{
    required Ps ps = 1;
    optional uint32 somemoredata = 2;
}
which results in message sizes between approximately 100 byte (all values are zeros), 300 byte (values at range max), and 500 byte (all values are high 32bit values).
Protocol Buffers does not guarantee preservation of the order of fields in the binary encoding but ASN.1 does. It is not related to size so might not be the most noticeable in your use case but it is an important difference for comparison, for digital signatures, for simplified parsing, and possibly other applications.

Resources