Intel compilers cannot handle absolute value of small number - gcc

I am seeing some very weird rounding errors when compiling my code with Intel 2018 compared to gcc 7.2.0. I am simply trying to take the absolute value of an extremely small number:
#include <cfloat>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
    double numa = -1.3654159537789158e-08;
    double numb = -7.0949094162313382e-08;
    if (isnan(numa))
        printf("numa is nan \n");
    if (isnan(numb))
        printf("numb is nan \n");
    printf("abs(numa) %.17g \n", abs(numa));
    printf("abs(numb) %.17g \n", abs(numb));
    if ((isnan(numa) || (abs(numa) < DBL_EPSILON)) || (isnan(numb) || (abs(numb) < DBL_EPSILON))) {
        printf("x %.17g y %.17g DBL_E %.17g \n", numa, numb, DBL_EPSILON);
    }
    return 0;
}
Here is the output when compiling the code with gcc 7.2.0, which is expected:
$ ./a.out
abs(numa) 1.3654159537789158e-08
abs(numb) 7.0949094162313382e-08
But it is a different story for intel/2018:
$ ./a.out
abs(numa) 2.0410903428666442e-314
abs(numb) 2.0410903428666442e-314
x -1.3654159537789158e-08 y -7.0949094162313382e-08 DBL_E 2.2204460492503131e-16
What could cause my version of the Intel compiler to produce such a huge difference?

Wrong function or wrong language
Output with "gcc 7.2.0" is as expected because OP compiled with C++
With "intel/2018" the output is consistent with a forced C compilation.
With C, the abs(numa) converts numa to an int with the value of 0 and the below is undefined behavior (UB) as "%.17g" expects a double and not an int.
// In C UB: vvvvv------vvvvvvvvv
printf("abs(numa) %.17g \n", abs(numa));
With the UB output of "abs(numa) 2.0410903428666442e-314", we can do some forensics.
A typical binary representation of 2.0410903428666442e-314 is
00000000 00000000 00000000 00000000 11110110 00111101 01001110 00101110
This is consistent with some C compilations that pass a 32-bit int 0, with printf() then retrieving that, along with some other following junk, as the expected double.
As UB, this result may vary from run to run, if it is output at all, yet it is a good indicator of the problem: compile as C++, or change to fabs() (as @dmuir suggests), which takes the absolute value of a double in both C++ and C.
Some kudos to the OP for using "%g" (or "%e") when debugging floating-point issues. It is far more informative than "%f".
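A minimal fix along those lines, which behaves the same whether compiled as C or C++ (a sketch of the fabs() approach):
#include <math.h>
#include <stdio.h>

int main(void) {
    double numa = -1.3654159537789158e-08;
    /* fabs() takes and returns double in both C and C++, so no
       silent conversion to int can happen here. */
    printf("fabs(numa) %.17g \n", fabs(numa));  /* 1.3654159537789158e-08 */
    return 0;
}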

Related

GCC AVX __m256i cast to int array leads to wrong values [duplicate]

I'm trying to learn to code using intrinsics, and below is code that does an addition:
compiler used: icc
#include <stdio.h>
#include <emmintrin.h>

int main()
{
    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(1, 2, 3, 4);
    __m128i c;
    c = _mm_add_epi32(a, b);
    printf("%d\n", c[2]);
    return 0;
}
I get the below error:
test.c(9): error: expression must have pointer-to-object type
printf("%d\n",c[2]);
How do I print the values in the variable c, which is of type __m128i?
Use this function to print them:
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>

void print128_num(__m128i var)
{
    uint16_t val[8];
    memcpy(val, &var, sizeof(val));
    printf("Numerical: %i %i %i %i %i %i %i %i \n",
           val[0], val[1], val[2], val[3], val[4], val[5],
           val[6], val[7]);
}
You split the 128 bits into 16-bit (or 32-bit) chunks before printing them.
Here is a way to split into two 64-bit halves and print them, if 64-bit support is available:
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <emmintrin.h>

void print128_num(__m128i var)
{
    int64_t v64val[2];
    memcpy(v64val, &var, sizeof(v64val));
    printf("%.16" PRIx64 " %.16" PRIx64 "\n", v64val[1], v64val[0]);
}
Note: casting &var directly to an int* or uint16_t* would also work on MSVC, but it violates strict aliasing and is undefined behaviour. Using memcpy is the standard-compliant way to do the same thing, and with minimal optimization the compiler will generate the exact same binary code.
Portable across gcc/clang/ICC/MSVC, C and C++.
Fully safe at all optimization levels: no strict-aliasing-violation UB.
Prints in hex as u8, u16, u32, or u64 elements (based on @AG1's answer).
Prints in memory order (least-significant element first, like _mm_setr_epiX). Reverse the array indices if you prefer printing in the same order Intel's manuals use, where the most significant element is on the left (like _mm_set_epiX). Related: Convention for displaying vector registers
Using a __m128i* to load from an array of int is safe because the __m128 types are defined to allow aliasing just like ISO C unsigned char*. (e.g. in gcc's headers, the definition includes __attribute__((may_alias)).)
The reverse isn't safe (pointing an int* onto part of a __m128i object). MSVC guarantees that's safe, but GCC/clang don't. (-fstrict-aliasing is on by default). It sometimes works with GCC/clang, but why risk it? It sometimes even interferes with optimization; see this Q&A. See also Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
See GCC AVX __m256i cast to int array leads to wrong values for a real-world example of GCC breaking code which points an int* at a __m256i.
(uint32_t*) &my_vector violates the C and C++ aliasing rules, and is not guaranteed to work the way you'd expect. Storing to a local array and then accessing it is guaranteed to be safe. It even optimizes away with most compilers, so you get movq / pextrq directly from xmm to integer registers instead of an actual store/reload, for example.
Source + asm output on the Godbolt compiler explorer: proof it compiles with MSVC and so on.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
void p128_hex_u8(__m128i in) {
    alignas(16) uint8_t v[16];
    _mm_store_si128((__m128i*)v, in);
    printf("v16_u8: %x %x %x %x | %x %x %x %x | %x %x %x %x | %x %x %x %x\n",
           v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7],
           v[8], v[9], v[10], v[11], v[12], v[13], v[14], v[15]);
}

void p128_hex_u16(__m128i in) {
    alignas(16) uint16_t v[8];
    _mm_store_si128((__m128i*)v, in);
    printf("v8_u16: %x %x %x %x, %x %x %x %x\n", v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]);
}

void p128_hex_u32(__m128i in) {
    alignas(16) uint32_t v[4];
    _mm_store_si128((__m128i*)v, in);
    printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}

void p128_hex_u64(__m128i in) {
    alignas(16) unsigned long long v[2]; // uint64_t might give format-string warnings with %llx; it's just long in some ABIs
    _mm_store_si128((__m128i*)v, in);
    printf("v2_u64: %llx %llx\n", v[0], v[1]);
}
If you need portability to C99 or C++03 or earlier (i.e. without C11 / C++11), remove the alignas() and use storeu instead of store. Or use __attribute__((aligned(16))) or __declspec( align(16) ) instead.
(If you're writing code with intrinsics, you should be using a recent compiler version. Newer compilers usually make better asm than older compilers, including for SSE/AVX intrinsics. But maybe you want to use gcc-6.3 with -std=gnu++03 C++03 mode for a codebase that isn't ready for C++11 or something.)
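For instance, a C99-compatible variant of the u32 printer along those lines might look like this (a sketch; the _c99 suffix is just an illustrative name):
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Same idea as p128_hex_u32 above, but with no alignas(): an unaligned
   store is always allowed, so plain automatic storage is fine in C99/C++03. */
void p128_hex_u32_c99(__m128i in) {
    uint32_t v[4];
    _mm_storeu_si128((__m128i*)v, in);
    printf("v4_u32: %x %x %x %x\n", v[0], v[1], v[2], v[3]);
}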
Sample output from calling all 4 functions on this vector:
// source used:
__m128i vec = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8,
                            9, 10, 11, 12, 13, 14, 15, 16);
// output:
v2_u64: 807060504030201 100f0e0d0c0b0a09
v4_u32: 4030201 8070605 c0b0a09 100f0e0d
v8_u16: 201 403 605 807, a09 c0b e0d 100f
v16_u8: 1 2 3 4 | 5 6 7 8 | 9 a b c | d e f 10
Adjust the format strings if you want to pad with leading zeros for consistent output width. See printf(3).
I know this question is tagged C, but it was also the best search result when looking for a C++ solution to the same problem.
So, this could be a C++ implementation:
#include <string>
#include <cstring>
#include <sstream>
#include <emmintrin.h>

#if defined(__SSE2__)
template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    T values[16 / sizeof(T)];
    std::memcpy(values, &var, sizeof(values)); // See discussion below
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) { // C++11: range-for also possible
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) { // C++11: range-for also possible
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}
#endif
Usage:
#include <iostream>
[..]
__m128i x;
[..]
std::cout << __m128i_toString<uint8_t>(x) << std::endl;
std::cout << __m128i_toString<uint16_t>(x) << std::endl;
std::cout << __m128i_toString<uint32_t>(x) << std::endl;
std::cout << __m128i_toString<uint64_t>(x) << std::endl;
Result:
141 114 0 0 0 0 0 0 151 104 0 0 0 0 0 0
29325 0 0 0 26775 0 0 0
29325 0 26775 0
29325 26775
Note: there is a simple way to avoid the if (sizeof(T) == 1) branch; see https://stackoverflow.com/a/28414758/2436175 and the sketch below.
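A sketch of that idea, assuming the trick behind the link is the usual unary-plus promotion (the + promotes narrow integer types to int before streaming, so uint8_t elements print as numbers rather than characters); the function name here is just illustrative:
#include <string>
#include <cstring>
#include <sstream>
#include <emmintrin.h>

// Hypothetical variant of __m128i_toString: unary + applies integral
// promotion, so uint8_t elements stream as numbers without a sizeof(T) branch.
template <typename T>
std::string __m128i_toString_noBranch(const __m128i var) {
    std::stringstream sstr;
    T values[16 / sizeof(T)];
    std::memcpy(values, &var, sizeof(values));
    for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
        sstr << +values[i] << " ";
    }
    return sstr.str();
}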
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

int main()
{
    __m128i a = _mm_set_epi32(1, 2, 3, 4);
    __m128i b = _mm_set_epi32(1, 2, 3, 4);
    __m128i c;
    const int32_t* q;
    // add a pointer
    c = _mm_add_epi32(a, b);
    q = (const int32_t*) &c;
    printf("%d\n", q[2]);
    //printf("%d\n",c[2]);
    return 0;
}
Try this code. (Note that pointing an int32_t* at &c happens to work here, but as discussed above it is not guaranteed by the strict-aliasing rules on GCC/clang; the memcpy or _mm_store approaches are the portable option.)

Send function pointer via MPI

Is it safe to pass function pointers via MPI as a way of telling another node to call a function? Someone may say that passing any kind of pointer via MPI is meaningless, but I wrote some code to verify it.
//test.cpp
#include <cstdio>
#include <iostream>
#include <mpi.h>
#include <cstring>
using namespace std;
int f1(int a){return a + 1;}
int f2(int a){return a + 2;}
int f3(int a){return a + 3;}
using F=int (*)(int);
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Status state;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //test
    char data[10];
    if( 0 == rank ){
        *(reinterpret_cast<F*>(data)) = &f2;
        for(int i = 1 ; i < size ; ++i)
            MPI_Send(data, 8, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    }else{
        MPI_Recv(data, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &state);
        F* fp = reinterpret_cast<F*>(data);
        int ans = (**fp)(10);
        cout << ans << endl;
    }
    MPI_Finalize();
    return 0;
}
Here is the output:
12
12
12
12
12
12
12
12
12
I ran it via MVAPICH, and it works well. But I just don't know why, since separate address spaces mean that the pointer value is USELESS in any process other than the one that generated it.
P.S. here is my hostfile
blade11:1
blade12:1
blade13:1
blade14:1
blade15:1
blade16:1
blade17:1
blade18:2
blade19:1
I ran mpiexec -n 10 -f hostfile ./test, and compiled it using C++11.
You are lucky in the sense that your cluster environment is homogeneous and no address space randomisation for ordinary executables is in place. As a consequence, all images are loaded at the same base address and laid out similarly in memory, hence functions have the same virtual addresses in all MPI ranks (note that this is rarely true for symbols from dynamically linked libraries as those are usually loaded at random addresses).
If you compile the source twice using different compilers or using the same compiler but with different compiler options, then have some ranks run the first executable and the rest run the second one, the program will definitely crash.
Try this:
$ mpicxx -std=c++11 -O0 -o test_O0 test.cpp
$ mpicxx -std=c++11 -O2 -o test_O2 test.cpp
$ mpiexec -f hostfile -n 5 ./test_O0 : -n 5 ./test_O2
12
12
12
12
<crash>
The different levels of optimisation result in function code of different size in test_O0 and test_O2. Consequently, f2 will no longer have the same virtual address in all ranks. The ranks that run the same executable as rank 0 will print 12, while the rest will segfault.
Is it safe to pass function pointers via MPI as a way of telling another node to call a function?
No, it is not. Address space is not shared among processes.
However, MPI processes which are the result of programs built from the same source can be organised to call a specific function when a certain message is received:
unsigned char data = 0;
MPI_Recv(&data, 1, MPI_UNSIGNED_CHAR, 0, 0, MPI_COMM_WORLD, &state);
if (data == 255) {
    f2(10); /* and so forth */
}
No.
However, there is a trick involving macros that maps a certain encoding of a function to a local function pointer/callback that can be recognised uniformly in all processes.
For example, this is used in HPX http://stellar.cct.lsu.edu/files/hpx_0.9.5/html/HPX_PLAIN_ACTION.html to run a function across inhomogeneous systems.
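A minimal sketch of that idea without macros, under the assumption that every rank is built from the same source: each process constructs the same local table of function pointers and only a small index is sent over MPI (the table name and message layout here are illustrative):
#include <mpi.h>
#include <iostream>

int f1(int a) { return a + 1; }
int f2(int a) { return a + 2; }
int f3(int a) { return a + 3; }

using F = int (*)(int);

// Every rank builds the same table from the same source, so an index
// is meaningful everywhere, unlike a raw pointer value.
static const F dispatch_table[] = { &f1, &f2, &f3 };

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int idx = 1;                         // index of f2 in the table
    if (rank == 0) {
        for (int i = 1; i < size; ++i)
            MPI_Send(&idx, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Status state;
        MPI_Recv(&idx, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &state);
        std::cout << dispatch_table[idx](10) << std::endl;   // prints 12
    }
    MPI_Finalize();
    return 0;
}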

How do the conversions between signed, unsigned and float types work?

The compiler I use is g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4.
I compile my programs with the following command:
g++ -std=c++11 -pedantic -Wall program.cpp
Program no. 1:
#include <iostream>
using namespace std;
int main() {
    unsigned int b;
    b = -54;
    cout << b << endl;
    return 0;
}
The program prints 4294967242, and this is the value I expected, because this is the case where we assign an out-of-range value to a variable of unsigned type, so the stored value is the original value reduced modulo 2^32 (4294967296 - 54 = 4294967242).
Program no. 2:
#include <iostream>
using namespace std;
int main() {
    unsigned int b;
    b = 54.1234;
    cout << b << endl;
    return 0;
}
The program prints 54, and this is also OK, because the stored value is the part before the decimal point; the fractional part is truncated.
Program no. 3:
#include <iostream>
using namespace std;
int main() {
    unsigned int b;
    b = -54.1234;
    cout << b << endl;
    return 0;
}
Here during compilation I get the warning "overflow in implicit constant conversion".
And the program prints 0. Why is that? I thought it would truncate the fractional part (as in program 2) and then store the result of the modulo reduction (as in program 1).
But if I write program no. 4:
#include <iostream>
using namespace std;
int main() {
    unsigned int b;
    float k = -54.1234;
    b = k;
    cout << b << endl;
    return 0;
}
then I get no warning, and I get the result I expected, 4294967242, which is the result of the modulo-2^32 reduction.
I would be grateful if somebody can explain it to me.
Why doesn't program no. 3 behave like program no. 4? Why don't I get a warning when compiling program no. 1, but I do get one when compiling program no. 3?
According to the standard (§[conv.fpint]):
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
So, your -54.1234 is truncated to -54. Since that can't be represented in an unsigned, you get undefined behavior.
When converting floating point numbers to integers, C and C++ round floating point numbers towards zero. The rounded result must then be representable in the destination type.
As a result, for 32 bit unsigned int the conversion is guaranteed to give the correct result if -1 < x < 2^32. For smaller numbers there are no guarantees. Since numbers between -1 and 0 must be rounded to zero, and numbers -1 and smaller have no requirements, it wouldn't be surprising if the compiler checks whether x < 0 and gives a result of 0 in that case. (The compiler might check whether x < 1 and give a result of 0; this handles very small positive numbers as well).
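If the wrap-around of program no. 1 is what you actually want, one portable way is to make the two steps explicit, since each step on its own is well defined (a minimal sketch):
#include <iostream>

int main() {
    double d = -54.1234;
    // Step 1: float-to-integer conversion truncates toward zero.
    // Well-defined here because -54 is representable in an int.
    int truncated = static_cast<int>(d);                 // -54
    // Step 2: signed-to-unsigned conversion is defined as reduction
    // modulo 2^N, so we get the "program no. 1" result.
    unsigned int b = static_cast<unsigned int>(truncated);
    std::cout << b << std::endl;                          // prints 4294967242
    return 0;
}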

A bug in GCC implementation of bit-fields

Working in C11, the following struct:
struct S {
    unsigned a : 4;
    _Bool b : 1;
};
Gets laid out by GCC as an unsigned (4 bytes) of which 4 bits are used, followed by a _Bool (4 bytes) of which 1 bit is used, for a total size of 8 bytes.
Note that C99 and C11 specifically permit _Bool as a bit-field member. The C11 standard (and probably C99 too) also states under §6.7.2.1 'Structure and union specifiers' ¶11 that:
An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit.
So I believe that the member b above should have been packed into the storage unit allocated for the member a, resulting in a struct of total size 4 bytes.
GCC behaves correctly and packing does occur when using the same types for the two members, or when one is unsigned and the other signed, but the types unsigned and _Bool seem to be considered too distinct by GCC for it to handle them correctly.
Can someone confirm my interpretation of the standard, and that this is indeed a GCC bug?
I'm also interested in a work-around (some compiler switch, pragma, __attribute__...).
I'm using gcc 4.7.0 with -std=c11 (although other settings show the same behavior.)
The described behavior is incompatible with the C99 and C11 standards, but is provided for binary compatibility with the MSVC compiler (which has unusual struct packing behavior).
Fortunately, it can be disabled either in the code with __attribute__((gcc_struct)) applied to the struct, or with the command-line switch -mno-ms-bitfields (see the documentation).
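For example, applied to the struct from the question (a sketch; on targets where the MS layout is not the default, GCC may simply ignore or warn about the attribute):
#include <stdio.h>

struct __attribute__((gcc_struct)) S {   /* opt back into GCC's own layout */
    unsigned a : 4;
    _Bool b : 1;
};

int main(void)
{
    /* With the GCC layout, b is packed into the same 4-byte unit as a. */
    printf("%zu\n", sizeof(struct S));   /* expected: 4 */
    return 0;
}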
Using both GCC 4.7.1 (home-built) and GCC 4.2.1 (LLVM/clang†) on Mac OS X 10.7.4 with a 64-bit compilation, this code yields 4 in -std=c99 mode:
#include <stdio.h>

int main(void)
{
    struct S
    {
        unsigned a : 4;
        _Bool b : 1;
    };
    printf("%zu\n", sizeof(struct S));
    return 0;
}
That's half the size you're reporting on Windows. It seems surprisingly large to me (I would expect a size of 1 byte), but the rules of the platform are what they are. Basically, the compiler is not obliged to follow the rules you'd like; it may follow the rules of the platform it is run on, and where it has the chance, it may even define the rules of the platform it is run on.
The following program has mildly dubious behaviour (because it accesses u.i after u.s was last written to), but shows that the field a is stored in the 4 least significant bits and the field b is stored in the next bit:
#include <stdio.h>

int main(void)
{
    union
    {
        struct S
        {
            unsigned a : 4;
            _Bool b : 1;
        } s;
        int i;
    } u;
    u.i = 0;
    u.s.a = 5;
    u.s.b = 1;
    printf("%zu\n", sizeof(struct S));
    printf("%zu\n", sizeof(u));
    printf("0x%08X\n", u.i);
    u.s.a = 0xC;
    u.s.b = 1;
    printf("0x%08X\n", u.i);
    return 0;
}
Output:
4
4
0x00000015
0x0000001C
† i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.9.00)

Char conversion in gcc

What are the implicit type-conversion rules for char? The following code gives an awkward output of -172.
char x = 200;
char y = 140;
printf("%d", x+y);
My guess is that, being signed, x is converted into 72 and y into 12, which should give 84 as the answer; however, that is not the case, as mentioned above. I am using gcc on Ubuntu.
The following code gives an awkward output of -172.
The behavior of such an out-of-range conversion is implementation-defined, but visibly in your case (and mine) a char has 8 bits and is represented in two's complement. So the binary representations of the unsigned char values 200 and 140 are 11001000 and 10001100, corresponding to the signed char values -56 and -116, and -56 + -116 equals -172 (the chars are promoted to int to do the addition).
Example forcing x and y to be signed regardless of the default signedness of char:
#include <stdio.h>
int main()
{
    signed char x = 200;
    signed char y = 140;
    printf("%d %d %d\n", x, y, x+y);
    return 0;
}
Compilation and execution:
pi@raspberrypi:/tmp $ gcc -Wall c.c
pi@raspberrypi:/tmp $ ./a.out
-56 -116 -172
pi@raspberrypi:/tmp $
My guess is that, being signed, x is converted into 72 and y into 12
You supposed that the higher bit is removed (11001000 -> 1001000 and 10001100 -> 0001100), but this is not the case, contrary to IEEE floats, which use a dedicated sign bit.
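To see that no bits are dropped, here is a small sketch (assuming the 8-bit, two's-complement char described above): reinterpreting the same bit patterns as unsigned char before the addition recovers 200 and 140, and both operands are promoted to int, so the sum is 340 rather than a truncated 84.
#include <stdio.h>

int main(void)
{
    char x = 200;   /* stored as -56, bit pattern 11001000 */
    char y = 140;   /* stored as -116, bit pattern 10001100 */
    /* The same bit patterns read as unsigned char are 200 and 140;
       integer promotion to int happens before each addition. */
    printf("%d\n", x + y);                                 /* -172 */
    printf("%d\n", (unsigned char)x + (unsigned char)y);   /* 340  */
    return 0;
}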

Resources