Working in C11, the following struct:
struct S {
unsigned a : 4;
_Bool b : 1;
};
gets laid out by GCC as an unsigned int (4 bytes) of which 4 bits are used, followed by a _Bool (4 bytes) of which 1 bit is used, for a total size of 8 bytes.
Note that C99 and C11 specifically permit _Bool as a bit-field member. The C11 standard (and probably C99 too) also states under §6.7.2.1 'Structure and union specifiers' ¶11 that:
An implementation may allocate any addressable storage unit large enough to hold a bit-field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit.
So I believe that the member b above should have been packed into the storage unit allocated for the member a, resulting in a struct of total size 4 bytes.
GCC behaves correctly and packing does occur when the two members have the same type, or when one is unsigned and the other signed, but the types unsigned and _Bool seem to be considered too distinct by GCC for it to handle them correctly (see the sketch below for a variant that does get packed).
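For comparison, a minimal sketch of the same-type variant that does get packed into a single 4-byte storage unit on this setup (struct T is just an illustrative name):
struct T {
    unsigned a : 4;
    unsigned b : 1;   /* same base type as a, so it shares a's storage unit */
};
/* sizeof(struct T) is 4 here, whereas the _Bool version above reports 8. */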
Can someone confirm my interpretation of the standard, and that this is indeed a GCC bug?
I'm also interested in a work-around (some compiler switch, pragma, __attribute__...).
I'm using gcc 4.7.0 with -std=c11 (although other settings show the same behavior).
The described behavior is incompatible with the C99 and C11 standards, but is provided for binary compatibility with the MSVC compiler (which has unusual struct packing behavior).
Fortunately, it can be disabled either in the code with __attribute__((gcc_struct)) applied to the struct, or with the command-line switch -mno-ms-bitfields (see the documentation).
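For illustration, a minimal sketch of the source-level workaround (the _Static_assert is C11 and merely documents the layout you expect; on targets where the MS layout is not the default, the attribute is redundant):
struct __attribute__((gcc_struct)) S {
    unsigned a : 4;
    _Bool b : 1;
};
/* With the GCC layout rules, b is packed into the storage unit already
   allocated for a. */
_Static_assert(sizeof(struct S) == 4, "b shares a's storage unit");
Alternatively, compiling the unmodified struct with -mno-ms-bitfields has the same effect for the whole translation unit.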
Using both GCC 4.7.1 (home-built) and GCC 4.2.1 (LLVM/clang†) on Mac OS X 10.7.4 with a 64-bit compilation, this code yields 4 in -std=c99 mode:
#include <stdio.h>
int main(void)
{
struct S
{
unsigned a : 4;
_Bool b : 1;
};
printf("%zu\n", sizeof(struct S));
return 0;
}
That's half the size you're reporting on Windows. It still seems surprisingly large to me (I would expect a size of 1 byte), but the rules of the platform are what they are. Basically, the compiler is not obliged to follow the rules you'd like; it may follow the rules of the platform it runs on, and where it has the chance, it may even define those rules itself.
The following program has mildly dubious behaviour (because it reads u.i after u.s was last written to), but it shows that field a is stored in the 4 least significant bits and field b is stored in the next bit up:
#include <stdio.h>
int main(void)
{
union
{
struct S
{
unsigned a : 4;
_Bool b : 1;
} s;
int i;
} u;
u.i = 0;
u.s.a = 5;
u.s.b = 1;
printf("%zu\n", sizeof(struct S));
printf("%zu\n", sizeof(u));
printf("0x%08X\n", u.i);
u.s.a = 0xC;
u.s.b = 1;
printf("0x%08X\n", u.i);
return 0;
}
Output:
4
4
0x00000015
0x0000001C
† i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.9.00)
Related
I am using C, and I want to have two versions of the same function, a scalar version and a vector version. The two functions have the same signature, and the compiler should pick the correct one depending on the context: in an autovectorized loop it should pick the vectorized function, otherwise the scalar version.
How can I achieve this so that it works in both GCC and Clang?
One of the suggestions is to use #pragma omp declare simd ..., but this by itself doesn't work for me because I need a different implementation for each version (the vectorized version is implemented using vector intrinsics).
Let's say we have a function int square(int num) { return num*num; } and we want an explicit, manually vectorized version.
We compile with the -fopenmp-simd, -mavx2 and -O3 flags, which enable the OpenMP SIMD directives, the AVX2 instruction set, and auto-vectorization, respectively.
In the first compilation unit we have something like this:
#pragma omp declare simd notinbranch
int square(int num);
...
#pragma omp simd
for (int i = 0; i < SIZE; i++) {
res[i] = square(values[i]);
}
The compiler knows there is a vectorized version of square in another compilation unit.
In the other compilation unit we define square together with its vector counterparts. The scalar version keeps the plain name square, whereas the vector versions use the name mangling described here:
#include <stdio.h>
#include <immintrin.h>

/* Scalar version. */
int square(int num) {
    printf("1");
    return num * num;
}

/* AVX2 variant: squares 8 ints per call. */
__m256i _ZGVdN8v_square(__m256i num) {
    printf("2");
    return _mm256_mullo_epi32(num, num);
}

/* AVX variant: squares 4 ints per call. */
__m128i _ZGVcN4v_square(__m128i num) {
    printf("3");
    return _mm_mullo_epi32(num, num);
}
The first function square is the scalar version, the second _ZGVdN8v_square is a version that processes 8 integers in one call, and the third version _ZGVcN4v_square processes 4 integers in one call.
For this example, name mangling goes like this:
_ZGV is the vector prefix
d is the AVX2 ISA, c is the AVX ISA
N is the unmasked version (corresponds to notinbranch in square declaration)
4 and 8 are vector lengths
v stands for a vector parameter
I tested it with GCC and it works.
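For reference, a complete first compilation unit might look like the sketch below (SIZE and the array names are illustrative); both files are compiled with -O3 -mavx2 -fopenmp-simd and linked together:
#include <stdio.h>

#define SIZE 1024

/* Declaration only: the definition and its vector variants live in the
   other compilation unit. */
#pragma omp declare simd notinbranch
int square(int num);

int values[SIZE], res[SIZE];

int main(void) {
    for (int i = 0; i < SIZE; i++)
        values[i] = i;

    /* With -fopenmp-simd and -mavx2 the compiler is free to call
       _ZGVdN8v_square (or _ZGVcN4v_square) inside this loop. */
    #pragma omp simd
    for (int i = 0; i < SIZE; i++)
        res[i] = square(values[i]);

    printf("\n%d %d\n", res[3], res[10]);
    return 0;
}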
Question
I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I have read the overhead is usually negligible, and the highest overhead I came across was about 6%, yet I measured around 30% overhead. Any advice would be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follows (times in seconds):
#iter C (-fPIC) C C/C(-fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double seconds = (double)(end - start) / CLOCKS_PER_SEC;
printf("Iter: %d, time: %lf \n", iterNumber[it], seconds);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop;
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments), so a few words of explanation. First I compiled the program only as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua - and saw big time differences, but couldn't understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C without -fPIC. So I have checked it in a few configurations over the last few days and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining (see the sketch below).
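A minimal sketch of that change, applied to one of the helpers from fractal.c (the others would be marked the same way; anything that must stay callable from outside, such as fractal itself, keeps external linkage):
/* fractal.c (sketch): a static function has internal linkage, cannot be
   interposed at run time, and therefore stays inlinable even with -fPIC. */
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}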
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
If you compile without -fPIC, on the other hand, the compiler can safely assume that a global symbol defined in the executable cannot be interposed, because the lookup scope begins with the executable itself and is only then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
As other people have already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias("__f")))) to refer to library functions from within the library, as in the sketch after this list; this again unties GCC's hands;
the previous suggestion can be automated with the -fno-semantic-interposition flag added in recent GCC versions.
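A sketch of the alias approach (library and function names here are illustrative, not from the question; it mirrors the hidden-alias pattern glibc uses internally):
/* mylib.c: square() stays public and interposable, but calls made inside
   the library go through a hidden alias that always binds locally, so GCC
   may inline them even when the file is compiled with -fPIC. */
double square(double x) { return x * x; }

extern __typeof(square) square_local
    __attribute__((alias("square"), visibility("hidden")));

double sum_of_squares(const double *v, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += square_local(v[i]);   /* local binding: no PLT, inlinable */
    return s;
}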
It's interesting to note that Clang differs from GCC here: it enables all of these optimizations by default regardless of -fPIC (this can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
In general, link time optimisations can lead to significant reductions in code size (~6%) (link to paper on link time optimisations in gold), and thus in run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic can be faster than -fPIC on some targets (it assumes a small global offset table), so if you must use -fPIC, try -fpic instead (link to gcc docs). For x86 it might not make a difference, but you need to check this for yourself/ask on gcc-help.
I am seeing some very weird results when compiling my code with Intel 2018 compared to gcc 7.2.0. I'm simply taking the absolute value of extremely small numbers:
#include <cfloat>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
double numa = -1.3654159537789158e-08;
double numb = -7.0949094162313382e-08;
if (isnan(numa))
printf("numa is nan \n");
if (isnan(numb))
printf("numb is nan \n");
printf("abs(numa) %.17g \n", abs(numa));
printf("abs(numb) %.17g \n", abs(numb));
if ((isnan(numa) || (abs(numa) < DBL_EPSILON)) || (isnan(numb) || (abs(numb) < DBL_EPSILON))) {
printf("x %.17g y %.17g DBL_E %.17g \n", numa, numb, DBL_EPSILON);
}
return 0;
}
Here is the output when compiling the code with gcc 7.2.0, which is expected:
$ ./a.out
abs(numa) 1.3654159537789158e-08
abs(numb) 7.0949094162313382e-08
But it is a different story for intel/2018:
$ ./a.out
abs(numa) 2.0410903428666442e-314
abs(numb) 2.0410903428666442e-314
x -1.3654159537789158e-08 y -7.0949094162313382e-08 DBL_E 2.2204460492503131e-16
What could cause my version of Intel compilers to have such a huge difference?
Wrong function or wrong language
The output with "gcc 7.2.0" is as expected because the OP compiled the code as C++.
With "intel/2018" the output is consistent with a forced C compilation.
In C, abs(numa) converts numa to an int with value 0, and the call below is undefined behavior (UB), since "%.17g" expects a double and not an int.
// In C UB: vvvvv------vvvvvvvvv
printf("abs(numa) %.17g \n", abs(numa));
With the UB output of "abs(numa) 2.0410903428666442e-314", we can do some forensics.
Typical 2.0410903428666442e-314 in binary is
00000000 00000000 00000000 00000000 11110110 00111101 01001110 00101110
This is consistent with a C compilation that passes a 32-bit int 0, with printf() then retrieving it, together with some adjacent junk, as the expected double.
Being UB, this result may vary from run to run, if anything is printed at all, yet it is a good indicator of the problem: compile as C++, or change to fabs() (per @dmuir) to take the absolute value of a double in both C++ and C.
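A minimal C sketch of the fabs() fix, using the same numbers as above:
#include <math.h>
#include <stdio.h>

int main(void) {
    double numa = -1.3654159537789158e-08;
    double numb = -7.0949094162313382e-08;
    /* fabs() takes and returns double in both C and C++, so nothing is
       silently truncated to int on the way into printf(). */
    printf("fabs(numa) %.17g \n", fabs(numa));
    printf("fabs(numb) %.17g \n", fabs(numb));
    return 0;
}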
Some kudos to the OP for using "%g" (or "%e") when debugging floating-point issues; it is far more informative than "%f".
In the IEEE 754 standard, the minimum strictly positive (subnormal) value for the quadruple-precision floating-point format is 2^-16493 ≈ 10^-4965. Why does GCC reject anything lower than 10^-4949? I'm looking for an explanation of the different things that could be going on underneath which determine the limit to be 10^-4949 rather than 10^-4965.
#include <stdio.h>
void prt_ldbl(long double decker) {
unsigned char * desmond = (unsigned char *) & decker;
int i;
for (i = 0; i < sizeof (decker); i++) {
printf ("%02X ", desmond[i]);
}
printf ("\n");
}
int main()
{
long double x = 1e-4955L;
prt_ldbl(x);
}
I'm using GNU GCC version 4.8.1 online - not sure which architecture it's running on (which I realize may be the culprit). Please feel free to post your findings from different architectures.
Your long double type may not be(*) quadruple-precision. It may simply be the 387 80-bit extended-double format. This format has the same number of exponent bits as quad-precision, but many fewer significand bits, so the minimum value representable in it sounds about right (2^-16445).
(*) Your long double is likely not to be quad-precision, because no processor implements quad-precision in hardware. The compiler can always implement quad-precision in software, but it is much more likely to map long double to double-precision, to extended-double or to double-double.
The smallest 80-bit long double is around 2^(-16382-63) ≈ 10^-4951, not 2^-16493. So the compiler is entirely correct; your number is smaller than the smallest subnormal.
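A quick way to check what long double actually is on a given target is to print its <float.h> characteristics (a sketch; LDBL_TRUE_MIN needs C11):
#include <float.h>
#include <stdio.h>

int main(void) {
    /* For the x87 80-bit extended format: LDBL_MANT_DIG is 64 and the
       smallest subnormal is 2^-16445, i.e. roughly 3.6e-4951. */
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
    printf("LDBL_MIN_EXP  = %d\n", LDBL_MIN_EXP);
    printf("LDBL_MIN      = %Lg\n", LDBL_MIN);
#ifdef LDBL_TRUE_MIN
    printf("LDBL_TRUE_MIN = %Lg\n", LDBL_TRUE_MIN);
#endif
    return 0;
}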
I'm developing an accelerated component in OpenCL, using Xcode 4.5.1 and Grand Central Dispatch, guided by this tutorial.
The full kernel kept failing on the GPU, giving signal SIGABRT. I couldn't make much progress interpreting the error beyond that.
But I broke out aspects of the kernel to test, and I found something very peculiar involving assigning certain values to positions in an array within a loop.
Test scenario: give each thread a fixed range of array indices to initialize.
kernel void zero(size_t num_buckets, size_t positions_per_bucket, global int* array) {
size_t bucket_index = get_global_id(0);
if (bucket_index >= num_buckets) return;
for (size_t i = 0; i < positions_per_bucket; i++)
array[bucket_index * positions_per_bucket + i] = 0;
}
The above kernel fails. However, when I assign 1 instead of 0, the kernel succeeds (and my host code prints out the array of 1's). Based on a handful of tests on various integer values, I've only had problems with 0 and -1.
I've tried to outsmart the compiler with 1-1, (int) 0, etc, with no success. Passing zero in as a kernel argument worked though.
The assignment to zero does work outside of the context of a for loop:
array[bucket_index * positions_per_bucket] = 0;
The findings above were confirmed on two machines with different configurations. (OSX 10.7 + GeForce, OSX 10.8 + Radeon.) Furthermore, the kernel had no trouble when running on CL_DEVICE_TYPE_CPU -- it's just on the GPU.
Clearly, something ridiculous is happening, and it must be on my end, because "zero" can't be broken. Hopefully it's something simple. Thank you for your help.
Host code:
#include <stdio.h>
#include <stdlib.h> /* for malloc() */
#include <OpenCL/OpenCL.h>
#include "zero.cl.h"
int main(int argc, const char* argv[]) {
dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
size_t num_buckets = 64;
size_t positions_per_bucket = 4;
cl_int* h_array = malloc(sizeof(cl_int) * num_buckets * positions_per_bucket);
cl_int* d_array = gcl_malloc(sizeof(cl_int) * num_buckets * positions_per_bucket, NULL, CL_MEM_WRITE_ONLY);
dispatch_sync(queue, ^{
cl_ndrange range = { 1, { 0 }, { num_buckets }, { 0 } };
zero_kernel(&range, num_buckets, positions_per_bucket, d_array);
gcl_memcpy(h_array, d_array, sizeof(cl_int) * num_buckets * positions_per_bucket);
});
for (size_t i = 0; i < num_buckets * positions_per_bucket; i++)
printf("%d ", h_array[i]);
printf("\n");
}
Refer to the OpenCL standard, section 6, paragraph 8 "Restrictions", bullet point k (emphasis mine):
6.8 k. Arguments to kernel functions in a program cannot be declared with the built-in scalar types bool, half, size_t, ptrdiff_t, intptr_t, and uintptr_t. [...]
The fact that your compiler even let you build the kernel at all indicates it is somewhat broken.
So you might want to fix that... but if that doesn't fix it, then it looks like a compiler bug, plain and simple (of CLC, that is, the OpenCL compiler, not your host code). There is no reason this kernel should work with every constant except 0 and -1. Did you try updating your OpenCL driver? What about trying a different operating system (though I suppose this code is OS X only)?
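A sketch of the kernel with the size_t parameters replaced by a supported type (uint in the kernel, cl_uint on the host side), otherwise unchanged:
kernel void zero(uint num_buckets, uint positions_per_bucket, global int* array) {
    size_t bucket_index = get_global_id(0);   /* size_t is fine inside the kernel */
    if (bucket_index >= num_buckets) return;
    for (uint i = 0; i < positions_per_bucket; i++)
        array[bucket_index * positions_per_bucket + i] = 0;
}
The host-side call would then pass cl_uint values for the two counts.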