Why do x64 projects use a default packing alignment of 16? - windows

If you compile the following code in a x64 project in VS2012 without any /Zp flags:
#pragma pack(show)
then the compiler will spit out:
value of pragma pack(show) == 16
If the project uses Win32 then, the compiler will spit out:
value of pragma pack(show) == 8
What I don't understand is that the largest natural alignment of any type (ie: long long and pointer) in Win64 is 8. So why not just make the default alignment 8 for x64?
Somewhat related to that, why would anyone ever use /Zp16?
EDIT:
Here's an example to show what I'm talking about. Even though pointers have a natural alignment of 8 bytes for x64, Zp1 can force them to a 1 byte boundary.
struct A
{
char a;
char* b;
}
// Zp16
// Offset of a == 0
// Offset of b == 8
// Zp1
// Offset of a == 0
// Offset of b == 1
Now if we take an example that uses SSE:
struct A
{
char a;
char* b;
__m128 c; // uses declspec(align(16)) in xmmintrinsic.h
}
// Zp16
// Offset of a == 0
// Offset of b == 8
// Offset of c == 16
// Zp1
// Offset of a == 0
// Offset of b == 1
// Offset of c == 16
If __m128 were truly a builtin type, then I'd expect the offset to be 9 with Zp1. But since it uses __declspec(align(16)) in its definition in xmmintrinsic.h, that trumps any Zp settings.
So here's my question worded a little differently: is there a type for 'c' that has a natural alignment of 16B but will have an offset of 9 in the previous example?

The MSDN page here includes the following relevant information about your question "why not make the default alignment 8 for x64?":
Writing applications that use the latest processor instructions introduces some new constraints and issues. In particular, many new instructions require that data must be aligned to 16-byte boundaries. Additionally, by aligning frequently used data to the cache line size of a specific processor, you improve cache performance. For example, if you define a structure whose size is less than 32 bytes, you may want to align it to 32 bytes to ensure that objects of that structure type are efficiently cached.

Why do x64 projects use a default packing alignment of 16?
On x64 the floating point is performed in the SSE unit. You state that the largest type has alignment 8. But that is not correct. Some of the SSE intrinsic types, for example __m128, have alignment of 16.

Related

What are the errors in these avx2 intrinsics, and how to use the raw assembler [duplicate]

This question already has answers here:
inlining failed in call to always_inline '__m256d _mm256_broadcast_sd(const double*)'
(1 answer)
The Effect of Architecture When Using SSE / AVX Intrinisics
(2 answers)
Closed 2 years ago.
The Intel reference manual is fairly opaque on the details of how instructions are used. There are no examples on each instruction page.
https://software.intel.com/sites/default/files/4f/5b/36945
It's possible we haven't found the right page, or the right manual yet. In any case, the syntax of gnu as might be different.
So I thought perhaps we could call intrinsics, step in and view the opcodes in gdb, and from that perhaps learn the opcodes in gnu as. we also searched for the opcodes for avx2 in gnu as but didn't find them documented.
Starting with the c++ code that isn't compiling:
#include <immintrin.h>
int main() {
__m256i a = _mm256_set_epi32(1, 2, 3, 4, 5, 6, 7, 8); // seems to work
double x = 1.0, y = 2.0, z = 3.0;
__m256d c,d;
__m256d b = _mm256_loadu_pd(&x);
// __m256d c = _mm256_loadu_pd(&y);
// __m256d d = _mm256_loadu_pd(&z);
d = _mm256_fmadd_pd(b, c, d); // c += a * b
_mm256_add_pd(b, c);
}
g++ -g
We would like to be able to load a vector register %ymm0 with a single value, which appears to be supported by the intrinsic: _mm256_loadu_pd.
Fused multiply-add and add are also there. All intrinsics except the first give errors.
/usr/lib/gcc/x86_64-linux-gnu/9/include/fmaintrin.h:47:1: error: inlining failed in call to always_inline ‘__m256d _mm256_fmadd_pd(__m256d, __m256d, __m256d)’: target specific option mismatch
47 | _mm256_fmadd_pd (__m256d __A, __m256d __B, __m256d __C)
| ^~~~~~~~~~~~~~~
Next, what are the syntax of the underlying assembler instructions? If you could point to a manual showing them that would be very helpful.

Understanding CL_DEVICE_MAX_WORK_GROUP_SIZE limit OpenCL?

I have little bit difficulty understanding max work group limit reported by OpenCL and how it affects the program.
So my program is reporting following thing,
CL_DEVICE_MAX_WORK_ITEM_SIZES : 1024, 1024, 1024
CL_DEVICE_MAX_WORK_GROUP_SIZE : 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
Now I am writing program to add vectors with 1 million entries.
So the calculation for globalSize and localSize for NDRange is as follows
int localSize = 64;
// Number of total work items - localSize must be devisor
globalSize = ceil(n/(float)localSize)*localSize;
.......
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
0, NULL, NULL);
Here as per my understanding OpenCL indirectly calculates the number of work groups it will launch.
For above example
globalSize = 15625 * 64 -> 1,000,000 -> So this is total number of threads that will be launched
localSize = 64 -> So each work group will have 64 work items
Hence from above we get
Total Work Groups Launched = globalSize/ localSize -> 15625 Work Groups
Here my confusion starts,
If you see value reported by OpenCL CL_DEVICE_MAX_WORK_GROUP_SIZE : 256 So, I was thinking this means max my device can launch 256 work groups in one dimension,
but above calculations showed that I am launching 15625 work groups.
So how is this thing working ?
I hope some one can clarify my confusion.
I am sure I am understanding something wrong.
Thanks in advance.
According to the specification of clEnqueueNDRangeKernel: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clEnqueueNDRangeKernel.html,
CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE indicate the limits of local size (CL_​KERNEL_​WORK_​GROUP_​SIZE is CL_DEVICE_MAX_WORK_GROUP_SIZE in OpenCL 1.2).
const int dimension = n;
const int localSizeDim[n] = { ... }; // Each element must be less than or equal to 'CL_DEVICE_MAX_WORK_ITEM_SIZES[i]'
const int localSize = localSizeDim[0] * localSizeDim[1] * ... * localSizeDim[n-1]; // The size must be less than or equal to 'CL_DEVICE_MAX_WORK_GROUP_SIZ'
I couldn't find the device limit of global work items, but maximum value representable by size t is the limit of global work items in the description of the error CL_​INVALID_​GLOBAL_​WORK_​SIZE.

CUDA profiler reports inefficient global memory access

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses. My kernel code is:
__global__ void update_particles_kernel
(
float4 *pos,
float4 *vel,
float4 *acc,
float dt,
int numParticles
)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
int offset = 0;
while(index + offset < numParticles)
{
vel[index + offset].x += dt*acc[index + offset].x; // line 247
vel[index + offset].y += dt*acc[index + offset].y;
vel[index + offset].z += dt*acc[index + offset].z;
pos[index + offset].x += dt*vel[index + offset].x; // line 251
pos[index + offset].y += dt*vel[index + offset].y;
pos[index + offset].z += dt*vel[index + offset].z;
offset += blockDim.x * gridDim.x;
}
In particular the profiler reports the following:
From the CUDA best practices guide it says:
"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which as 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).
For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well."
Now in my kernel based on this information I would expect that 16 accesses would be required to service a 32 thread warp because float4 is 16 bytes and on my card (770m compute capability 3.0) reads from the L2 cache are performed in 32 bytes chunks (16 bytes * 32 threads / 32 bytes cache lines = 16 accesses). Indeed as you can see the profiler reports that I am doing 16 access. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247 and only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?
I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses.
To take one example, your float4 vel array is stored in memory like this:
0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
^ ^ ^ ^ ...
thread0 thread1 thread2 thread3
So when you do this:
vel[index + offset].x += ...; // line 247
you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)
There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.
The other typical solution is to load a float4 quantity per thread, when you need it, and store a float4 quantity per thread, when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:
//preceding code need not change
while(index + offset < numParticles)
{
float4 my_vel = vel[index + offset];
float4 my_acc = acc[index + offset];
my_vel.x += dt*my_acc.x;
my_vel.y += dt*my_acc.y;
my_vel.z += dt*my_acc.z;
vel[index + offset] = my_vel;
float4 my_pos = pos[index + offset];
my_pos.x += dt*my_vel.x;
my_pos.y += dt*my_vel.y;
my_pos.z += dt*my_vel.z;
pos[index + offset] = my_pos;
offset += blockDim.x * gridDim.x;
}
Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements, the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.
What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247
For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is 4 32-byte L2 cachelines. Two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).

getelementptr has -1 as the first index operand

I'm reading the IR of nginx generated by Clang. In function ngx_event_expire_timers, there are some getelementptr instructions with i64 -1 as first index operand. For example,
%handler = getelementptr inbounds %struct.ngx_rbtree_node_s, %struct.ngx_rbtree_node_s* %node.addr.0.i, i64 -1, i32 2
I know the first index operand will be used as an offset to the first operand. But what does a negative offset mean?
The GEP instruction is perfectly fine with negative indices.
In this case you have something like:
node arr[100];
node* ptr = arr[50];
if ( (ptr-1)->value == ptr->value)
// then ...
GEP with negative indices just calculate the offset to the base pointer into the other direction. There is nothing wrong with it.
Considering what is doing inside nginx source code, the semantics of the getelementptr instruction is interesting. It's the result of two lines of C source code:
ev = (ngx_event_t *) ((char *) node - offsetof(ngx_event_t, timer));
ev->handler(ev);
node is of type ngx_rbtree_node_t, which is a member of ev's type ngx_event_t. That is like:
struct ngx_event_t {
....
struct ngx_rbtree_node_t time;
....
};
struct ngx_event_t *ev;
struct ngx_rbtree_node_t *node;
timer is the name of struct ngx_event_t member where node should point to.
|<- ngx_rbtree_node_t ->|
|<- ngx_event_t ->|
------------------------------------------------------
| (some data) | "time" | (some data)
------------------------------------------------------
^ ^
ev node
The graph above shows the layout of an instance of ngx_event_t. The result of offsetof(ngx_event_t, time) is 40. That means the some data before time is of 40 bytes. And the size of ngx_rbtree_node_t is also 40 bytes, by coincidence. So the i64 -1 in the first index oprand of getelementptr instruction computes the base address of the ngx_event_t containing node, which is 40 bytes ahead of node.
handler is another member of ngx_event_t, which is 16 bytes behind the base of ngx_event_t. By (another) coincidence, the third member of ngx_rbtree_node_t is also 16 bytes behind the base address of ngx_rbtree_node_t. So the i32 2 in getelementptr instruction will add 16 bytes to ev, to get the address of handler.
Note that the 16 bytes is computed from the layout of ngx_rbtree_node_t, but not ngx_event_t. Clang must have done some computations to ensure the correctness of the getelementptr instruction. Before use the value of %handler, there is a bitcast instruction which casts %handler to a function pointer type.
What Clang has done breaks the type transformation process defined in C source code. But the result is the exactly same.

How to calculate g values from LIS3DH sensor?

I am using LIS3DH sensor with ATmega128 to get the acceleration values to get motion. I went through the datasheet but it seemed inadequate so I decided to post it here. From other posts I am convinced that the sensor resolution is 12 bit instead of 16 bit. I need to know that when finding g value from the x-axis output register, do we calculate the two'2 complement of the register values only when the sign bit MSB of OUT_X_H (High bit register) is 1 or every time even when this bit is 0.
From my calculations I think that we calculate two's complement only when MSB of OUT_X_H register is 1.
But the datasheet says that we need to calculate two's complement of both OUT_X_L and OUT_X_H every time.
Could anyone enlighten me on this ?
Sample code
int main(void)
{
stdout = &uart_str;
UCSRB=0x18; // RXEN=1, TXEN=1
UCSRC=0x06; // no parit, 1-bit stop, 8-bit data
UBRRH=0;
UBRRL=71; // baud 9600
timer_init();
TWBR=216; // 400HZ
TWSR=0x03;
TWCR |= (1<<TWINT)|(1<<TWSTA)|(0<<TWSTO)|(1<<TWEN);//TWCR=0x04;
printf("\r\nLIS3D address: %x\r\n",twi_master_getchar(0x0F));
twi_master_putchar(0x23, 0b000100000);
printf("\r\nControl 4 register 0x23: %x", twi_master_getchar(0x23));
printf("\r\nStatus register %x", twi_master_getchar(0x27));
twi_master_putchar(0x20, 0x77);
DDRB=0xFF;
PORTB=0xFD;
SREG=0x80; //sei();
while(1)
{
process();
}
}
void process(void){
x_l = twi_master_getchar(0x28);
x_h = twi_master_getchar(0x29);
y_l = twi_master_getchar(0x2a);
y_h = twi_master_getchar(0x2b);
z_l = twi_master_getchar(0x2c);
z_h = twi_master_getchar(0x2d);
xvalue = (short int)(x_l+(x_h<<8));
yvalue = (short int)(y_l+(y_h<<8));
zvalue = (short int)(z_l+(z_h<<8));
printf("\r\nx_val: %ldg", x_val);
printf("\r\ny_val: %ldg", y_val);
printf("\r\nz_val: %ldg", z_val);
}
I wrote the CTRL_REG4 as 0x10(4g) but when I read them I got 0x20(8g). This seems bit bizarre.
Do not compute the 2s complement. That has the effect of making the result the negative of what it was.
Instead, the datasheet tells us the result is already a signed value. That is, 0 is not the lowest value; it is in the middle of the scale. (0xffff is just a little less than zero, not the highest value.)
Also, the result is always 16-bit, but the result is not meant to be taken to be that accurate. You can set a control register value to to generate more accurate values at the expense of current consumption, but it is still not guaranteed to be accurate to the last bit.
the datasheet does not say (at least the register description in chapter 8.2) you have to calculate the 2' complement but stated that the contents of the 2 registers is in 2's complement.
so all you have to do is receive the two bytes and cast it to an int16_t to get the signed raw value.
uint8_t xl = 0x00;
uint8_t xh = 0xFC;
int16_t x = (int16_t)((((uint16)xh) << 8) | xl);
or
uint8_t xa[2] {0x00, 0xFC}; // little endian: lower byte to lower address
int16_t x = *((int16*)xa);
(hope i did not mixed something up with this)
I have another approach, which may be easier to implement as the compiler will do all of the work for you. The compiler will probably do it most efficiently and with no bugs too.
Read the raw data into the raw field in:
typedef union
{
struct
{
// in low power - 8 significant bits, left justified
int16 reserved : 8;
int16 value : 8;
} lowPower;
struct
{
// in normal power - 10 significant bits, left justified
int16 reserved : 6;
int16 value : 10;
} normalPower;
struct
{
// in high resolution - 12 significant bits, left justified
int16 reserved : 4;
int16 value : 12;
} highPower;
// the raw data as read from registers H and L
uint16 raw;
} LIS3DH_RAW_CONVERTER_T;
than use the value needed according to the power mode you are using.
Note: In this example, bit fields structs are BIG ENDIANS.
Check if you need to reverse the order of 'value' and 'reserved'.
The LISxDH sensors are 2's complement, left-justified. They can be set to 12-bit, 10-bit, or 8-bit resolution. This is read from the sensor as two 8-bit values (LSB, MSB) that need to be assembled together.
If you set the resolution to 8-bit, just can just cast LSB to int8, which is the likely your processor's representation of 2's complement (8bit). Likewise, if it were possible to set the sensor to 16-bit resolution, you could just cast that to an int16.
However, if the value is 10-bit left justified, the sign bit is in the wrong place for an int16. Here is how you convert it to int16 (16-bit 2's complement).
1.Read LSB, MSB from the sensor:
[MMMM MMMM] [LL00 0000]
[1001 0101] [1100 0000] //example = [0x95] [0xC0] (note that the LSB comes before MSB on the sensor)
2.Assemble the bytes, keeping in mind the LSB is left-justified.
//---As an example....
uint8_t byteMSB = 0x95; //[1001 0101]
uint8_t byteLSB = 0xC0; //[1100 0000]
//---Cast to U16 to make room, then combine the bytes---
assembledValue = ( (uint16_t)(byteMSB) << UINT8_LEN ) | (uint16_t)byteLSB;
/*[MMMM MMMM LL00 0000]
[1001 0101 1100 0000] = 0x95C0 */
//---Shift to right justify---
assembledValue >>= (INT16_LEN-numBits);
/*[0000 00MM MMMM MMLL]
[0000 0010 0101 0111] = 0x0257 */
3.Convert from 10-bit 2's complement (now right-justified) to an int16 (which is just 16-bit 2's complement on most platforms).
Approach #1: If the sign bit (in our example, the tenth bit) = 0, then just cast it to int16 (since positive numbers are represented the same in 10-bit 2's complement and 16-bit 2's complement).
If the sign bit = 1, then invert the bits (keeping just the 10bits), add 1 to the result, then multiply by -1 (as per the definition of 2's complement).
convertedValueI16 = ~assembledValue; //invert bits
convertedValueI16 &= ( 0xFFFF>>(16-numBits) ); //but keep just the 10-bits
convertedValueI16 += 1; //add 1
convertedValueI16 *=-1; //multiply by -1
/*Note that the last two lines could be replaced by convertedValueI16 = ~convertedValueI16;*/
//result = -425 = 0xFE57 = [1111 1110 0101 0111]
Approach#2: Zero the sign bit (10th bit) and subtract out half the range 1<<9
//----Zero the sign bit (tenth bit)----
convertedValueI16 = (int16_t)( assembledValue^( 0x0001<<(numBits-1) ) );
/*Result = 87 = 0x57 [0000 0000 0101 0111]*/
//----Subtract out half the range----
convertedValueI16 -= ( (int16_t)(1)<<(numBits-1) );
[0000 0000 0101 0111]
-[0000 0010 0000 0000]
= [1111 1110 0101 0111];
/*Result = 87 - 512 = -425 = 0xFE57
Link to script to try out (not optimized): http://tpcg.io/NHmBRR

Resources