I'm wondering if it's possible to hint to gcc that a pointer points to an aligned boundary. If I have a function:
void foo( void *pBuf ) {
    // Round pBuf up to the next 8-byte boundary:
    uint64_t *pAligned = (uint64_t *)(((uintptr_t)pBuf + 7) & ~(uintptr_t)0x7);
    uint64_t var = *pAligned; // I want this to be an aligned 64-bit access
}
And I know that pBuf is 64-bit aligned. Is there any way to tell gcc that pAligned points to a 64-bit boundary? If I do:
uint64_t *pAligned __attribute__((aligned(16)));
I believe that means that the address of the pointer itself is aligned, but it doesn't tell the compiler that what it points to is aligned, and therefore the compiler would likely emit an unaligned fetch here. This could slow things down if I'm looping through a large array.
There are several ways to inform GCC about alignment.
Firstly, you can attach the aligned attribute to the pointee rather than to the pointer:
int foo() {
    int __attribute__((aligned(16))) *p;
    return (unsigned long long)p & 3;
}
Or you can use the (relatively new) __builtin_assume_aligned builtin:
int bar(int *p) {
    int *pa = __builtin_assume_aligned(p, 16);
    return (unsigned long long)pa & 3;
}
Both variants optimize to return 0 due to alignment.
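Applied to the function from the question, a minimal sketch (assuming pBuf really is 8-byte aligned; the names are the question's own) could look like:

#include <stdint.h>

void foo( void *pBuf ) {
    // Promise GCC that pBuf is 8-byte aligned, so the load below
    // can be compiled as an aligned 64-bit access.
    uint64_t *pAligned = __builtin_assume_aligned(pBuf, 8);
    uint64_t var = *pAligned;
    (void)var; // silence the unused-variable warning
}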
Unfortunately the following does not seem to work:
typedef int __attribute__((aligned(16))) *aligned_ptr;
int baz(aligned_ptr p) {
    return (unsigned long long)p & 3;
}
and neither does this one:
typedef int aligned_int __attribute__((aligned (16)));
int braz(aligned_int *p) {
    return (unsigned long long)p & 3;
}
even though the docs suggest they should.
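As a workaround (my own sketch, not something the docs promise), you can funnel pointers through a small inline helper that applies the builtin instead of relying on an aligned typedef:

static inline int *assume_aligned16(int *p) {
    return (int *)__builtin_assume_aligned(p, 16);
}

int baz2(int *p) {
    return (unsigned long long)assume_aligned16(p) & 3; // optimizes to 0
}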
Related
Is it OK to use both parts of a union if you know the parts don't overlap? Like in this example, is it OK to use both buf[31] and ps?
#include <cstring>
#include <cstddef>

struct PtrSize {
    const char *data;
    size_t size;
};

class SmallStringOrNot {
    union {
        PtrSize ps;
        char buf[32];
    } pb;
public:
    bool IsSmallString() const {
        return pb.buf[31] != 0;
    }
    SmallStringOrNot(const char *str) {
        size_t len = strlen(str);
        if (len && len < 31) {
            memcpy(pb.buf, str, len);
            pb.buf[31] = len;
        } else {
            pb.ps.data = str;
            pb.ps.size = len;
            pb.buf[31] = 0; // is this OK, accessing buf right after ps?
        }
    }
    PtrSize AsPtrSize() const {
        if (IsSmallString()) {
            return PtrSize{pb.buf, (size_t)pb.buf[31]};
        } else {
            return pb.ps;
        }
    }
};
Unfortunately, the code is not OK. You are at least not in undefined-behaviour territory, since in C++ it is always legal to access a union through a char member, but you have no guarantee that by modifying buf[31] you are not altering ps.data or ps.size. On a 128-bit machine you almost surely would be.
On more common architectures your code should be fine, but for a 100% guarantee you should refer to the compiler documentation, since size_t could in principle be bigger than a void *. For example, even on a 64-bit machine you could theoretically have a 192-bit ps.size member (which, together with the 64 bits of the ps.data pointer, would make PtrSize completely overlap the buffer).
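If you do rely on the usual layout, a cheap guard (my suggestion, not part of the original code) is a compile-time check that PtrSize stays clear of the tag byte:

// Fail the build if PtrSize could overlap buf[31].
static_assert(sizeof(PtrSize) <= 31, "PtrSize overlaps the tag byte buf[31]");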
If TStruct is packed, then this code ends with Str.D == 0x00223344 (not 0x11223344). Why? (ARM GCC 4.7)
#include <string.h>

typedef struct {
    unsigned char B;
    unsigned int D;
} __attribute__ ((packed)) TStruct;

volatile TStruct Str;

int main( void ) {
    memset((void *)&Str, 0, sizeof(Str));
    Str.D = 0x11223344;
    if(Str.D != 0x11223344) {
        return 1;
    }
    return 0;
}
I guess your problem has nothing to do with unaligned access, but with the structure definition. int is not necessarily 32 bits long: according to the C standard, int is at least 16 bits long, and char is at least 8 bits long.
My guess is that your compiler lays out TStruct so it looks like this:
struct {
    unsigned char B : 8;
    unsigned int D : 24;
} ...;
When you assign 0x11223344 to Str.D, then according to the C standard the compiler must only make sure that at least 16 bits (0x3344) are written to Str.D. You didn't specify that Str.D is 32 bits long, only that it is at least 16 bits long.
Your compiler may also arrange the struct like this:
struct {
    unsigned char B : 16;
    unsigned int D : 16;
} ...;
B is at least 8 bits long, and D is at least 16 bits long, all OK.
Probably what you want to do is:
#include <stdint.h>

typedef struct {
    uint8_t B;
    uint32_t D;
} __attribute__((packed)) TStruct;
That way you can ensure that the 32-bit value 0x11223344 is properly written to Str.D. It is a good idea to use size-constrained types for packed structs.
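To verify the layout at compile time, one could add a C11 check (my addition, not from the original answer):

// A packed uint8_t + uint32_t pair should occupy exactly 5 bytes.
_Static_assert(sizeof(TStruct) == 5, "TStruct is not packed to 5 bytes");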
As for unaligned access of a member inside a struct, the compiler should take care of it. If the compiler knows the structure definition, then when you access Str.D it should handle any unaligned access and the necessary bit/byte operations.
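One caveat worth adding (my note): that protection only applies to accesses made through the struct. Taking the address of a packed member and dereferencing it separately hides the misalignment from the compiler:

unsigned int *p = (unsigned int *)&Str.D; // p is a misaligned pointer
unsigned int v = *p; // may fault or read wrong data on strict-alignment ARM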
How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the values for anything except 0. The code for cudaMemset looks like below, where value is set to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void  *devPtr,
                         int    value,
                         size_t count )
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr will be set to 5. If devPtr points to an array of integers, the result will be that each integer word has the value 84215045 (i.e. 0x05050505). This is probably not what you had in mind.
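To see why, here is a small host-side illustration of the same byte-fill (plain C, my own example):

#include <stdio.h>
#include <string.h>

int main(void) {
    int x;
    memset(&x, 5, sizeof(x)); // the same byte-fill cudaMemset performs
    printf("%d\n", x);        // prints 84215045, i.e. 0x05050505
    return 0;
}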
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway: using that API call results in a kernel launch which is not too different from what I posted above.
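For example, a launch for an int array might look like this (the sizes and the grid cap are illustrative, not from the original answer):

int *devPtr;
const size_t nwords = 1 << 20;
cudaMalloc((void **)&devPtr, nwords * sizeof(int));

// 256 threads per block; cap the grid and let the stride loop cover the rest.
int threads = 256;
int blocks = (int)((nwords + threads - 1) / threads);
if (blocks > 4096) blocks = 4096;
initKernel<int><<<blocks, threads>>>(devPtr, 5, nwords);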
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16-bit and 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
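A minimal driver-API sketch (assuming a context is already current; the names and sizes are illustrative):

#include <cuda.h>

CUdeviceptr devPtr;
const size_t nwords = 1 << 20;
cuMemAlloc(&devPtr, nwords * sizeof(unsigned int));
// Fill nwords 32-bit words with the value 5.
cuMemsetD32(devPtr, 5u, nwords);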
I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation and the counter-intuitive use of word counts.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this (BLOCK_SIZE is a macro you must define yourself, e.g. 256):

template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
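For instance (my illustration, with BLOCK_SIZE defined as suggested above and arbitrary sizes), you can fill a whole vector or touch only every 1024th element:

float *a;
const size_t n = 1024 * 1024;
cudaMalloc((void **)&a, n * sizeof(float));
deviceInitializeArray<float>(a, 1.0f, n, 1);    // set every element
deviceInitializeArray<float>(a, 0.0f, n, 1024); // set every 1024th element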
I ported a project from Visual C++ 6.0 to VS 2010 and found that a critical part of the code (the scripting engine) now runs about three times slower than before.
After some research I managed to extract a code fragment which seems to cause the slowdown. I minimized it as much as possible, so it will be easier to reproduce the problem.
The problem is reproduced when assigning a complex class (Variant) which contains another class (String) and a union of several other fields of simple types.
Playing with the example I discovered more "magic":
1. If I comment out one of the unused (!) class members, the speed increases, and the code finally runs faster than when compiled with VS 6.2.
2. The same is true if I remove the union wrapper.
3. The same is true even if I change the value of the field from 1 to 0.
I have no idea what the hell is going on.
I have checked all code generation and optimization switches, but without any success.
The code sample is below:
On my Intel 2.53 GHz CPU this test, compiled under VS 6.2, runs in 1.0 second.
Compiled under VS 2010: 40 seconds.
Compiled under VS 2010 with the "magic" lines commented out: 0.3 seconds.
The problem reproduces with any optimization switch, but "Whole Program Optimization" (/GL) must be disabled. Otherwise this too-smart optimizer will figure out that our test actually does nothing, and the test will run in 0 seconds.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

class String
{
public:
    char *ptr;
    int size;
    String() : ptr(NULL), size( 0 ) {};
    ~String() { if ( ptr != NULL ) free( ptr ); };
    String& operator=( const String& str2 );
};

String& String::operator=( const String& string2 )
{
    if ( string2.ptr != NULL )
    {
        // This part is never called in our test:
        ptr = (char *)realloc( ptr, string2.size + 1 );
        size = string2.size;
        memcpy( ptr, string2.ptr, size + 1 );
    }
    else if ( ptr != NULL )
    {
        // This part is never called in our test:
        free( ptr );
        ptr = NULL;
        size = 0;
    }
    return *this;
}

struct Date
{
    unsigned short year;
    unsigned char month;
    unsigned char day;
    unsigned char hour;
    unsigned char minute;
    unsigned char second;
    unsigned char dayOfWeek;
};

class Variant
{
public:
    int dataType;
    String valStr; // If we comment out this member, the speed is OK!

    // If we drop the 'union' wrapper, the speed is OK!
    union
    {
        __int64 valInteger;
        // If we comment out any of these fields, unused in our test, the speed is OK!
        double valReal;
        bool valBool;
        Date valDate;
        void *valObject;
    };

    Variant() : dataType( 0 ) {};
};

void TestSpeed()
{
    __int64 index;
    Variant tempVal, tempVal2;

    tempVal.dataType = 3;
    tempVal.valInteger = 1; // If we comment out this line, the speed is OK!

    for ( index = 0; index < 200000000; index++ )
    {
        tempVal2 = tempVal;
    }
}

int main(int argc, char* argv[])
{
    int ticks;
    char str[64];

    ticks = GetTickCount();
    TestSpeed();
    sprintf( str, "%.*f", 1, (double)( GetTickCount() - ticks ) / 1000 );
    MessageBox( NULL, str, "", 0 );
    return 0;
}
This was rather interesting. At first I was unable to reproduce the slowdown in a release build, only in a debug build. Then I turned off SSE2 optimizations and got the same ~40 s run time.
The problem seems to be in the compiler-generated copy assignment for Variant. Without SSE2 it actually does a floating-point copy with fld/fstp instructions, because the union contains a double. And with some specific bit patterns this apparently is a really expensive operation. The 64-bit integer value 1 maps to 4.940656458412e-324#DEN, which is a denormalized number, and I believe this causes the problems. When you leave tempVal.valInteger uninitialized it may contain a value that works faster.
I did a small test to confirm this:
union {
    uint64_t i;
    volatile double d1;
};

i = 0xcccccccccccccccc; // with this value the test takes 0.07 seconds
//i = 1;                // change to 1 and now the test takes 36 seconds

volatile double d2;
for(int i = 0; i < 200000000; ++i)
    d2 = d1;
So what you could do is define your own copy assignment for Variant that just does a simple memcpy of the union.
Variant& operator=(const Variant& rhs)
{
    dataType = rhs.dataType;

    union UnionType
    {
        __int64 valInteger;
        double valReal;
        bool valBool;
        Date valDate;
        void *valObject;
    };
    memcpy(&valInteger, &rhs.valInteger, sizeof(UnionType));

    valStr = rhs.valStr;
    return *this;
}
Is there an efficient way to do this (split a 64-bit value into its two 32-bit halves)?
That's something you could use a union for:
union {
    UINT64 ui64;
    struct {
        DWORD d0;
        DWORD d1;
    } y;
} un;

un.ui64 = 27;
// Use un.y.d0 and un.y.d1
An example (under Linux, so using different types):
#include <stdio.h>

union {
    long ui64;
    struct {
        int d0;
        int d1;
    } y;
} un;

int main (void) {
    un.ui64 = 27;
    printf ("%d %d\n", un.y.d0, un.y.d1);
    return 0;
}
This produces (on a little-endian machine):
27 0
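If you want the halves without depending on the platform's byte order, a shift-and-mask variant (my sketch, not part of the original answer) yields the same information portably:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t v = 27;
    uint32_t lo = (uint32_t)(v & 0xFFFFFFFFu);
    uint32_t hi = (uint32_t)(v >> 32);
    printf("%u %u\n", lo, hi);
    return 0;
}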
Thought I would provide an example using LARGE_INTEGER for the Windows platform.
If I have a variable called "value" that is 64 bits, then I do:
LARGE_INTEGER li;
li.QuadPart = value;
DWORD low = li.LowPart;
DWORD high = li.HighPart;
Yes, this copies it, but I like the readability of it.
Keep in mind that 64-bit integers have alignment restrictions at least as strict as those of 32-bit integers on all platforms. Therefore, it's perfectly safe to cast a pointer to a 64-bit integer to a pointer to a 32-bit integer.
ULONGLONG largeInt;
printf( "%u %u\n", ((DWORD *)&largeInt)[ 0 ], ((DWORD *)&largeInt)[ 1 ] );
Obviously, Pax's solution is a lot cleaner, but this is technically more efficient since it doesn't require any data copying.