This code always prints the same value for the double* pointers u and r. However, if I remove the __attribute__((aligned(16))) it works correctly. Is this usage of __attribute__((aligned(16))) invalid?
Edit: I'm using g++ 5.4.0 on Ubuntu 16.04 (32-bit).
#include <cstdlib>
#include <cstring>
#include <iostream>

int main(void) {
    double* __restrict __attribute__((aligned(16))) u = (double*)aligned_alloc(16, 1252*2502*sizeof(double));
    std::cerr << "u after alloc: " << u << std::endl;
    memset(u, 0, 1252*2502*sizeof(double));
    std::cerr << "u after memset: " << u << std::endl;
    double* __restrict __attribute__((aligned(16))) r = (double*)aligned_alloc(16, 1252*2502*sizeof(double));
    std::cerr << "r after alloc: " << r << std::endl;
    memset(r, 0, 1252*2502*sizeof(double));
    std::cerr << "r after memset:" << r << std::endl << std::endl;
}
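For comparison, here is a sketch using __builtin_assume_aligned on the allocation's result instead of an attribute on the pointer declaration (same allocation sizes as above; whether the two forms behave identically is part of the question):

#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
    // The alignment hint is attached to the value returned by aligned_alloc
    // rather than to the pointer variable itself.
    double* __restrict u = static_cast<double*>(
        __builtin_assume_aligned(aligned_alloc(16, 1252 * 2502 * sizeof(double)), 16));
    std::cerr << "u after alloc: " << u << std::endl;
    memset(u, 0, 1252 * 2502 * sizeof(double));
    std::cerr << "u after memset: " << u << std::endl;
}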
I know there is a mistake in my code because I didn't allocate any memory. But I'm curious to know why sizeof(struct node) shows 16 on my computer even though I haven't allocated any memory yet.
#include <stdio.h>
#include <stdlib.h>

struct node
{
    int data;
    struct node *next;
};

int main(int argc, char const *argv[])
{
    printf("%zu\n", sizeof(struct node));
    return 0;
}
I thought it would return a size of zero, but that didn't happen. Can you explain why sizeof(struct node) returns 16?
You don't say if you're working in C or C++, but sizeof semantics are similar in this case, regardless.
https://en.cppreference.com/w/cpp/language/sizeof is a good place to start.
sizeof(type) returns the size in bytes of the object representation of type.
It tells you how much memory you will need to allocate for one of those things. The information (the size of the type) is known at compile time, so there's no reason you can't get it without actually allocating memory.
And in fact if you were to allocate memory with malloc:
myNode = malloc(sizeof (struct node))
In that line of code, sizeof(struct node) is being calculated before memory is allocated. It's calculated at compile time, so that the code generated is essentially malloc(16).
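For illustration, here is a small sketch showing where the 16 comes from on a typical 64-bit target (a 4-byte int, then 4 bytes of padding so that the 8-byte pointer is 8-aligned); the exact layout is implementation-defined:

#include <stdio.h>
#include <stddef.h>

struct node
{
    int data;          /* typically 4 bytes */
    struct node *next; /* typically 8 bytes on a 64-bit target, 8-aligned */
};

int main(void)
{
    /* On a typical 64-bit ABI: 4 (data) + 4 (padding) + 8 (next) = 16. */
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    printf("offset of next = %zu\n", offsetof(struct node, next));
    return 0;
}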
I want to try and use OpenCL on Windows, with a CPU, something I've not done before (I've used OpenCL on Linux on NVIDIA cards).
So,
I visited Intel's website, downloaded http://registrationcenter-download.intel.com/akdlm/irc_nas/vcp/13794/opencl_runtime_18.1_x64_setup.msi, and installed it.
I have Cygwin installed with g++ 7.4.0 and clang++ 8.0.1
What steps do I need to take in order to compile and run a program invoking some kind of "hello world" kernel, say:
__kernel void vectorAdd(
    __global float* result,
    __global const float* lhs,
    __global const float* rhs,
    int length)
{
    int pos = get_global_id(0);
    if (pos < length) {
        result[pos] = lhs[pos] + rhs[pos];
    }
}
My question: What steps do I need to take to realize that?
PS - If you suggest I download a non-Cygwin MS compiler, I'm willing to - as long as it's not a commercial for-pay product.
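For reference, here is a rough sketch of the host-side code that invoking such a kernel entails (plain OpenCL 1.2 C API; the problem size and the inlined kernel source are illustrative, and error handling is mostly omitted). The compile-and-link steps on Windows/Cygwin are exactly what I'm asking about:

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    // Kernel source inlined for brevity; loading it from a .cl file works too.
    const char* src =
        "__kernel void vectorAdd(__global float* result,"
        "                        __global const float* lhs,"
        "                        __global const float* rhs,"
        "                        int length) {"
        "    int pos = get_global_id(0);"
        "    if (pos < length) { result[pos] = lhs[pos] + rhs[pos]; }"
        "}";

    const size_t n = 1024;  // illustrative problem size
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    cl_platform_id platform; cl_device_id device; cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vectorAdd", &err);

    cl_mem dResult = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    n * sizeof(float), NULL, &err);
    cl_mem dLhs = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), a.data(), &err);
    cl_mem dRhs = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 n * sizeof(float), b.data(), &err);

    cl_int length = (cl_int)n;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dResult);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dLhs);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &dRhs);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &length);

    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, dResult, CL_TRUE, 0, n * sizeof(float),
                        c.data(), 0, NULL, NULL);

    std::printf("c[0] = %f (expect 3.0)\n", c[0]);
    // Releases of the cl_* objects omitted for brevity.
    return 0;
}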
Problem
I am working on flash memory optimization for an STM32F051. It turns out that conversion between the float and int types consumes a lot of flash.
Digging into this, I found that the conversion to int takes around 200 bytes of flash memory, while the conversion to unsigned int takes around 1500 bytes!
It is known that int and unsigned int differ only in the interpretation of the sign bit, so this behavior is a great mystery to me.
Note: Performing the 2-stage conversion float -> int -> unsigned int also consumes only around 200 bytes.
Questions
Analyzing this, I have the following questions:
1) What is the mechanism of the conversion from float to unsigned int? Why does it take so much memory, while the conversion float -> int -> unsigned int takes so little? Is it connected with the IEEE 754 standard?
2) Are there any problems to be expected when the conversion float -> int -> unsigned int is used instead of a direct float -> unsigned int?
3) Are there any methods to wrap the float -> unsigned int conversion while keeping a low memory footprint?
Note: A similar question has already been asked here (Trying to understand how the casting/conversion is done by compiler, e.g., when cast from float to int), but there is still no clear answer, and my question is specifically about memory usage.
Technical data
Compiler: arm-none-eabi-gcc (gcc version 4.9.3 20141119 (release))
MCU: STM32F051
MCU core: 32-bit Arm Cortex-M0
Code example
float -> int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    i = f;
    return 0;
}
float -> unsigned int (~1500 bytes! of flash)
int main() {
    volatile float f;
    volatile unsigned int ui;
    ui = f;
    return 0;
}
float -> int -> unsigned int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    volatile unsigned int ui;
    i = f;
    ui = i;
    return 0;
}
There is no fundamental reason why the conversion from float to unsigned int should be larger than the conversion from float to signed int; in practice the float to unsigned int conversion can even be made smaller than the float to signed int conversion.
I did some investigation using the GNU Arm Embedded Toolchain (Version 7-2018-q2), and
as far as I can see the size problem is due to a flaw in the gcc runtime library. For some reason this library does not provide a specialized version of the __aeabi_f2uiz function for Armv6-M; instead it falls back on a much larger general version.
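If the values are known to fit in the int range, one possible workaround is a sketch in the spirit of the question's 2-stage conversion (hypothetical helper name; values outside the int range make the inner cast undefined):

/* Hypothetical helper: routes the conversion through int so that the small
 * __aeabi_f2iz path is used rather than the large generic __aeabi_f2uiz.
 * Only valid while the value fits in int; otherwise behavior is undefined. */
static inline unsigned int float_to_uint_via_int(float f)
{
    return (unsigned int)(int)f;
}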
I am in the process of implementing multithreading on an NVIDIA GeForce GT 650M GPU for a simulation I have created. To make sure everything works properly, I have created some side code as a test. At one point I need to update a vector of variables (they can all be updated separately).
Here is the gist of it:
__device__
int doComplexMath(float x, float y)
{
    return x + y;
}

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y, vector<complex<long double> > *z)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        z[i] = doComplexMath(*x, *y);
}

int main(void)
{
    int iGAMAf = 1 << 10;
    float *x, *y;
    vector<complex<long double> > VEL(iGAMAf, zero);
    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, sizeof(float));
    cudaMallocManaged(&y, sizeof(float));
    cudaMallocManaged(&VEL, iGAMAf * sizeof(vector<complex<long double> >));
    // initialize x and y on the host
    *x = 1.0f;
    *y = 2.0f;
    // Run kernel on 1M elements on the GPU
    int blockSize = 256;
    int numBlocks = (iGAMAf + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(iGAMAf, x, y, *VEL);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    return 0;
}
I am trying to allocate unified memory (memory accessible from the GPU and CPU). When compiling using nvcc, I get the following error:
error: no instance of overloaded function "cudaMallocManaged" matches the argument list
argument types are: (std::__1::vector<std::__1::complex<long double>, std::__1::allocator<std::__1::complex<long double>>> *, unsigned long)
How can I overload the function properly in CUDA to use this type with multithreading?
It isn't possible to do what you are trying to do.
To allocate a vector using managed memory you would have to write your own allocator type (one satisfying the Allocator requirements used through std::allocator_traits) that calls cudaMallocManaged under the hood. You can then instantiate a std::vector using your allocator class.
Also note that your CUDA kernel code is broken in that you can't use std::vector in device code.
Note that although the question has managed memory in view, this is applicable to other types of CUDA allocation such as pinned allocation.
As another alternative, suggested here, you could consider using a thrust host vector in lieu of std::vector and use a custom allocator with it. A worked example is here in the case of pinned allocator (cudaMallocHost/cudaHostAlloc).
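To make the allocator idea concrete, here is a minimal sketch of what such an allocator might look like (the class name is an illustrative assumption, not part of any CUDA or standard library API; error handling is reduced to throwing std::bad_alloc):

#include <cuda_runtime.h>
#include <cstddef>
#include <new>
#include <vector>

// Illustrative managed-memory allocator satisfying the Allocator requirements.
template <typename T>
struct managed_allocator {
    using value_type = T;

    managed_allocator() = default;
    template <typename U>
    managed_allocator(const managed_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (cudaMallocManaged(&p, n * sizeof(T)) != cudaSuccess)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { cudaFree(p); }
};

template <typename T, typename U>
bool operator==(const managed_allocator<T>&, const managed_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const managed_allocator<T>&, const managed_allocator<U>&) { return false; }

// Usage: the elements now live in managed memory, but the std::vector object
// itself and its member functions remain host-only, so a kernel should take
// a raw pointer (e.g. v.data()) rather than the vector.
// std::vector<float, managed_allocator<float>> v(1024);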
I am currently experimenting with the GCC vector extensions. However, I am wondering how to go about getting sqrt(vec) to work as expected.
As in:
typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
    return some_sqrt(in);
}
and, at least on a recent x86 system, have it emit the relevant instruction, sqrtpd. Is there a GCC builtin for sqrt that works on vector types, or does one need to drop down to the intrinsic level to accomplish this?
Looks like it's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54408. I don't know of any workaround other than doing it component-wise. The vector extensions were never meant to replace platform-specific intrinsics anyway.
Some funky code to this effect:
#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <utility>

template <::std::size_t...> struct indices { };

template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};

template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};

typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));

template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
    vec_type r;
    ::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};
    return r;
}

vec_type sqrt(vec_type const& v)
{
    return sqrt_(v, make_indices<4>());
}

int main()
{
    vec_type v = {1.f, 4.f, 9.f, 16.f};
    return sqrt(v)[0];
}
You could also try your luck with auto-vectorization, which is separate from the vector extension.
You can loop over the vectors directly
#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));
v2d myfunc(v2d in) {
    v2d out;
    for (int i = 0; i < 2; i++) out[i] = sqrt(in[i]);
    return out;
}
The sqrt function has to handle signed zero and NaN (which prevents straightforward vectorization), but if you avoid these cases with -Ofast, both Clang and GCC produce simply sqrtpd.
https://godbolt.org/g/aCuovX
GCC might have a bug: to get optimal code I had to loop to 4 even though there are only 2 elements.
But with AVX and AVX512, GCC and Clang generate ideal code:
AVX
https://godbolt.org/g/qdTxyp
AVX512
https://godbolt.org/g/MJP1n7
My reading of the question is that you want the square root of 4 packed double precision values... that's 32 bytes. Use the appropriate AVX intrinsic:
#include <x86intrin.h>

typedef double v4d __attribute__ ((vector_size (32)));

v4d myfunc (v4d v) {
    return _mm256_sqrt_pd(v);
}
x86-64 gcc 10.2 and x86-64 clang 10.0.1, using -O3 -march=skylake:
myfunc:
        vsqrtpd %ymm0, %ymm0   # (or just `ymm0` for Intel syntax)
        ret
ymm0 is the return value register.
That said, it just so happens there is a builtin, __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. I would definitely discourage its use, however.
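For completeness, a sketch of what using that builtin directly would look like (GCC-specific and needs -mavx; the intrinsic shown above remains the preferable route):

typedef double v4d __attribute__ ((vector_size (32)));

// Discouraged: __builtin_ia32_sqrtpd256 is a GCC-internal AVX builtin and is
// not portable across compilers, but it avoids including x86intrin.h.
v4d myfunc_builtin (v4d v) {
    return __builtin_ia32_sqrtpd256(v);
}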