I have to load data from a file; each sample is 20-dimensional.
So I used this data structure to hold a sample:
class DataType
{
    std::vector<float> d;
};
But when I use this type in the following variable definitions, it does not work.
thrust::host_vector<DataType> host_input;
// ... after initializing the host input ...
thrust::device_vector<DataType> device_input = host_input;
for(unsigned int i = 0; i < device_input.size(); i++)
    for(unsigned int j = 0; j < dim; j++)
        cout << device_input[i].d[j] << endl;
It does not work: the compiler told me that I cannot use the (host) vector inside device_input, because device_input lives on the device (GPU), while std::vector is implemented for the CPU.
So, what is the suitable way to give a correct definition of DataType?
std::vector requires host-side dynamic memory allocation, so it cannot be used on the device side.
This should work.
class DataType
{
public:
    float d[20];
};
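For reference, here is a minimal sketch of how the fixed-size version can be used with Thrust (the element count, sample values, and the dim constant are assumptions for illustration; this needs to be compiled as a .cu file with nvcc):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

const int dim = 20;

struct DataType
{
    float d[dim];   // plain array: no host-side heap allocation involved
};

int main()
{
    thrust::host_vector<DataType> host_input(2);
    for (int i = 0; i < (int)host_input.size(); i++)
        for (int j = 0; j < dim; j++)
            host_input[i].d[j] = float(i * dim + j);

    // Element-wise copy of POD structs to the device works.
    thrust::device_vector<DataType> device_input = host_input;

    // device_input[i] returns a device_reference; copy the element back
    // to the host before touching its members.
    for (unsigned int i = 0; i < device_input.size(); i++) {
        DataType sample = device_input[i];
        for (int j = 0; j < dim; j++)
            std::cout << sample.d[j] << " ";
        std::cout << std::endl;
    }
    return 0;
}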
Consider the following code:
void foo(float* __restrict__ a)
{
    int i; float val;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}

void bar(float* __restrict__ a)
{
    int i; float val = 0.0;
    for (i = 0; i < 100; i++) {
        a[i] = val;
        val += 2.0;
    }
}
They're based on Examples 7.26a and 7.26b in Agner Fog's Optimizing software in C++ and should do the same thing; bar is more "efficient" as written, in the sense that we don't do an integer-to-float conversion at every iteration, but rather a float addition, which is cheaper (on x86_64).
Here are the clang and gcc results for these two functions (with vectorization and unrolling disabled).
Question: It seems to me that the optimization of replacing a multiplication by the loop index with an addition of a constant value - when this is beneficial - should be carried out by compilers, even if (or perhaps especially if) there's a type conversion involved. Why is this not happening for these two functions?
Note that if we use ints rather than floats:
void foo(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        val = 2 * i;
        a[i] = val;
    }
}

void bar(int* __restrict__ a)
{
    int i; int val = 0;
    for (i = 0; i < 100; i++) {
        a[i] = val;
        val += 2;
    }
}
Both clang and gcc perform the expected optimization, albeit not quite in the same way (see this question).
You are asking for induction variable optimization to be enabled for floating point numbers. This optimization is generally unsafe in floating point land as it changes program semantics. In your example it will work because both the initial value (0.0) and the step (2.0) can be represented precisely in IEEE format, but this is a rare case in practice.
It could be enabled under -ffast-math, but it seems this wasn't considered an important case in GCC, which rejects non-integral induction variables early on (see tree-scalar-evolution.c).
If you believe this is an important use case, you might consider filing a request in GCC Bugzilla.
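To see why the transformation is unsafe in general, here is a small illustration (not from the original post) with a step that cannot be represented exactly; the multiplied form and the accumulated form diverge, which is why a compiler may only substitute one for the other under -ffast-math:

#include <cstdio>

int main()
{
    const float step = 0.1f;   // 0.1 has no exact binary representation
    float accumulated = 0.0f;  // what a strength-reduced (bar-style) loop keeps
    for (int i = 0; i < 100; i++)
        accumulated += step;

    const float multiplied = 100 * step;  // what the original (foo-style) loop computes
    // On a typical IEEE system the two values differ slightly
    // (roughly 10.000000 vs 10.000002), so the rewrite changes semantics.
    std::printf("multiplied = %.6f, accumulated = %.6f\n", multiplied, accumulated);
    return 0;
}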
I am new to OpenMP and I am stuck with a basic operation. Here is a code sample for my question.
#include <omp.h>

int main(void)
{
    int A[16] = {1,2,3,4,5 ...... 16};
    #pragma omp parallel for firstprivate(A)
    for(int i = 0; i < 4; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            A[i*4+j] = Process(A[i*4+j]);
        }
    }
}
As is evident, the value of A is local to each thread. However, at the end, I want to write the part of A calculated by each thread back to the corresponding position in the global variable A. How can this be accomplished?
Simply make A shared. This is fine, because all loop iterations operate on separate elements of A. Remember that OpenMP is shared memory programming.
You can do so explicitly by using shared instead of firstprivate, or simply remove the declaration:
int A[16] = {1,2,3,4,5 ...... 16};
#pragma omp parallel for
for(int i = 0; i < 4; i++)
By default, all variables declared outside of the parallel region are shared. You can find an extended example and description in this answer.
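For completeness, a minimal sketch of the full shared version (Process here is just a placeholder so the example compiles):

#include <omp.h>

int Process(int x) { return x + 1; }  /* placeholder for the real work */

int main(void)
{
    int A[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};

    /* A is shared by default; each thread writes its own disjoint block. */
    #pragma omp parallel for
    for(int i = 0; i < 4; i++)
    {
        for(int j = 0; j < 4; j++)
        {
            A[i*4+j] = Process(A[i*4+j]);
        }
    }
    return 0;
}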
When I try this, I get the wrong result in 'output' even though I am copying the values of the 'cum' array to output.
But if I rename the 'cum' array mentioned earlier in the code, I get the correct values. Therefore I am unable to reuse the result values.
The device has 8 cores with no shared memory.
Any and all comments/suggestions appreciated.
kernel void histogram(global unsigned int *input,
                      global unsigned int *output,
                      global unsigned int *frequency,
                      global unsigned int *cum,
                      unsigned int N)
{
    int pid = get_global_id(0);

    //cumulative sum
    for(int i = 0; i < 16; i++)
    {
        cum[(i*16)+(2*pid)+1] = frequency[(i*16)+(2*pid)] + frequency[(i*16)+(2*pid)+1];
    }
    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int i = 0; i < 32; i++)
    {
        output[(i*8)+pid] = cum[(i*8)+pid];
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
Make sure you understand parallel prefix sums. In particular, I don't see a downsweep step of the total sum or its parts:
Parallel Prefix Sum (Scan) with CUDA
I'd also look in the TI Keystone II SDK you're using (mentioned in OpenCL device memory read/write issue) to see if it has any scan or parallel prefix sum implementations or built-in functions.
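For reference, here is a plain sequential sketch (not TI-specific, array length assumed to be a power of two) of the upsweep/downsweep structure of a Blelloch exclusive scan, i.e. including the downsweep step the posted kernel is missing:

/* Sequential reference for an exclusive prefix sum; the parallel kernel
   would assign the inner loop iterations to work-items. */
#include <stdio.h>

void exclusive_scan(unsigned int *x, int n)
{
    /* upsweep (reduce): build partial sums in place */
    for (int d = 1; d < n; d *= 2)
        for (int i = 0; i + 2 * d - 1 < n; i += 2 * d)
            x[i + 2 * d - 1] += x[i + d - 1];

    x[n - 1] = 0;   /* clear the root before the downsweep */

    /* downsweep: push partial sums back down the tree */
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 0; i + 2 * d - 1 < n; i += 2 * d) {
            unsigned int t = x[i + d - 1];
            x[i + d - 1] = x[i + 2 * d - 1];
            x[i + 2 * d - 1] += t;
        }
}

int main(void)
{
    unsigned int x[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    exclusive_scan(x, 8);
    for (int i = 0; i < 8; i++) printf("%u ", x[i]);   /* 0 3 4 11 11 15 16 22 */
    printf("\n");
    return 0;
}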
I would like to know whether these two pieces of code are the same in terms of performance, with respect to the variable declaration:
int Value;
for(int i=0; i<1000; i++)
{
    Value = i;
}

or

for(int i=0; i<1000; i++)
{
    int Value = i;
}
Basically I need to know whether the time to create the variable Value and allocate it in RAM is spent just once in the first case, and whether or not it is repeated 1000 times in the second.
If you are programming in C++ or C#, there will be no runtime difference, since no implicit initialization is done for a simple int type.
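To illustrate the distinction, here is a small (hypothetical) comparison: for a plain int the placement makes no runtime difference, but for a type whose constructor does real work, declaring it inside the loop does repeat that work every iteration:

#include <string>

void plain_int()
{
    for (int i = 0; i < 1000; i++) {
        int Value = i;   // no initialization cost beyond the assignment;
        (void)Value;     // the compiler just reuses a register or stack slot
    }
}

void with_constructor()
{
    for (int i = 0; i < 1000; i++) {
        std::string Value("x");  // constructor and destructor run every iteration
        (void)Value;
    }
}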
Here is my question about OpenACC.
I have read the API specifications (v1 and v2), and the behavior of nested data environments with different subparts of the same array is unclear to me.
Code example:
#pragma acc data pcopyin(a[0:20])
{
    #pragma acc data pcopyin(a[100:20])
    {
        #pragma acc parallel loop
        for(i=0; i<20; i++) {
            a[i] = i;
            a[i+100] = i;
        }
    }
}
My understanding is that this should work (or at least the two acc data regions should):
The first pragma checks whether a[0:20] is present on the accelerator.
No -> the data are allocated on the device and transferred.
The second pragma checks whether a[100:20] is present on the accelerator.
The pointer a is on the accelerator, but not the data of a[100:20].
The data are allocated on the device and transferred.
I tried this kind of thing with the CAPS compiler (v3.3.0, which is the only one available right now on my test machine), and the second pragma acc data returns an error (my second subarray does not have the correct shape).
So what happens with my test (I suppose) is that the pointer "a" was found on the accelerator, but the shape associated with it ([0:20]) is not the same in my second pragma ([100:20]).
Is this the normal behavior planned in the API, or should my example work?
Moreover, if this is supposed to work, is there some sort of coherence between the subparts of the same array (somehow, will they be positioned like on the host, so that I can put a[i] += a[100+i] in my kernel)?
The present test checks whether "a" is on the device. Hence, when the second data region is encountered, "a" is already on the device, but only partially. A better method would be to add a pointer that points into "a" and reference that pointer on the device. Something like:
#include <stdio.h>

int main () {
    int a[200];
    int *b;
    int i;

    for(i=0; i<200; i++) a[i] = 0;

    b = a + 100;

    #pragma acc data pcopy(a[0:20])
    {
        #pragma acc data pcopy(b[0:20])
        {
            #pragma acc parallel loop
            for(i=0; i<20; i++) {
                a[i] = i;
                b[i] = i;
            }
        }
    }

    for(i=0; i<22; i++) printf("%d = %d \n", i, a[i]);
    for(i=100; i<122; i++) printf("%d = %d \n", i, a[i]);
    return 0;
}
If you had just copied "a[100:20]", then accessing outside this range would be considered a programmer error.
Hope this helps,
Mat