I would like to know whether these two pieces of code perform the same with respect to the variable declarations:
int Value;
for (int i = 0; i < 1000; i++)
{
    Value = i;
}
or
for (int i = 0; i < 1000; i++)
{
    int Value = i;
}
Basically, I need to know whether the time to create the variable Value and allocate it in RAM is spent only once in the first case, and whether it is repeated 1000 times in the second.
If you are programming in C++ or C#, there will be no runtime difference, since no implicit initialization is done for a simple int type.
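To illustrate the point with a small sketch of my own (not from the original answer): declaring a plain int has no per-iteration cost, while a type with a real constructor and destructor would typically pay that cost on every iteration when declared inside the loop.

#include <string>

void example()
{
    // Plain int: no constructor runs, and the compiler typically reserves the
    // stack slot once for the whole function, so placement makes no difference.
    for (int i = 0; i < 1000; i++)
    {
        int value = i;
        (void)value;
    }

    // A type with a non-trivial constructor/destructor is different:
    // declared inside the loop, it is constructed and destroyed 1000 times.
    for (int i = 0; i < 1000; i++)
    {
        std::string s = "hello";
        (void)s;
    }
}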
I need to set the last element of an array by multiplying the last "i" with itself, like this. But when I try to do i*i, i is undefined. Also, when I try to print the result, cout is undefined.
void firstArray(void)
{
    int MyArray[10];
    for (unsigned int i = 0; i < 10; ++i)
    {
        MyArray[i] = i;
    }
    MyArray[9] = i * i;
    for (unsigned int i = 0; i < 10; ++i)
    {
        cout(MyArray[i]);
    }
}
I tried to put MyArray[9] = i*i inside the loop with a condition (and it would work), but I can't use any if for this assignment.
I also tried putting System.out before cout, like in Java, but System is undefined.
What do I need to change to make it work?
Ok, so first of all, if you want "i" outside of your loop, you need to declare "i" outside of the loop too:
unsigned int i;
for (i = 0; i < 10; i++) ...
Now, after the loop, "i" will hold the last incremented value.
Also, I suggest you read up on the C++ basics: System doesn't exist in C++; cout lives in the std namespace instead.
There are two ways to use it: either put
using namespace std;
inside the function, or write "std::" in front of cout:
std::cout << MyArray[i];
Pay attention to how I wrote it. You will find how to do it on the C++ reference website:
https://www.cplusplus.com/reference/iostream/cout/
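For reference, a minimal corrected sketch of the function that puts those two fixes together (i declared before its loop, and std::cout used for output); this is just one way to write it:

#include <iostream>

void firstArray(void)
{
    int MyArray[10];
    unsigned int i;                 // declared outside so it survives the loop
    for (i = 0; i < 10; ++i)
    {
        MyArray[i] = i;
    }
    MyArray[9] = i * i;             // after the loop i is 10, so the last element becomes 100
    for (unsigned int j = 0; j < 10; ++j)
    {
        std::cout << MyArray[j] << std::endl;
    }
}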
When I try this I get the wrong result in 'output', even though I am copying the values of the 'cum' array to output.
But if I rename the 'cum' array mentioned earlier in the code, I get the correct values, so I am unable to reuse the result values.
The device has 8 cores with no shared memory.
Any and all comments/suggestions are appreciated.
kernel void histogram(global unsigned int *input,
                      global unsigned int *output,
                      global unsigned int *frequency,
                      global unsigned int *cum,
                      unsigned int N)
{
    int pid = get_global_id(0);

    // cumulative sum
    for (int i = 0; i < 16; i++)
    {
        cum[(i*16)+(2*pid)+1] = frequency[(i*16)+(2*pid)] + frequency[(i*16)+(2*pid)+1];
    }
    barrier(CLK_GLOBAL_MEM_FENCE);

    for (int i = 0; i < 32; i++)
    {
        output[(i*8)+pid] = cum[(i*8)+pid];
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
Make sure you understand parallel prefix sums. In particular, I don't see a downsweep step of the total sum or of the parts:
Parallel Prefix Sum (Scan) with CUDA
I'd look in the TI Keystone II SDK you're using (from OpenCL device memory read/write issue) to see if it has any scan or parallel prefix sum implementations or built-in functions.
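To show what the upsweep/downsweep structure looks like, here is a small sequential C++ sketch of the Blelloch work-efficient exclusive scan (my own illustration, not OpenCL code); a parallel version runs each inner loop's iterations concurrently, with a barrier between levels:

#include <vector>
#include <cstddef>

// Blelloch-style exclusive scan, written sequentially for illustration.
void blelloch_exclusive_scan(std::vector<unsigned int>& a)
{
    const std::size_t n = a.size();   // assumed to be a power of two

    // Upsweep (reduce): build partial sums up the tree.
    for (std::size_t stride = 1; stride < n; stride *= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];

    // Downsweep: clear the root, then push prefix sums back down the tree.
    a[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
        {
            unsigned int t = a[i - stride];
            a[i - stride] = a[i];
            a[i] += t;
        }
}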
I was recently told that, to improve a for loop, it is convenient to use a decremental loop instead of an incremental one, like:
for (int i = 0; i < Limit; i++)
{
    //code
}
for (int i = Limit - 1; i >= 0; i--)
{
    //code
}
I am not seeing why some people would recommend this. Their argument was:
"using incremental for loops increases the number of validations inside the loop. when using decremental loops validations and processing time are reduced"
Often times, you will see something like this:
for(int i = 0; i < list.length(); i++) {
}
But often you could do this instead:
for(int i = list.length() - 1; i >= 0; i--) {
}
That way, list.length() is called only once.
Of course, this is also a possibility:
int length = list.length();
for(int i = 0; i < length; i++) {
}
But the first way was shorter.
Rarely do I ever think of fixing a real performance problem with such a trivial and (usually) inconsequential change. And if for some reason .length() really is eating CPU time, I prefer the last way.
I have to load the data from a file.
Each sample is 20-dimensional.
So I used this data structure to help me with this:
class DataType
{
public:
    vector<float> d;
};
But when I use this variable definition, it does not work.
thrust::host_vector<DataType> host_input;
// after initializing the host input;
thrust::device_vector<DataType> device_input = host_input;
for(unsigned int i = 0; i < device_input.size(); i++)
    for(unsigned int j = 0; j < dim; j++)
        cout << device_input[i].d[j] << endl;
It does not work. The compiler tells me that I cannot use the (host-side) vector inside device_input, because device_input lives on the device (GPU), while vector is implemented on the CPU.
So what is the suitable way to give a correct definition of DataType?
std::vector requires host-side dynamic memory allocation, so it cannot be used on the device side.
This should work:
class DataType
{
public:
    float d[20];
};
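A small, untested sketch of how that fixed-size version might be used with Thrust (my own illustration); note that an element read from a device_vector is copied back into a host-side DataType before its members are accessed:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <iostream>

class DataType
{
public:
    float d[20];   // plain fixed-size array: no host-side allocation needed
};

int main()
{
    const unsigned int dim = 20;

    thrust::host_vector<DataType> host_input(4);
    for (unsigned int i = 0; i < host_input.size(); i++)
        for (unsigned int j = 0; j < dim; j++)
            host_input[i].d[j] = static_cast<float>(i * dim + j);

    // Element-wise copy to the device works because DataType holds no host pointers.
    thrust::device_vector<DataType> device_input = host_input;

    for (unsigned int i = 0; i < device_input.size(); i++)
    {
        DataType sample = device_input[i];   // copies one element back to the host
        for (unsigned int j = 0; j < dim; j++)
            std::cout << sample.d[j] << " ";
        std::cout << std::endl;
    }
    return 0;
}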
I have written the following simple C++ code.
#include <iostream>
#include <omp.h>

using namespace std;

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    cout << "Enter my Number Value" << endl;
    cin >> myNumber;

    #pragma omp parallel for reduction(+:numOfHits)
    for (int i = 0; i <= 100000; ++i)
    {
        for (int j = 0; j <= 100000; ++j)
        {
            for (int k = 0; k <= 100000; ++k)
            {
                if (i + j + k == myNumber)
                    numOfHits++;
            }
        }
    }

    cout << "Number of Hits: " << numOfHits << endl;
    return 0;
}
As you can see I use OpenMP to parallelize the outermost loop. What I would like to do is to rewrite this small code in CUDA. Any help will be much appreciated.
Well, I can give you a quick tutorial, but I won't necessarily write it all for you.
So first of all, you will want to get MS Visual Studio set up with CUDA, which is easy following this guide: http://www.ademiller.com/blogs/tech/2011/05/visual-studio-2010-and-cuda-easier-with-rc2/
Now you will want to read The NVIDIA CUDA Programming Guide (free pdf), documentation, and CUDA by Example (A book I highly recommend for learning CUDA).
But let's say you haven't done that yet, and definitely will later.
This is an extremely arithmetic-heavy and data-light computation - actually, it can be computed fairly simply without this brute-force method, but that isn't the answer you are looking for. I suggest something like this for the kernel:
__global__ void kernel(int* myNumber, int* numOfHits){
    //a shared value is stored on-chip, which is beneficial since it is written to many times
    //it is shared by all threads in the block
    __shared__ int s_hits;

    //these identify the current thread uniquely
    int start_i = threadIdx.x + blockIdx.x*blockDim.x;
    int start_j = threadIdx.y + blockIdx.y*blockDim.y;

    //one thread per block zeroes the shared counter before anyone uses it
    if(threadIdx.x == 0 && threadIdx.y == 0)
        s_hits = 0;
    __syncthreads();

    //we increment i and j by an amount equal to the number of threads in one dimension
    //of the block, 16 usually, times the number of blocks in one dimension,
    //which can be quite large (but not 100,000)
    for(int i = start_i; i < 100000; i += blockDim.x*gridDim.x){
        for(int j = start_j; j < 100000; j += blockDim.y*gridDim.y){
            //Thanks to talonmies for this simplification
            if(0 <= (*myNumber-i-j) && (*myNumber-i-j) < 100000){
                //atomics are needed here, otherwise the value may change
                //during another thread's 'read, modify, write'
                atomicAdd(&s_hits, 1);
            }
        }
    }

    //synchronize threads, so we know s_hits is completely updated
    __syncthreads();

    //again atomics: only one thread per threadblock adds s_hits into the global total
    if(threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(numOfHits, s_hits);
    return;
}
To launch the kernel, you will want something like this:
dim3 blocks(some_number, some_number, 1); //some_number should be hand-optimized
dim3 threads(16, 16, 1);
kernel<<<blocks, threads>>>(/*args*/);
I know you probably want a quick way to do this, but getting into CUDA isn't really a 'quick' thing. As in, you will need to do some reading and some setup to get it working; past that, the learning curve isn't too high. I haven't told you anything about memory allocation yet, so you will need to do that (although it is simple). If you followed my code, my goal was to make you read up a bit on shared memory and CUDA, so you are already kick-started. Good luck!
Disclaimer: I haven't tested my code, and I am not an expert - it could be idiotic.
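For completeness, here is a rough, untested sketch of the host-side setup left out above (allocation, copies, the launch, and copying the result back); the grid size here is just a placeholder to be tuned:

#include <iostream>
#include <cuda_runtime.h>

// forward declaration of the kernel shown above
__global__ void kernel(int* myNumber, int* numOfHits);

int main()
{
    int myNumber = 0;
    int numOfHits = 0;

    std::cout << "Enter my Number Value" << std::endl;
    std::cin >> myNumber;

    // allocate device copies of the two integers and send the inputs over
    int *d_myNumber = 0, *d_numOfHits = 0;
    cudaMalloc(&d_myNumber, sizeof(int));
    cudaMalloc(&d_numOfHits, sizeof(int));
    cudaMemcpy(d_myNumber, &myNumber, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_numOfHits, &numOfHits, sizeof(int), cudaMemcpyHostToDevice);

    dim3 blocks(64, 64, 1);   // placeholder; should be hand-optimized
    dim3 threads(16, 16, 1);
    kernel<<<blocks, threads>>>(d_myNumber, d_numOfHits);

    // copy the result back (this also waits for the kernel to finish)
    cudaMemcpy(&numOfHits, d_numOfHits, sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "Number of Hits: " << numOfHits << std::endl;

    cudaFree(d_myNumber);
    cudaFree(d_numOfHits);
    return 0;
}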