I am using a primitive array type in Xcode. Example:
int matrix[10][10];
I am also using a simple loop to initialise the array
for(int x=0;x<=10;x++)
    for(int y=0;y<=10;y++)
        matrix[x][y] = 0;
I initialize several matrices in this manner throughout the code. I noticed that at times, after the initialization is performed, an array that was previously initialized or updated now contains garbage. Is there a simpler way to initialize an array of this type? And/or why does this seem to corrupt other arrays?
Your array has 10 positions in each dimension, but your loops iterate over eleven indices (0 through 10), writing one element past the end.
Try
for(int x = 0; x < 10; x++)
for(int y = 0; y < 10; y++)
matrix[x][y] = 0;
Notice the use of the less-than operator instead of less-than-or-equal-to.
I suspect you aren't declaring separate variables for the different matrices and are overwriting them by mistake.
Try it...
matrix = [[NSArray alloc] init];
int myArray[10][10] = {};
This will create the array and initialize all elements to 0.
The most likely cause of corruption like you're seeing (provided that you haven't made the error that @Renan notes) is that you're expecting stack-allocated memory to remain valid outside of its scope. You can't, for instance, return matrix to a caller, since the stack frame it's created on will vanish.
Also, since you're allocating on the stack, you need to be careful of your matrix size. If it gets too large, then you'll get stack corruption. 100 ints is generally fine if you're not recursing deeply, but keep in mind the stack limits:
OS X main thread: 8MB
iOS main thread: 1MB
All secondary threads: 512kB
That's the whole stack (all frames, not just the current frame).
Related
I have a Vulkan compute shader with an array shared inside the local group, and I want to perform the following transform:
Basically, I want to remove/prune all the zeros. Is there a fast or parallel method to do this?
I tried to do this in series as follows
shared int arraySize;
shared int array[256];
shared int compressed_array[256];
/* ... prepare array in parallel ... */
// run in series on the 1st worker
if (gl_LocalInvocationID.x == 0 && gl_LocalInvocationID.y == 0) {
    arraySize = 0; // initialize arraySize
    for (int i = 0; i < 256; i++) {
        if (array[i] > 0) // incrementally search for non-zeroes
        {
            compressed_array[arraySize] = array[i];
            arraySize = arraySize + 1;
        }
    }
}
but it seems to take 1-2 ms with 256 elements of the array on my GPU. Is there any faster way to do this? A parallel algorithm would presumably be faster; does such an algorithm exist?
Thanks to Yuri Kilochek's answer, I was able to find a solution via parallel prefix sums/scan.
Suppose a 'flag' array is added to the data that is 1 if the corresponding array cell is non-zero and 0 otherwise. Then a 'parallel exclusive scan' on the flag array yields the index in compressed_array to which invocations with an active flag should write their respective 'array' contents (a scatter operation).
In Vulkan this can be implemented efficiently using subgroups.
However, each subgroup can typically only perform scans up to 32 (Nvidia) or 64 (AMD) elements long while the local group may be several times larger. To perform scans over the whole local group, a layered approach of scans as described here and coded here is necessary.
Yes. This problem is called parallel stream compaction.
I saw in some source code something like this
for (int i = 0; i < NUM; i++) {
    count[i] = new int;
    *count[i] = 0;
}
And was wondering what the point was as opposed to just having:
count[i] = 0;
Well, initializing a pointer to zero has a different meaning than initializing the pointed value to zero.
Dereferencing a pointer that points to a valid object is OK, and returns the value of the object. Dereferencing a pointer with the value zero (i.e. a null pointer) has undefined behaviour.
You may instead be wondering why you would want to use an array of pointers to dynamically allocated integers, instead of an array of integers. You're right to question it, since it is quite rarely a rational choice. However, this snippet doesn't demonstrate any reason for doing so. If possible, you may find out by asking the person who wrote the code.
I'm new to C++ and would like to ask if the code below is an example of a dangling pointer or a memory leak because it is pointing outside the dynamically allocated array:
int * n = new int[10];
for (int prev = 0; prev < 10; prev++) {
    *n = *(n + prev + 1);
}
delete[] n;
n = nullptr;
A dangling pointer is a pointer which points to an address where no object resides. I.e. it points at invalid memory. The word "dangling" usually carries the connotation that it used to point to something valid and that something got destroyed (either because it was explicitly deallocated or because it went out of scope).
A memory leak happens when you lose all track of a dynamically allocated piece of memory; that is, when you "forget" the last pointer that was pointing to that memory, meaning you can no longer deallocate it. Your code would create a memory leak if you did n = nullptr; before calling delete[] n;.
If I had to describe your case with one of these two terms, it would be "dangling pointer," simply because you're reaching beyond the buffer in the last iteration. However, I wouldn't normally call it a "dangling pointer," because it was never valid in the first place. I would call this a "buffer overrun" or an "out-of-bounds access."
What is the difference between a dangling pointer and memory leak?
You could say a dangling pointer is the opposite of a memory leak.
One is a pointer that doesn't point to valid memory, and one is valid memory that nothing points to.
(But as the other answers point out, your code is neither.)
Let's make some canonical examples first:
Memory Leak
int *x;
x = new int;
x = nullptr;
We have allocated an integer on the heap, and then we lost track of it. We have no ability to call delete on that integer at this point. This is a memory leak.
Dangling Pointer
int *x;
x = new int;
delete x;
x is now a dangling pointer. It points to something that used to be valid memory. If we were to use *x at this point, we would be accessing memory that we shouldn't be. Normally, to solve this, after delete x;, we do x = nullptr;
Your code
Your code has a different issue. I'm going to reduce it so that we can more easily talk about the same thing:
int *x;
x = new int[10];
x[9] = x[10];
I would describe this as neither of the above cases. It's a buffer overrun.
I'm a beginner at cuda and am having some difficulties with it
If I have an input vector A and a result vector B, both with size N, and B[i] depends on all elements of A except A[i], how can I code this without having to call a kernel multiple times inside a serial for loop? I can't think of a way to parallelise both the outer and inner loop simultaneously.
edit: Have a device with cc 2.0
example:
// a = some stuff
int i;
int j;
double result = 0;
for(i=0; i<1000; i++) {
    double ai = a[i];
    for(j=0; j<1000; j++) {
        double aj = a[j];
        if (i == j)
            continue;
        result += ai - aj;
    }
}
I have this at the moment:
//in host
int i;
for(i=0; i<1000; i++) {
    kernelFunc<<<2, 500>>>(i, d_a);
}
Is there a way to eliminate the serial loop?
Something like this should work, I think:
__global__ void my_diffs(const double *a, double *b, const int length){
    unsigned idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < length){
        double my_a = a[idx];
        double result = 0.0;
        for (int j=0; j<length; j++)
            result += my_a - a[j];
        b[idx] = result;
    }
}
(written in browser, not tested)
This can possibly be further optimized in a couple ways, however for cc 2.0 and newer devices that have L1 cache, the benefits of these optimizations might be small:
use shared memory - we can reduce the number of global loads to one per element per block. However, the initial loads will be cached in L1, and your data set is quite small (1000 double elements?), so the benefits might be limited
create an offset indexing scheme, so each thread is using a different element from the cacheline to create coalesced access (i.e. modify j index for each thread). Again, for cc 2.0 and newer devices, this may not help much, due to L1 cache as well as the ability to broadcast warp global reads.
If you must use a cc 1.x device, then you'll get significant mileage out of one or more optimizations -- the code I've shown here will run noticeably slower in that case.
Note that I've chosen not to bother with the special case where we subtract a[i] from itself: that term is exactly zero and does not disturb your results. If you're concerned about it, you can special-case it out easily enough.
You'll also get more performance if you increase the blocks and reduce the threads per block, perhaps something like this:
my_diffs<<<8,128>>>(d_a, d_b, len);
The reason for this is that many GPUs have more than 1 or 2 SMs. To maximize perf on these GPUs with such a small data set, we want to try and get at least one block launched on each SM. Having more blocks in the grid makes this more likely.
If you want to fully parallelize the computation, the approach would be to create a 2D matrix (let's call it c[...]) in GPU memory, of square dimensions equal to the length of your vector. I would then create a 2D grid of threads, and have each thread perform the subtraction (a[row] - a[col]) and store its result in c[row*len+col]. I would then launch a second (1D) kernel to sum along the rows of c (each thread has a loop to sum one row) to create the result vector b. However, I'm not sure this would be any faster than the approach I've outlined. Such a "more fully parallelized" approach also wouldn't lend itself as easily to the optimizations I discussed.
I was looking through the selection sort algorithm on cprogramming.com
and I think I found an error in the implementation.
If you work through the algorithm, there's a variable called "index_of_min" which I believe should be "index_of_max" (since when I tested it, it was sorting largest to smallest).
Thinking that it was a typo or a minor mistake, I checked some other websites like Wikipedia and some lesser-known websites like Geekpedia. It seems they all call it index_of_min.
When I ran it through the debugger, it really seemed to me that it's the max value's index. Am I making a mistake somewhere?
Edit: As Svante pointed out, only the cprogramming implementation is wrong. Wikipedia and Geekpedia are fine.
The wikipedia and geekpedia sites seem to be correct, the cprogramming.com implementation actually has a bug; this:
if (array[index_of_min] < array[y])
{ index_of_min = y; }
has the order reversed, it should be:
if (array[y] < array[index_of_min])
{ index_of_min = y; }
Another fix would be to call the variable index_of_max, but I would expect a sorting algorithm to sort smallest to largest, and if this expectation is shared by the majority of programmers (as I presume), the principle of least astonishment rather demands the above fix.
I've only just read the code, but it looks like you're right: either index_of_min is misnamed or the comparison is backwards.
It isn't as strange as it might seem to see this error in several places. It's quite likely that each is copied from a single common source.
From Cprogramming.com: "It works by selecting the smallest (or largest, if you want to sort from big to small) element of the array and placing it at the head of the array." So they have it sorting from large to small; the code isn't wrong, nor is the variable naming. index_of_min keeps track of the starting point in the array (0) and then moves forward in that array, i.e. index_of_min keeps the smallest index value. Do not confuse it with whatever the value is at that index.
You are right. The code from that website (shown below) is incorrect.
for(int x = 0; x < n; x++)
{
    int index_of_min = x;
    for(int y = x; y < n; y++)
    {
        if(array[index_of_min] < array[y]) /* Here's the problem */
        {
            index_of_min = y;
        }
    }
    int temp = array[x];
    array[x] = array[index_of_min];
    array[index_of_min] = temp;
}
At the end of the inner loop, for(int y=x; y<n; y++), the variable, index_of_min, holds the index of the maximum value. Assuming it was designed to sort the array from largest to smallest, this is a poorly named variable.
If you want the array sorted smallest to largest (as one would expect), you need to reverse the if statement:
if (array[y] < array[index_of_min])
{
index_of_min = y;
}