Insertion sort in C

I implemented an insertion sort in C, and someone who was helping me told me to make something a pointer, as shown in the following line near the end, but why?
size_t size = sizeof( array ) / sizeof( *array );
Why is the second one a pointer to array, and what does size_t do?

sizeof(array) = size, in bytes, of the entire array;
sizeof(*array) = size, in bytes, of the first item in the array;
As items in a C array are of uniform size, dividing the first by the second gives the number of items in the array.
size_t is an unsigned integer type large enough to store the size of any object the computer can store in memory. So it is often the same size as an unsigned int, but that is not guaranteed, and there is semantic value in it being a distinct type.
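As a quick check, here is a minimal compilable sketch of the idiom (the array length and element type are just example choices):
#include <stdio.h>

int main(void)
{
    int array[8] = {0};
    size_t size = sizeof(array) / sizeof(*array);
    printf("sizeof(array)  = %zu\n", sizeof(array));   // 32 with a 4-byte int
    printf("sizeof(*array) = %zu\n", sizeof(*array));  // 4 with a 4-byte int
    printf("size           = %zu\n", size);            // 8, the element count
    return 0;
}
Keep in mind that this only works while the array itself is in scope; once the array has decayed to a pointer (for example, when passed to a sort function), sizeof gives the size of the pointer, not of the array.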

Why is the second one a pointer to array
Example 1
char a[5];
sizeof(a)=5
sizeof(*a)=1
So, size = 5/1 = 5 // this is the number of elements in the array
Example 2
int a[5];
sizeof(a)= 20
sizeof(*a)=4
So, size = 20/4 = 5 // this is the number of elements in the array
and what does size_t do?
Read: What is size_t in C?

Related

Why is the difference between two pointers to different elements of an array the number of elements between them?

#include <iostream>
using namespace std;

int main()
{
    int arr[] = {2, 3, 5, 6, 8};
    int *ptr;
    ptr = &arr[3];
    cout << ptr - arr;
}
Q: Why is the answer 3 after running this code? Shouldn't it be 3*sizeof(int), which in this case would be 3*4 = 12?
When you subtract pointers you get the distance between them, not the allocated size. The same goes for iterators in STL.
https://en.cppreference.com/w/cpp/language/operator_arithmetic#Additive_operators
The reason is that it is much easier to write correct code.
When the pointer difference between consecutive elements of an array is 1, then you can use ++p to walk through the array (assuming p is a pointer to an element). For example:
int a[10];
for (auto p = a, e = a + 10; p != e; ++p)
*p = 42;
Notice how the code does not have to deal with the size of the elements. If the array type changes from int to double, the code does not have to change and is still correct.
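To see both distances side by side, here is a small sketch (the cast to char* is only there to make the arithmetic happen in bytes; the value 12 assumes a 4-byte int):
#include <iostream>

int main()
{
    int arr[] = {2, 3, 5, 6, 8};
    int *ptr = &arr[3];
    std::cout << ptr - arr << '\n';               // 3: distance in elements
    std::cout << (char*)ptr - (char*)arr << '\n'; // 12: the same distance in bytes
}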

Subtraction: replace blanks with zeros

// this is a subtraction example
int x=3098;
int z=3088;
int somme=x-z;
char buffer[4];
// convert int to char
itoa(somme,buffer,10);
// I want the buffer to hold the value as "0010",
// not as "10"
Then you have to use a formatter; the standard ones in C are the printf family. Take care of the length of the array: to store a string of length n you need an array of length n+1 (C strings are null-terminated). Thus:
// this is a subtraction example
int x=3098;
int z=3088;
int somme=x-z;
char buffer[5];
sprintf(buffer,"%04d",somme);
will fit your needs. %04d means: format the integer somme as a decimal representation of width 4, padded with leading zeros if needed, and store the result in memory starting at the beginning of buffer.
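Put together as a complete program (sticking to sprintf, since itoa is not part of standard C), it prints 0010:
#include <stdio.h>

int main(void)
{
    int x = 3098;
    int z = 3088;
    int somme = x - z;      // 10
    char buffer[5];         // 4 characters + terminating '\0'
    sprintf(buffer, "%04d", somme);
    printf("%s\n", buffer); // prints 0010
    return 0;
}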

Selection sort in CUDA

So, I'm trying to implement selection sort in CUDA, but so far I haven't been very successful.
__device__ void selection_sort( int *data, int left, int right ){
for( int i = left ; i <= right ; ++i ){
int min_val = data[i];
int min_idx = i;
// Find the smallest value in the range [left, right].
for( int j = i+1 ; j <= right ; ++j ){
int val_j = data[j];
if( val_j < min_val ){
min_idx = j;
min_val = val_j;
}
}
// Swap the values.
if( i != min_idx ){
data[min_idx] = data[i];
data[i] = min_val;
}
}
}
My main attempt here is to find the minimum and parallelize the solution. Now, I realize the code looks very C++'ish, but I'm nowhere near as skilled in CUDA.
Is there a way to parallelize the solution? Are there any more additions to be made?
Selection sort algorithm for N numbers can be roughly described as:
for i from N-1 down to 0
find the maximum element among data[0] ~ data[i]
swap that maximum element with data[i] within the data array
The first part (finding the maximum element) falls into a widely known and well documented class of problems called reduction. However, to perform the second part (swapping), you must track the index of the maximum element while comparing the values, and it is not so natural to do that while performing a reduction. This is one of the reasons why selection sort does not port well to parallel architectures.
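As a rough illustration of what "tracking the index" during a reduction means, a warp-level argmax can shuffle the candidate index alongside the value. This is only a sketch, not part of the answer's code; it assumes a full warp and the CUDA 9+ *_sync shuffle intrinsics:
// Warp-level argmax sketch: after the call, every lane holds the warp's
// maximum value together with the index it came from.
__device__ void warpArgMax(int &val, int &idx)
{
    for (int offset = 16; offset > 0; offset /= 2) {
        int otherVal = __shfl_down_sync(0xffffffff, val, offset);
        int otherIdx = __shfl_down_sync(0xffffffff, idx, offset);
        if (otherVal > val) {
            val = otherVal;
            idx = otherIdx;
        }
    }
    // Lane 0 now has the maximum and its index; broadcast so all lanes agree.
    val = __shfl_sync(0xffffffff, val, 0);
    idx = __shfl_sync(0xffffffff, idx, 0);
}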
Also, you can see that the problem size shrinks by one on every iteration, and this is another aspect of the selection sort algorithm that does not map well to parallel architectures. In CUDA, 32 threads form a warp, which executes at the same time. Although you can have only some of the threads in a warp do useful work, that is generally not recommended because the idle lanes are a loss of computing power.
I've tried to build a CUDA version of selection sort myself, but I stopped doing it because it seems there are better algorithms well suited for CUDA. But I'll just show you what I've done so far to illustrate why selection sort is not good for CUDA.
First, start from a small and simple problem: sorting 32 elements. Since 32 threads form a warp, you can use shuffle instructions to find the maximum value. (Full code)
// Finds the maximum element within a warp and gives the maximum element to
// thread with lane id 0. Note that other elements do not get lost but their
// positions are shuffled.
__inline__ __device__ int warpMax(int data, unsigned int threadId)
{
for (int mask = 16; mask > 0; mask /= 2) {
int dual_data = __shfl_xor_sync(0xffffffff, data, mask, 32);
if (threadId & mask)
data = min(data, dual_data);
else
data = max(data, dual_data);
}
return data;
}
__global__ void selection32(int* d_data, int* d_data_sorted)
{
unsigned int threadId = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int laneId = threadIdx.x % 32;
int n = N;
while(n-- > 0) {
// get the maximum element among d_data and put it in d_data_sorted[n]
int data = d_data[threadId];
data = warpMax(data, threadId);
d_data[threadId] = data;
// now maximum element is in d_data[0]
if (laneId == 0) {
d_data_sorted[n] = d_data[0];
d_data[0] = INT_MIN; // this element is ignored from now on
}
}
}
int main()
{
// ... build data and transfer to d_data ...
selection32<<<1, 32>>>(d_data, d_data_sorted);
// ... get the sorted array stored at d_data_sorted ...
}
(Some may argue that this is not exactly a selection sort, since 1) the array elements of the unsorted area keep getting shuffled, and 2) it is not an in-place sort. Please note that I'm just trying to show that selection sort does not fit CUDA well. Also, note that warpMax has highly divergent branches, making it less than optimal for CUDA.)
The case with only 1 warp of elements may look parallel-ish, but things get worse when the problem size grows to multiple warps. Let's look at the case of 1024 elements. (I've chosen the number 1024 because it is the maximum number of threads in a block.) Now there are 32 warps, and after calling warpMax for each warp, we must compare the maximum elements of the warps to get the maximum element among the 1024 elements. This comparison of 32 warp-maximum values cannot be done with warpMax, because we need to track which warp the maximum value came from in order to swap it with the last element of the data array. One way I can think of for doing this is using one single thread to compare the warp-maximum values, as sketched below. This is not a good implementation for CUDA because the other 1023 threads in the block become idle.
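A hedged sketch of that single-thread step (names such as warpVal and warpIdx are illustrative; they are assumed to hold the 32 warp maxima and their positions, written to shared memory by a warp-level reduction beforehand):
// Block-level argmax sketch: one thread scans the 32 per-warp results.
__device__ void blockArgMaxSingleThread(const int *warpVal, const int *warpIdx,
                                        int *outVal, int *outIdx)
{
    if (threadIdx.x == 0) {           // the other 1023 threads sit idle here
        int bestVal = warpVal[0];
        int bestIdx = warpIdx[0];
        for (int w = 1; w < 32; ++w) {
            if (warpVal[w] > bestVal) {
                bestVal = warpVal[w];
                bestIdx = warpIdx[w];
            }
        }
        *outVal = bestVal;            // overall maximum of the block
        *outIdx = bestIdx;            // position it came from, ready for the swap
    }
    __syncthreads();                  // make the result visible to the whole block
}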
Furthermore, if the problem size grows larger than a block can cover, we need to compare the maximum values of each block, which implies launching separate kernels, since we need to synchronize between blocks. And, needless to say, we also have to keep track of which block the maximum value came from. All of this says that implementing selection sort for CUDA is not a good idea.

Warp-shuffle reduction of arrays of any length

I am working on a CUDA kernel which performs a vector dot product (A x B). I assumed that the length of each vector is a multiple of 32 (32, 64, ...) and defined the block size to be equal to the length of the array. Each thread in the block multiplies one element of A with the corresponding element of B (thread i ==> psum = A[i] * B[i]). After the multiplication, I used the following functions, which use the warp-shuffle technique to perform a reduction and calculate the sum of all the products.
__inline__ __device__
float warpReduceSum(float val) {
int warpSize =32;
for (int offset = warpSize/2; offset > 0; offset /= 2)
val += __shfl_down_sync(0xffffffff, val, offset);
return val;
}
__inline__ __device__
float blockReduceSum(float val) {
static __shared__ float shared[32]; // Shared mem for 32 partial sums
int lane = threadIdx.x % warpSize;
int wid = threadIdx.x / warpSize;
val = warpReduceSum(val); // Each warp performs partial reduction
if (lane==0)
shared[wid]=val; // Write reduced value to shared memory
__syncthreads(); // Wait for all partial reductions
//read from shared memory only if that warp existed
val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
if (wid==0)
val = warpReduceSum(val); // Final reduce within first warp
return val;
}
I simply call blockReduceSum(psum), where psum is the product of two elements computed by one thread.
This approach doesn't work when the length of the array is not a multiple of 32, so my question is: can we change this code so that it also works for any length? Or is it impossible, because if the length of the array is not a multiple of 32, some warps have elements belonging to more than one array?
First of all, depending on the GPU you are using, performing a dot product with just 1 block will probably not be very efficient (as long as you are not batching several dot products in 1 kernel, each done by a single block).
To answer your question: you can reuse the code you have written by calling your kernel with the number of threads being the closest multiple of 32 that is higher than N (the length of the array) and introducing an if statement before the call to blockReduceSum, like this:
__global__ void kernel(float * A, float * B, int N) {
float psum = 0;
if(threadIdx.x < N) // threadIdx.x because you are using a single block; you will need to change it to a more general id once you move to multiple blocks
psum = A[threadIdx.x] * B[threadIdx.x];
psum = blockReduceSum(psum); // thread 0 of the block ends up with the full sum
// The rest of the computation
}
That way, threads that do not have an array element associated with them, but that need to exist because of the use of __shfl, will contribute 0 to the sum.
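For example, the launch side might round the thread count up to the next multiple of 32. This is just a sketch under the assumption of a single block (as in the question), with d_A and d_B standing for device arrays that have already been allocated and filled:
// Host-side sketch: pad the block size up to a multiple of the warp size.
int threads = ((N + 31) / 32) * 32;   // smallest multiple of 32 that is >= N
kernel<<<1, threads>>>(d_A, d_B, N);  // the extra threads contribute 0 via the if() guard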

Pointer arithmetic in C++ uses sizeof(type) increments instead of byte increments?

I am confused by the behavior of pointer arithmetic in C++. I have an array and I want to go N elements forward from the current one. Since in C++ a pointer is a memory address in BYTES, it seemed logical to me that the code would be newaddr = curaddr + N * sizeof(mytype). It produced errors though; later I found that with newaddr = curaddr + N everything works correctly. Why so? Should it really be address + N instead of address + N * sizeof?
Part of my code where I noticed it (2D array with all memory allocated as one chunk):
// creating pointers to the beginning of each line
if((Content = (int **)malloc(_Height * sizeof(int *))) != NULL)
{
// allocating a single memory chunk for the whole array
if((Content[0] = (int *)malloc(_Width * _Height * sizeof(int))) != NULL)
{
// setting up line pointers' values
int * LineAddress = Content[0];
int Step = _Width * sizeof(int); // <-- this gives errors, just "_Width" is ok
for(int i=0; i<_Height; ++i)
{
Content[i] = LineAddress; // faster than
LineAddress += Step; // Content[i] = Content[0] + i * Step;
}
// everything went ok, setting Width and Height values now
Width = _Width;
Height = _Height;
// success
return 1;
}
else
{
// insufficient memory available
// need to delete line pointers
free(Content);
return 0;
}
}
else
{
// insufficient memory available
return 0;
}
Your error in reasoning is right here: "Since in C++ a pointer is a memory address in BYTES, [...]".
A C/C++ pointer is not a memory address in bytes. Sure, it is represented by a memory address, but you have to differentiate between a pointer type and its representation. The operation "+" is defined for a type, not for its representation. Therefore, when it is called on the type int *, it respects the semantics of this type. So + 1 on an int * will advance the pointer by as many bytes as the underlying int representation uses.
You can of course cast your pointer to a sufficiently wide integer type, for example (intptr_t)myPointer. Now you have a numeric type (instead of a pointer type), where + 1 will work as you would expect from a numeric type. Note that after this cast, the representation stays the same, but the type changes.
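A tiny sketch of the difference (the char* cast makes the subtraction happen in bytes; the value 4 assumes a 4-byte int):
#include <iostream>

int main()
{
    int a[4] = {10, 20, 30, 40};
    int *p = a;
    std::cout << *(p + 1) << '\n';                  // 20: p + 1 moves one int forward
    std::cout << (char*)(p + 1) - (char*)p << '\n'; // 4: the same step measured in bytes
}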
A "pointer" points to a location.
When you "increment", you want to go to the next, adjacent location.
Q: "Next" and "adjacent" depend on the size of the object you're pointing to, don't they?
Q: When you don't use "sizeof()", everything works, correct? Why? What do you think the compiler is doing for you, "behind your back"?
Q: What do you think should happen if you add your own "sizeof()" ON TOP OF the "everything works" scenario?
Pointers point to typed objects, so incrementing a pointer p by N makes it point to the Nth element after the one it currently points to.
If you were working with raw byte addresses instead of typed pointers, then adding N*sizeof(type) would be appropriate.

Resources