First, let me explain what I am implementing. The goal of my program is to generate all possible, non-distinct combinations of a given character set on a CUDA-enabled GPU. In order to parallelize the work, I am initializing each thread to a starting character.
For instance, consider the character set abcdefghijklmnopqrstuvwxyz. In this case, there will ideally be 26 threads: characterSet[threadIdx.x] = a for example (in practice, there would obviously be an offset to span the entire grid so that each thread has a unique identifier).
Here is my code thus far:
//Used to calculate grid dimensions
int* threads;
int* blocks;
int* tpb;
int charSetSize;
void calculate_grid_parameters(int length, int size, int* threads, int* blocks, int* tpb){
    //Validate input
    if(!threads || !blocks || !tpb){
        cout << "An error has occurred: Null pointer passed to function...\nPress enter to exit...";
        return;
    }
    const int maxBlocks = 65535; //Does not change
    int maxThreads = 512; //Limit in order to provide more portability
    int dev = 0;
    int maxCombinations;
    cudaDeviceProp deviceProp;
    //Query device
    //cudaGetDeviceProperties(&deviceProp, dev);
    //maxThreads = deviceProp.maxThreadsPerBlock;
    //Determine total threads to spawn
    //Length of password * size of character set
    //Each thread will handle part of the total number of the combinations
    if(length > 3) length = 3; //Max length is 3
    maxCombinations = length * size;
    assert(maxCombinations < (maxThreads * maxBlocks));
}
It is fairly basic.
I've limited length to 3 for a specific reason. The full character set, abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 !\"#$&'()*+-.:;<>=?#[]^_{}~| is, I believe, 92 characters. This means for a length of 3, there are 778,688 possible non-distinct combinations. If it were length 4, then it would be roughly 71 million, and the maximum number of threads for my GPU is about 69 million (in one dimension). Furthermore, these combinations have already been generated in a file that will be read into an array, and each one will then be delegated to a specific initializing thread.
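To double-check these counts, here is a minimal host-side sketch in C++. The helper name `count_combinations` is hypothetical and not part of the program above; it simply raises the character set size to the power of the length, assuming repetition is allowed.

```cpp
#include <cassert>
#include <cstdint>

// Number of strings of the given length over an alphabet of
// alphabet_size symbols when repetition is allowed: alphabet_size^length.
// Hypothetical helper for illustration only.
std::uint64_t count_combinations(std::uint64_t alphabet_size, unsigned length) {
    std::uint64_t total = 1;
    for (unsigned i = 0; i < length; ++i) {
        // Guard against 64-bit overflow before multiplying.
        assert(alphabet_size == 0 || total <= UINT64_MAX / alphabet_size);
        total *= alphabet_size;
    }
    return total;
}
```

With 92 characters this gives 8,464 for length 2, 778,688 for length 3, and 71,639,296 for length 4.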
This leads me to my problem.
The maximum number of blocks on a CUDA GPU (for 1-d) is 65,535. Each of those blocks (on my GPU) can run 1024 threads in one dimension. I've limited it to 512 in my code for portability purposes (this may be unnecessary). Ideally, each block should run 32 threads or a multiple of 32 threads in order to be efficient. The issue I have is how many threads I need. Like I said above, if I am using a full character set of length 3 for the starting values, this necessitates 778,688 threads. This happens to be divisible by 32, yielding 24,334 blocks assuming each block runs 32 threads. However, if I run the same character set with length two, I am left with 264.5 blocks each running 32 threads.
Basically, my character set is variable and the length of the initializing combinations is variable from 1-3.
If I round up to the nearest whole number, my offset index, tid = threadIdx.x + .... will be accessing parts of the array that simply do not exist.
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
Any constructive input is appreciated.

The code you've posted doesn't seem to do anything significant and includes no CUDA code.
Your question appears to be this:
How can I handle this problem in such a way that it will still run efficiently and not spawn unnecessary threads that could potentially cause memory problems?
It's common practice when launching a kernel to "round up" to the nearest increment of threads, perhaps 32, perhaps some multiple of 32, so that an integral number of blocks can be launched. In this case, it's common practice to include a thread check in the kernel code, such as:
__global__ void mykernel(.... int size){
    int idx=threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < size){
        //main body of kernel code here
    }
}
In this case, size is your overall problem size (the number of threads that you actually want). The overhead of the additional threads that are doing nothing is normally not a significant performance issue.
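As a rough illustration of the rounding and the thread check, here is a host-side C++ sketch. The names `blocks_for` and `count_active_threads` are hypothetical; the second function is only a CPU stand-in for the guarded kernel, not real CUDA code.

```cpp
#include <cassert>
#include <cstddef>

// Round-up division: the smallest number of blocks such that
// blocks * threads_per_block >= size.
std::size_t blocks_for(std::size_t size, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (size + threads_per_block - 1) / threads_per_block;
}

// CPU stand-in for the guarded kernel: visits every thread of every block
// and counts how many pass the idx < size check.
std::size_t count_active_threads(std::size_t size, std::size_t threads_per_block) {
    std::size_t active = 0;
    const std::size_t blocks = blocks_for(size, threads_per_block);
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threads_per_block; ++t) {
            std::size_t idx = t + threads_per_block * b;
            if (idx < size) ++active;  // the main kernel body would run here
        }
    }
    return active;
}
```

For the length-2 case in the question (8,464 combinations, 32 threads per block) this rounds 264.5 up to 265 blocks, and only 16 of the 8,480 launched threads fall outside the check.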


Num Threads trade-off in non-parallelizable work

I've been a good boy and parallelized my compute shader to execute 955 threads for 20 iterations
[numthreads(955, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    for (uint i = 0; i < 20; i++)
    {
        //read from and write to groupshared memory
    }
}
But this isn't going to work out (because the parallelization introduces a realtime delay) so I have to do it a less parallel way. The easy way to approach the problem is to have 20 threads doing 955 iterations each
[numthreads(20, 1, 1)]
void main( uint3 pos : SV_DispatchThreadID )
{
    for (uint i = 0; i < 955; i++)
    {
        //read from and write to groupshared memory
    }
}
However, I can't reason about how this is going to perform (probably terribly).
Under this new approach I must keep the number of iterations the same, but I can trade off how often I call the compute shader against the number of threads. That gives me two options:
Increase 20 -> 32 to have a full warp.
Increase 20 -> 32 * n to have warps running in parallel.
Maybe accessing groupshared memory is very cheap and so I don't have a performance problem in the first place.
Maybe I should try to optimize this on the cpu (I've already tried unoptimized and the performance was less than desired).
Someone commented on this answer
To be specific, a single-thread group will generally cap utilization to around 3-6%. Dispatching only one group compounds the issue, capping utilization to well under 1%. Sticking to 256 threads with power-of-two dimension sizes is a good rule of thumb, and you should dispatch at least 2048 or so threads total to keep the hardware busy.
and I decided that doing this work on the gpu is a stupid thing to do. It's always best to look for robust solutions.
The robust solution for my problem is to use SIMD, which I will now have to learn the hard way.

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector, and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
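To make the offset pattern concrete, here is a small CPU reference in C++ that computes one output element per loop iteration, the way one thread would. The function name `cross_flat` is hypothetical; it assumes both inputs use the flat [x, y, z, x, y, z, ...] layout from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the one-output-per-thread indexing scheme.
// Each iteration plays the role of one thread computing res[idx].
std::vector<float> cross_flat(const std::vector<float>& v1,
                              const std::vector<float>& v2) {
    assert(v1.size() == v2.size() && v1.size() % 3 == 0);
    std::vector<float> res(v1.size());
    for (std::size_t idx = 0; idx < res.size(); ++idx) {
        std::size_t m = idx % 3;          // 0 = x, 1 = y, 2 = z
        std::size_t base = idx - m;       // start of this 3D vector
        std::size_t off1 = (m + 1) % 3;   // +1 (modulo 3) pattern
        std::size_t off2 = (m + 2) % 3;   // +2 (modulo 3) pattern
        res[idx] = v1[base + off1] * v2[base + off2]
                 - v1[base + off2] * v2[base + off1];
    }
    return res;
}
```

For example, the pair (1, 0, 0) and (0, 2, 0) yields (0, 0, 2), and (1, 2, 3) with (4, 5, 6) yields (-3, 6, -3).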
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
  __shared__ T sv1[blksize];
  __shared__ T sv2[blksize];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < 3*n){ // grid-stride loop
    // load shared memory using coalesced pattern to global memory
    sv1[threadIdx.x] = vec1[idx];
    sv2[threadIdx.x] = vec2[idx];
    // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
    int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
    int off1 = my_mod+1;
    if (off1 > 2) off1 -= 3;
    int off2 = my_mod+2;
    if (off2 > 2) off2 -= 3;
    // wait until every thread in the block has stored its shared-memory elements
    // (assumes 3*n is a multiple of blockDim.x, so all threads of a block iterate together)
    __syncthreads();
    // each thread loads its computation elements from shared memory
    T t1 = sv1[threadIdx.x-my_mod+off1];
    T t2 = sv2[threadIdx.x-my_mod+off2];
    T t3 = sv1[threadIdx.x-my_mod+off2];
    T t4 = sv2[threadIdx.x-my_mod+off1];
    // wait until every thread has read before the next iteration overwrites shared memory
    __syncthreads();
    // compute result, and store using coalesced pattern, to global memory
    res[idx] = t1*t2-t3*t4;
    idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
  mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
  h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
  h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
  h_res = (mytype *)malloc(N*3*sizeof(mytype));
  cudaMalloc(&d_v1, N*3*sizeof(mytype));
  cudaMalloc(&d_v2, N*3*sizeof(mytype));
  cudaMalloc(&d_res, N*3*sizeof(mytype));
  for (int i = 0; i<N; i++){
    h_v1[3*i] = TV1;
    h_v1[3*i+1] = 0;
    h_v1[3*i+2] = 0;
    h_v2[3*i] = 0;
    h_v2[3*i+1] = TV2;
    h_v2[3*i+2] = 0;
    h_res[3*i] = 0;
    h_res[3*i+1] = 0;
    h_res[3*i+2] = 0;}
  cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
  cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
  // verification
  for (int i = 0; i < N; i++)
    if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) {
      printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2));
      return -1;}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
$ nvcc -o t1003
$ cuda-memcheck ./t1003
no error
========= ERROR SUMMARY: 0 errors
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
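To illustrate why a grid smaller than the problem size still covers every element, here is a CPU simulation of a 1-D grid-stride loop in C++. The function name `grid_stride_visits` is hypothetical; it only counts how often each index is visited and does not model timing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates a 1-D grid-stride loop on the CPU and records how many times
// each element index in [0, n) is visited.
std::vector<int> grid_stride_visits(std::size_t n, std::size_t grid_dim,
                                    std::size_t block_dim) {
    assert(grid_dim > 0 && block_dim > 0);
    std::vector<int> visits(n, 0);
    const std::size_t stride = grid_dim * block_dim;  // gridDim.x * blockDim.x
    for (std::size_t b = 0; b < grid_dim; ++b)
        for (std::size_t t = 0; t < block_dim; ++t)
            for (std::size_t idx = t + block_dim * b; idx < n; idx += stride)
                ++visits[idx];
    return visits;
}
```

With n = 12288 (4096 vectors times 3 components), a grid of 64 blocks of 192 threads visits each index once in a single iteration, while a grid of 8 blocks visits each index once over 8 iterations per thread.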
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.
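The arithmetic above can be reproduced with a few lines of C++. The helper names `kernel_bytes` and `achieved_gb_per_s` are hypothetical, and 1 GB is taken as 1e9 bytes to match the bandwidthTest figures.

```cpp
#include <cassert>

// Bytes moved by the kernel: two input arrays read and one output array
// written, each holding n_vectors * 3 floats.
double kernel_bytes(double n_vectors) {
    return n_vectors * 3.0 * sizeof(float) * 3.0;
}

// Achieved bandwidth in GB/s (1 GB = 1e9 bytes).
double achieved_gb_per_s(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

For N = 40960 the kernel moves 1,474,560 bytes. Dividing by 9.824 microseconds gives about 150 GB/s (roughly 88% of the 171.2 GB/s device-to-device figure), and dividing by 8.99 microseconds gives about 164 GB/s (roughly 96%).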

Fill device array consecutively in CUDA

(This might be more of a theoretical parallel optimization problem than a CUDA-specific problem per se. I'm very new to parallel programming in general, so this may just be personal ignorance.)
I have a workload that consists of 64-bit binary numbers upon which I run analysis. If the analysis completes successfully then that binary number is a "valid solution". If the analysis breaks midway then the number is "invalid". The end goal is to get a list of all the valid solutions.
Now there are many trillions of 64-bit binary numbers I am analyzing, but only ~5% or less will be valid solutions, and they usually come in bunches (i.e. every consecutive 1000 numbers are valid and then every random billion or so are invalid). I can't find a pattern to the space between bunches so I can't ignore the large chunks of invalid solutions.
Currently, every thread in a kernel call analyzes just one number. If the number is valid it denotes it as such in its respective place on a device array. Ditto if it's invalid. So basically I generate a data point for every value analyzed regardless of whether it's valid or not. Then once the array is full I copy it to the host only if a valid solution was found (denoted by a flag on the device). With this, overall throughput is greatest when the array is the same size as the number of threads in the grid.
But copying memory to and from the GPU is expensive time-wise. What I would like to do is copy data over only when necessary: I want to fill up a device array with only valid solutions and, once the array is full, copy it over to the host. But how do you consecutively fill an array in a parallel environment? Or am I approaching this problem the wrong way?
This is the kernel I initially developed. As you can see, I am generating 1 byte of data for each value analyzed. Now I really only need each 64-bit number which is valid; if need be I can make a new kernel. As suggested by some of the commenters, I am currently looking into stream compaction.
__global__ void kValid(unsigned long long*kInfo, unsigned char*values, char *solutionFound) {
    //a 64 bit binary value to be evaluated is called a kValue
    unsigned long long int kStart, kEnd, kRoot, kSize, curK;
    //kRoot is the kValue at the start of device array, this is used if the device array is larger than the total threads in the grid
    //kStart is the kValue to start this kernel call on
    //kEnd is the last kValue to validate
    //kSize is how many bits long is kValue (we don't necessarily use all 64 bits but this value stays constant over the entire chunk of values defined on the host)
    //curK is the current kValue represented as a 64 bit unsigned integer
    int rowCount, kBitLocation, kMirrorBitLocation, row, col, nodes, edges;
    kStart = kInfo[0];
    kEnd = kInfo[1];
    kRoot = kInfo[2];
    nodes = kInfo[3];
    edges = kInfo[4];
    kSize = kInfo[5];
    curK = blockIdx.x*blockDim.x + threadIdx.x + kStart;
    if (curK > kEnd) {//check to make sure you don't overshoot the end value
        return;
    }
    kBitLocation = 1;//assuming the first bit in the kvalue has a position 1;
    for (row = 0; row < nodes; row++) {
        rowCount = 0;
        kMirrorBitLocation = row;//the bit position for the mirrored kvals always starts at the row value (assuming the first row has a position of 0)
        for (col = 0; col < nodes; col++) {
            if (col > row) {
                if (curK & (1ULL << (unsigned long long int)(kSize - kBitLocation))) {//add one to kIterator to convert to counting space
                    rowCount++;
                }
                kBitLocation++;
            }
            if (col < row) {
                if (col > 0) {
                    kMirrorBitLocation += (nodes - 2) - (col - 1);
                }
                if (curK & (1ULL << (unsigned long long int)(kSize - kMirrorBitLocation))) {//if bit is set
                    rowCount++;
                }
            }
        }
        if (rowCount != edges) {
            //set the ith bit to zero
            values[curK - kRoot] = 0;
            return;
        }
    }
    //set the ith bit to one
    values[curK - kRoot] = 1;
    *solutionFound = 1; //not a race condition b/c it will only ever be set to 1 by any thread.
}
(This answer assumes output order is inconsequential and so are the positions of the valid values.)
Conceptually, your analysis produces a set of valid values. The implementation you described uses a dense representation of this set: One bit for every potential value. Yet you've indicated that the data is quite sparse (either 5e-2 or 1000/10^9 = 1e-6); moreover, copying data across PCI express is quite a pain.
Well, then, why not consider a sparse representation? The simplest one would be merely an unordered sequence of the valid values. Of course, writing that requires some synchronization across threads - perhaps even across blocks. Roughly, you can have warps collect their valid values in shared memory; then synchronize at the block level to collect the block's valid values (for a given chunk of the input it has analyzed); and finally use atomics to collect the data from all the blocks.
Oh, also - have each thread analyze multiple values, so you don't have to do that much synchronization.
So, you would want to have each thread analyze multiple numbers (thousands or millions) before it returns from the computation. If a thread analyzes a million numbers, you will only need about 5% of that amount of space to hold the results of that computation.
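The idea of a sparse output list filled through an atomic counter can be sketched on the CPU in C++. Everything here is an assumption for illustration: `SolutionList`, `is_valid` (a stand-in that marks about 5% of values valid), and `analyze_range` are hypothetical names, and `std::atomic::fetch_add` plays the role that `atomicAdd` would play on the device.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sparse output: a fixed-capacity array plus an atomic fill counter.
struct SolutionList {
    std::vector<std::uint64_t> values;
    std::atomic<std::size_t> count{0};
    explicit SolutionList(std::size_t capacity) : values(capacity) {}
};

// Hypothetical stand-in for the real analysis: about 5% of values pass.
bool is_valid(std::uint64_t k) { return k % 20 == 0; }

// One worker analyzes a whole range, buffers its hits locally, then
// reserves a contiguous slice of the shared array with one atomic add.
// Returns false if the shared array cannot hold this batch; in that case
// the counter has already advanced and the caller should treat the list
// as full.
bool analyze_range(std::uint64_t first, std::uint64_t last, SolutionList& out) {
    std::vector<std::uint64_t> local;
    for (std::uint64_t k = first; k < last; ++k)
        if (is_valid(k)) local.push_back(k);
    const std::size_t start = out.count.fetch_add(local.size());
    if (start + local.size() > out.values.size()) return false;
    for (std::size_t i = 0; i < local.size(); ++i)
        out.values[start + i] = local[i];
    return true;
}
```

Because each worker touches the shared counter only once per range, the synchronization cost is amortized over many analyzed values, which is the point of having each thread analyze multiple numbers. The order of the stored values depends on which worker reserves its slice first.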

Strange pointer arithmetic

I came across some very strange pointer arithmetic behaviour. I am developing a program to access an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector on my SD card contains data (in hex) like the following (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
Printing the individual bytes works perfectly.
char *ch = buffer; /* char buffer[512]; */
for(i=0; i<12; i++)
debug("%x ", *ch++);
Here the debug function sends its output over the UART.
However, pointer arithmetic, specifically adding a number that is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried optimization options -0 and -s, but there was no change.
I could think of little/big endian, but here I am getting unexpected data (from previous bytes) and no order reversing.
For this to work, your CPU architecture must support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using STM32, which is an ARM-based cortex).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what's the address of your buffer, but at least one of the three uint32_t read attempts that you describe in your question, requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t  x = *(uint8_t*)y;  // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
typedef struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
}
myStruct_t;
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
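Both rules can be checked mechanically. The following C++ sketch is an illustration under assumptions: the struct name `MyStruct`, its field names, and the helper `is_aligned` are hypothetical, and the compile-time checks assume a typical ABI where the fixed-width types have the sizes their names suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout following rule #2: every field sits at an offset that is
// divisible by the size of that field.
struct MyStruct {
    std::uint16_t a;  // offset 0
    std::uint8_t  b;  // offset 2
    std::uint8_t  c;  // offset 3
    std::uint32_t d;  // offset 4
    std::uint64_t e;  // offset 8
};

// Compile-time checks of rule #2 for the multi-byte fields.
static_assert(offsetof(MyStruct, a) % sizeof(std::uint16_t) == 0, "a misaligned");
static_assert(offsetof(MyStruct, d) % sizeof(std::uint32_t) == 0, "d misaligned");
static_assert(offsetof(MyStruct, e) % sizeof(std::uint64_t) == 0, "e misaligned");

// Run-time check of rule #1: is the address p divisible by n?
bool is_aligned(const void* p, std::size_t n) {
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}
```

A call such as `is_aligned(ptr, 8)` before casting a raw buffer to `MyStruct*` would catch the kind of misaligned access described in the question.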
LDR is the ARM instruction to load data. You have lied to the compiler that the pointer is a 32bit value. It is not aligned properly. You pay the price. Here is the LDR documentation,
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically your code is like,
p = (uint32_t*)((char*)buffer + 0); // the word-aligned address is what actually gets loaded
uint32_t v = *p;
v = (v>>16)|(v<<16); // rotate right by 8 * 2 bits
debug("%x ", v);
but is encoded as one instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen.
In this case, the reason two of your examples behaved as you expected is because you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
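Wrapped up as helpers, the same idea might look like the C++ sketch below. The names `read_u32` and `load_le32` are hypothetical. The first is the memcpy approach shown above; the second assembles the value byte by byte, which additionally removes the dependence on host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies 4 bytes from any address into a properly aligned uint32_t.
// The numeric result depends on the host byte order.
std::uint32_t read_u32(const unsigned char* p) {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}

// Assembles a little-endian 32-bit value byte by byte. Independent of
// both alignment and host byte order.
std::uint32_t load_le32(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}
```

With the sector bytes from the question, `load_le32` gives 0x34012afe at offset 0 and 0x35aa4521 at offset 4, matching the two outputs the question marks as correct, and 0x45213401 at offset 2.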

Fast way to find the maximum of a float array in OpenCL

I'm having trouble with the simple task of finding the maximum of an array in OpenCL.
__kernel void ndft(/* lots of stuff*/)
{
    size_t thread_id = get_global_id(0); // thread_id = [0 .. spectrum_size[
    // Now I have float spectrum_abs[spectrum_size] and
    // I want the maximum as well as the index holding the maximum
    // this is the old, sequential code:
    for (size_t i = 0; i < spectrum_size; i++)
    {
        if (*current_max_value < spectrum_abs[i])
        {
            *current_max_value = spectrum_abs[i];
            *current_max_freq = i;
        }
    }
}
Now I could add if (thread_id == 0) and loop through the entire thing as I would do on a single core system, but since performance is a critical issue (otherwise I wouldn't be doing spectrum calculations on a GPU), is there a faster way to do that?
Returning to the CPU at the end of the kernel above is not an option, because the kernel actually continues after that.
You will need to write a parallel reduction. Split your "large" array into small pieces (a size a single workgroup can effectively process) and compute the min-max in each.
Do this iteratively (involves both host and device code) till you are left with only one set of min/max values.
Note that you might need to write a separate kernel that does this unless the current work-distribution works for this piece of the kernel (see my question to you above).
An alternative if your current work distribution is amenable is to find the min max inside of each workgroup and write it to a buffer in global memory (index = local_id). After a barrier(), simply make the kernel running on thread_id == 0 loop across the reduced results and find the max in it. This will not be the optimal solution, but might be one that fits inside your current kernel.
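The two-stage structure described above can be sketched on the CPU in C++. This is only an illustration of the data flow, not OpenCL code: the function name `argmax_two_stage` is hypothetical, each chunk of `group_size` elements stands in for one workgroup, and the vector `partial` stands in for the buffer of per-group results in global memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stage 1: each "workgroup" of group_size elements finds its local maximum
// and the index holding it. Stage 2: a single pass over the per-group
// results picks the overall maximum. Ties keep the lowest index.
std::pair<float, std::size_t> argmax_two_stage(const std::vector<float>& data,
                                               std::size_t group_size) {
    assert(!data.empty() && group_size > 0);
    std::vector<std::pair<float, std::size_t>> partial;
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        const std::size_t end = std::min(start + group_size, data.size());
        std::pair<float, std::size_t> best(data[start], start);
        for (std::size_t i = start + 1; i < end; ++i)
            if (best.first < data[i]) best = std::make_pair(data[i], i);
        partial.push_back(best);
    }
    std::pair<float, std::size_t> result = partial[0];
    for (std::size_t g = 1; g < partial.size(); ++g)
        if (result.first < partial[g].first) result = partial[g];
    return result;
}
```

On a GPU, stage 1 would run in parallel across workgroups and stage 2 would be the short loop executed by a single work-item after the barrier.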
