Cannot use cuMemcpyHtoDAsync and cuMemcpyDtoHAsync at the same time - performance

I have a rather strange observation about the following code snippet.
When I do both (copy memory to the device and copy the results back to the host), the streams seem to be synchronized, i.e. they execute the kernels sequentially.
If I remove the copy to the host and keep copying the parameters to the device, the streams execute in parallel;
if I remove copying the parameters and keep copying the results, the streams also execute in parallel.
Any idea why, and how to solve the problem?
    for (int j=0; j<n_streams; j++) {
        cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
        Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[] {10000}),
            Pointer.to(gpu_data),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
        );
        cuLaunchKernel(function[j],
            s_grid, 1, 1, // Grid dimension
            s_block, 1, 1, // Block dimension
            0, stream[j], // Shared memory size and stream
            kernelParameters1, null // Kernel- and extra parameters
        );
        cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
    }

I have no idea why, but changing the order of the calls removed the problem: the streams now execute in parallel.
    for (int j=0; j<n_streams; j++) {
        cuMemcpyHtoDAsync(gpu_parameters[j], parameters[j].asPointer(), (parameterCount) * Sizeof.FLOAT, stream[j]);
    }
    for (int j=0; j<n_streams; j++) {
        Pointer kernelParameters1 = Pointer.to(
            Pointer.to(new int[]{0}),
            Pointer.to(new int[] {getNPrices()}),
            Pointer.to(get_gpu_prices()),
            Pointer.to(gpu_results[j]),
            Pointer.to(gpu_parameters[j])
            //,Pointer.to(new int[]{0})
        );
        cuLaunchKernel(function[j],
            s_grid, 1, 1, // Grid dimension
            s_block, 1, 1, // Block dimension
            0, stream[j], // Shared memory size and stream
            kernelParameters1, null // Kernel- and extra parameters
        );
    }
    for (int j=0; j<n_streams; j++) {
        cuMemcpyDtoHAsync(results[j].asPointer(), gpu_results[j], (results[j].size()) * Sizeof.FLOAT, stream[j]);
    }
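
The two versions differ only in the order in which the operations are issued to the streams. The sketch below is plain C++ with no CUDA calls; Op, depthFirstOrder, breadthFirstOrder and describe are hypothetical names used only for illustration. It lists both issue orders for a given number of streams. Why the second order overlaps better is not established in this thread. One commonly cited explanation, assumed here and not verified for this hardware, is that some GPUs feed all streams through a single hardware queue, so a copy queued behind one stream's kernel can hold back the next stream's work.

```cpp
#include <string>
#include <vector>

// One queued operation: the stream it is issued to and what it does.
struct Op {
    int stream;
    std::string kind;  // "HtoD", "kernel" or "DtoH"
};

// Order used by the first snippet: everything for stream 0, then stream 1, ...
std::vector<Op> depthFirstOrder(int nStreams) {
    std::vector<Op> order;
    for (int j = 0; j < nStreams; ++j) {
        order.push_back({j, "HtoD"});
        order.push_back({j, "kernel"});
        order.push_back({j, "DtoH"});
    }
    return order;
}

// Order used by the second snippet: all uploads, then all kernels, then all downloads.
std::vector<Op> breadthFirstOrder(int nStreams) {
    std::vector<Op> order;
    for (const char* kind : {"HtoD", "kernel", "DtoH"}) {
        for (int j = 0; j < nStreams; ++j) {
            order.push_back({j, kind});
        }
    }
    return order;
}

// Render an issue order as text, e.g. "HtoD[0] kernel[0] DtoH[0]".
std::string describe(const std::vector<Op>& order) {
    std::string text;
    for (const Op& op : order) {
        if (!text.empty()) text += ' ';
        text += op.kind + "[" + std::to_string(op.stream) + "]";
    }
    return text;
}
```

Calling describe(depthFirstOrder(2)) and describe(breadthFirstOrder(2)) side by side makes the reordering visible: in the second order no stream's download is queued ahead of another stream's upload or kernel.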

Cannot understand how to recursively merge sort

Currently self-learning C++ with Daniel Liang's Introduction to C++.
On the topic of the merge sort, I cannot seem to understand how his code is recursively calling itself.
I understand the general concept of the merge sort, but I am having trouble understanding this code specifically.
In this example, we first pass the list 1, 7, 3, 4, 9, 3, 3, 1, 2, and its size (9) to the mergeSort function.
From there, we divide the list into two until the array size reaches 1. In this case, we would get: 1,7,3,4 -> 1,7 -> 1. We then move onto the merge sorting the second half. The second half array would be 7 in this case. We merge the two arrays [1] and [7] and proceed to delete the two arrays that were dynamically allocated to prevent any memory leak.
The part I don't understand is how does this code run from here? After delete[] firstHalf and delete[] secondHalf. From my understanding, shouldn't there be another mergeSort function call in order to merge sort the new firstHalf and secondHalf?
    #include <iostream>
    using namespace std;

    // Function prototype
    void arraycopy(int source[], int sourceStartIndex,
                   int target[], int targetStartIndex, int length);
    void merge(int list1[], int list1Size,
               int list2[], int list2Size, int temp[]);

    // The function for sorting the numbers
    void mergeSort(int list[], int arraySize)
    {
        if (arraySize > 1)
        {
            // Merge sort the first half
            int* firstHalf = new int[arraySize / 2];
            arraycopy(list, 0, firstHalf, 0, arraySize / 2);
            mergeSort(firstHalf, arraySize / 2);

            // Merge sort the second half
            int secondHalfLength = arraySize - arraySize / 2;
            int* secondHalf = new int[secondHalfLength];
            arraycopy(list, arraySize / 2, secondHalf, 0, secondHalfLength);
            mergeSort(secondHalf, secondHalfLength);

            // Merge firstHalf with secondHalf
            merge(firstHalf, arraySize / 2, secondHalf, secondHalfLength,
                  list);

            delete [] firstHalf;
            delete [] secondHalf;
        }
    }

    void merge(int list1[], int list1Size,
               int list2[], int list2Size, int temp[])
    {
        int current1 = 0; // Current index in list1
        int current2 = 0; // Current index in list2
        int current3 = 0; // Current index in temp

        while (current1 < list1Size && current2 < list2Size)
        {
            if (list1[current1] < list2[current2])
                temp[current3++] = list1[current1++];
            else
                temp[current3++] = list2[current2++];
        }

        while (current1 < list1Size)
            temp[current3++] = list1[current1++];

        while (current2 < list2Size)
            temp[current3++] = list2[current2++];
    }

    void arraycopy(int source[], int sourceStartIndex,
                   int target[], int targetStartIndex, int length)
    {
        for (int i = 0; i < length; i++)
        {
            target[i + targetStartIndex] = source[i + sourceStartIndex];
        }
    }

    int main()
    {
        const int SIZE = 9;
        int list[] = {1, 7, 3, 4, 9, 3, 3, 1, 2};
        mergeSort(list, SIZE);
        for (int i = 0; i < SIZE; i++)
            cout << list[i] << " ";
        return 0;
    }

"From my understanding, shouldn't there be another mergeSort function call in order to merge sort the new firstHalf and secondHalf?"
It happens implicitly through the recursive calls. When you reach these two lines:
    delete [] firstHalf;
    delete [] secondHalf;
one call to mergeSort has completed. If that call was sorting a first half, control returns to its caller and continues on the line after the call, i.e. these lines:
    // Merge sort the second half
    int secondHalfLength = arraySize - arraySize / 2;
    ...
But if that call was sorting a second half, control returns to the line just after that call, i.e. these lines:
    // Merge firstHalf with secondHalf
    merge(firstHalf, arraySize / 2, secondHalf, secondHalfLength,
          list);
And everything proceeds as planned.
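
To watch control return to the caller, it can help to log every entry and exit with the recursion depth. The following is a minimal sketch, not the book's code: it uses std::vector instead of new/delete, and traceLog and tracedMergeSort are hypothetical names introduced for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Log of "enter"/"leave" events, indented two spaces per recursion level.
std::vector<std::string> traceLog;

void tracedMergeSort(std::vector<int>& list, int depth = 0) {
    const std::string indent(static_cast<std::size_t>(depth) * 2, ' ');
    traceLog.push_back(indent + "enter size " + std::to_string(list.size()));

    if (list.size() > 1) {
        const auto mid = list.begin() + static_cast<std::ptrdiff_t>(list.size() / 2);
        std::vector<int> firstHalf(list.begin(), mid);
        std::vector<int> secondHalf(mid, list.end());

        tracedMergeSort(firstHalf, depth + 1);
        // When the call above returns, execution simply continues here.
        tracedMergeSort(secondHalf, depth + 1);
        // When the call above returns, execution continues with the merge.

        std::size_t i = 0, j = 0, k = 0;
        while (i < firstHalf.size() && j < secondHalf.size()) {
            list[k++] = (firstHalf[i] < secondHalf[j]) ? firstHalf[i++] : secondHalf[j++];
        }
        while (i < firstHalf.size()) list[k++] = firstHalf[i++];
        while (j < secondHalf.size()) list[k++] = secondHalf[j++];
    }

    traceLog.push_back(indent + "leave size " + std::to_string(list.size()));
}
```

After sorting the nine-element list from the question, printing traceLog line by line shows each "leave" followed by either the sibling call's "enter" or the parent's "leave", which is the return-to-caller behaviour described above.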

Remove object from 2D Array- Processing

I'm creating a simple space invaders game. I'm looking to delete one of the invaders once they are hit by a bullet. The invaders are made up of a 2D array of images and I've tested the collision between the image and the bullet (in an ArrayList) and that works fine. So the game detects a collision, the next step is to delete the correct object that has been hit. I'm a little confused as to how to correctly correspond where the bullet hits to which object it has hit in the 2D array, and then deleting it from the Array and carrying on with the game.
Below is how I created the invader array in setup()
    for(int i=0; i<2; i++){
        for(int j=0; j<4; j++){
            invArray[j][i]= new Taxi(taxiX, taxiY);
            taxiX= taxiX+ 100;
        }
        taxiX=20;
        taxiY= taxiY+ 140;
    }
I then filled the 2D Array with images in draw()
    for(int i=0; i<2; i++){
        for(int j=0; j<4; j++){
            invArray[j][i].update();
            if(invArray[j][i].y>=600){
                invArray[j][i].y= 0;
                invArray[j][i].render();
            }
        }
    }

You're using arrays, which have a fixed size.
In theory you might be able to use array helper functions like shorten() and expand(), but you really have to watch your counters and array structure.
In practice, for a beginner, I would say this is error-prone.
It might be simpler (but hackier) to set the array element of the hit invader to null,
then simply check that the invader is not null before testing collisions, rendering, etc.
e.g. in draw():
    for(int i=0; i<2; i++){
        for(int j=0; j<4; j++){
            if(invArray[j][i] != null){
                invArray[j][i].update();
                if(invArray[j][i].y>=600){
                    invArray[j][i].y= 0;
                    invArray[j][i].render();
                }
            }
        }
    }
Another option is to use an ArrayList which has a dynamic size.
e.g.
ArrayList<Taxi> invaders = new ArrayList<Taxi>();
In setup you'd do something similar:
    for(int i=0; i<2; i++){
        for(int j=0; j<4; j++){
            invaders.add(new Taxi(taxiX, taxiY));
            taxiX= taxiX+ 100;
        }
        taxiX=20;
        taxiY= taxiY+ 140;
    }
then in draw():
    for(int i = 0 ; i < invaders.size(); i++){
        Taxi t = invaders.get(i);
        t.update();
        if(t.y>=600){
            t.y= 0;
            t.render();
        }
        /*
        if(YOUR_HIT_CONDITION_HERE){
            invaders.remove(t);
        }
        */
    }
It's a bit tricky to go back and forth between 1D and 2D arrays/indexing at the beginning, but it's not that bad once you get the hang of it.
To convert from 2D x,y to a 1D index:
    int index = x + y * width;
(where x and y are your counters and width is the width of your grid, i.e. the number of columns).
The other way around, from a 1D index to 2D x,y:
    int x = index % width;
    int y = index / width;
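
As a quick check that the two conversions are inverses of each other, here is a small C++ sketch (the answer's snippets are Processing; toIndex, toCell and Cell are hypothetical names for illustration). It assumes a row-major grid and non-negative coordinates.

```cpp
// A grid position: x is the column, y is the row.
struct Cell {
    int x;
    int y;
};

// 2D (x, y) -> 1D index in a row-major grid that is `width` columns wide.
int toIndex(int x, int y, int width) {
    return x + y * width;
}

// 1D index -> 2D (x, y) in the same grid.
Cell toCell(int index, int width) {
    return Cell{index % width, index / width};
}
```

For the 4-column invader grid from the question, the invader in column 2 of row 1 has index 2 + 1 * 4 = 6, and toCell(6, 4) gives back x = 2, y = 1.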

Try to decouple the hit detection from removing elements from the ArrayList, for example by setting a flag on the hit invader and removing the flagged elements at the end of the draw loop. Use the list's size() as the loop limit wherever you iterate over it. That may solve your problem with hit detection; you may also need a counter.
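
The flag-then-remove idea can be sketched as follows. This is C++ rather than Processing, with std::vector standing in for ArrayList; Invader, markHits and removeHit are hypothetical names, and the square hit test is only a placeholder for whatever collision check the game already uses.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Invader {
    float x;
    float y;
    bool hit;  // set during collision detection, acted on later
};

// Pass 1: only mark invaders touched by the bullet; the list is not modified.
void markHits(std::vector<Invader>& invaders, float bulletX, float bulletY, float radius) {
    for (Invader& inv : invaders) {
        if (std::fabs(inv.x - bulletX) <= radius && std::fabs(inv.y - bulletY) <= radius) {
            inv.hit = true;
        }
    }
}

// Pass 2: at the end of the frame, drop every flagged invader in one go.
void removeHit(std::vector<Invader>& invaders) {
    invaders.erase(
        std::remove_if(invaders.begin(), invaders.end(),
                       [](const Invader& inv) { return inv.hit; }),
        invaders.end());
}
```

Because nothing is removed while the list is being iterated, no element is skipped and no index goes out of range during the update loop.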

MPI Latency measuring

I am trying to understand some aspects of the MPI.
While writing a program to measure the latency between send/recv of two processes, I ran into some strange effects.
When I measured over many iterations, the result matched other benchmarks. Then I decided to print the value after each iteration and was surprised: the values alternate between the same four numbers. I also noticed some occasional very high values.
The code that calculates the latency, and some sample values, are below:
    int main()
    {
        MPI::Init();
        Proc_Rank = MPI::COMM_WORLD.Get_rank();
        for(int i = 0; i < 100; ++i)
            latency_test(Proc_Rank, 1, 0);
        MPI::Finalize();
        return 0;
    }

    void latency_test(int Proc_Rank, int Iterations_Num, int Size)
    {
        double Total_Time, Latency;
        double t1, t2;
        char *Send_Buffer = new char[Size];
        char *Recv_Buffer = new char[Size];
        for(int i = 0; i < Size; i++){
            Send_Buffer[i] = 'a';
        }
        for(int i = 0; i < Size; i++){
            Recv_Buffer[i] = 'b';
        }
        MPI::COMM_WORLD.Barrier();
        t1 = MPI::Wtime();
        for(int i = 0; i < Iterations_Num; i++){
            if (Proc_Rank == 0){
                MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 1, 0);
                MPI::COMM_WORLD.Recv(Recv_Buffer,Size,MPI::CHAR,1,
                                     MPI::ANY_TAG);
            }
            else if (Proc_Rank==1) {
                MPI::COMM_WORLD.Recv(Recv_Buffer,Size,MPI::CHAR,0,MPI::ANY_TAG);
                MPI::COMM_WORLD.Send(Send_Buffer, Size, MPI::CHAR, 0, 0);
            }
        }
        t2 = MPI::Wtime();
        delete []Send_Buffer;
        delete []Recv_Buffer;
        Total_Time = t2-t1;
        if(Proc_Rank == 0){
            Latency = (Total_Time / (Iterations_Num * 2.0)) * 1000000.0;
            printf("%10.10f\n", Latency);
        }
    }
Part of the result:
5.4836273193
1.0728836060
0.9536743164
1.0728836060
0.4768371582
0.9536743164
0.5960464478
6.5565109253
0.9536743164
0.9536743164
1.0728836060
0.5960464478
0.4768371582
0.4768371582
Why do the same four values keep recurring in random order? And why are there occasional very large values?

As pointed out by Zulan, the resolution of the timer used by MPI_Wtime is not infinite. You can query the timer resolution by calling MPI_Wtick (MPI::Wtick in the C++ bindings). Measuring a single ping-pong round that lasts less than a microsecond is prone to very high statistical uncertainty, especially since the OS jitter, which is the random delay of the process execution due to other OS activities or processes being scheduled on the same CPU, could be several microseconds. No respectable MPI benchmark would do a single ping-pong round with empty messages.
As a side note, you are using a wildcard receive (MPI_ANY_TAG) in one of the processes. Those tend to be slower than fully-specified receives, especially when it comes to network equipment.
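
The limited resolution is visible in the numbers themselves: every latency printed in the question is a whole multiple of 2^-23 s, about 0.1192 µs. One reading consistent with that, offered as an assumption because the thread does not say which clock the MPI library uses, is that the timer returns seconds since the Unix epoch as a double. For epoch times between 2^30 and 2^31 seconds, adjacent doubles are 2^-22 s apart, so t2 - t1 is a multiple of 2^-22 s, and dividing by 2.0 yields multiples of 2^-23 s. The helper names below are hypothetical.

```cpp
#include <cmath>

// 2^-23 seconds, expressed in microseconds (about 0.1192).
double quantumMicroseconds() {
    return std::ldexp(1.0, -23) * 1e6;
}

// Number of 2^-23 s steps contained in a printed latency (given in microseconds).
double quantaIn(double latencyUs) {
    return latencyUs / quantumMicroseconds();
}

// True if the latency is a whole multiple of 2^-23 s within the printed precision.
bool isWholeMultiple(double latencyUs, double tolerance = 1e-6) {
    const double q = quantaIn(latencyUs);
    return std::fabs(q - std::round(q)) < tolerance;
}
```

Applied to the sample output, 0.4768371582, 0.5960464478, 0.9536743164 and 1.0728836060 come out as 4, 5, 8 and 9 steps, and the two outliers 5.4836273193 and 6.5565109253 as 46 and 55 steps. In other words, the four recurring values are elapsed times of roughly 1 µs and 2 µs rounded to the nearest representable double, not four distinct latencies.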

bootloader avr atmega128RFA1

I am also working on the bootloader.
My problem is the following:
Once the command 'B' is received, followed by 'F', I call the block load:
    static void start_block_flash_load(uint16_t size, uint32_t *addr) {
        uint16_t data_word;
        uint8_t sreg = SREG;
        uint16_t temp;
        int i;
        uint8_t my_size;
        fprintf(lcdout, "B");
        cli(); // Disable interrupts
        (*addr) <<= 1;
        if (size <= SPM_PAGESIZE) {
            boot_page_erase(*addr);
            boot_spm_busy_wait();
            fprintf(lcdout, "%"PRIu16, size);
            uint16_t i;
            //store all values. PROBLEM here!!!
            my_size = 208;
            uint8_t buf[SPM_PAGESIZE] = { 0 };
            for (i = 0; i < my_size; i++) {
            //for (i=0; i<size; i++){
                buf[i] = uart_getc();
                // lcd_clear();
                // lcd_setCursor(0, 2);
                // fprintf(lcdout, "%3d", i);
                // _delay_ms(500);
            }
            for (i = 0; i < my_size; i += 2) { //if size is odd, then use do-while
                uint16_t w = buf[i];
                w += buf[i + 1] << 8; //first one is low byte, second is high???
                boot_page_fill((*addr)+i, w);
            }
            boot_page_write(*addr);
            boot_spm_busy_wait();
            (*addr) >>= 1;
            uart_putc('\r');
        } else
            uart_putc('?');
        boot_rww_enable ();
        SREG = sreg;
    }
I can see on the LCD that the size of the block is 256. However, the program gets stuck in the loop that collects the data.
I experimented with my_size and found that the program only runs further if my_size = 208.
The strange thing is that if I put some statements inside the loop, e.g.
    lcd_clear();
    lcd_setCursor(0, 2);
then 'i', which I print on the LCD, does not get past 140-something. With different statements, 'i' stops at a different value. That is very strange, since uart_getc() should not lose data.
What I expect is that the loop runs up to 256. I cannot figure out what is happening there.
Please help if you have any idea.
Thanks

Passing parameter fail by CreateThread?

My code looks like this.
starts is an array of DWORD32
threads is an array of HANDLE
    void initThreads(HANDLE* threads, int size)
    {
        DWORD32* starts = (DWORD32*)malloc(sizeof(DWORD32) * size);
        for (int i = 0; i < size; ++i)
        {
            starts[i] = num_steps / numThreads * i;
        }
        for (int i = 0; i < size; ++i)
        {
            DWORD32* para = starts + i;
            printf("create %ld\n", *para);
            threads[i] = CreateThread(NULL, 0, portionCal, (void*)para, 0, NULL);
        }
        free(starts);
    }

    DWORD WINAPI portionCal(LPVOID pArg)
    {
        double x, portionSum = 0.0;
        DWORD32 start = *(DWORD32*)pArg;
        printf("start at %d\n", start);
    }
But the output is:
create 0
create 25000000
start at 0
create 50000000
create 75000000
start at 50000000
start at -17891602
start at 25000000
Why does the result look like this?

We can't see the scope of starts but this can be guessed at from the failure. It is probably a local variable, long gone when the thread starts running. So you'll just read garbage. You'll need a stable pointer, get one from a global variable or malloc().
After edit: don't call free() like that. It has to remain stable until all threads have completed using it. You could consider reference counting it with InterlockedDecrement().

You free the starts array immediately after creating the threads. So what happens is that the threads are passed pointers to memory that may have been freed before the threads have a chance to read it. If that happens, the resulting behaviour is undefined.
You can resolve the problem by ensuring that the memory referred to by the pointers has a lifetime that extends beyond that of the threads. Typically you do that by allocating off the heap the data for each thread, and letting the thread call free when it has taken a copy of the information.
In this case, the easier way to resolve the problem is to pass the integer value rather than a pointer to it. Like this:
    threads[i] = CreateThread(NULL, 0, portionCal, (LPVOID)(ULONG_PTR)starts[i], 0, NULL);
And in your thread:
    DWORD32 start = (DWORD32)(ULONG_PTR)pArg;
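
Passing the value instead of a pointer works because a 32-bit integer always fits in the pointer-sized thread argument, so the thread never touches the freed array. Below is a portable C++ sketch of that round trip. It uses std::uintptr_t and std::uint32_t rather than the Win32 types, and packStart and unpackStart are hypothetical helper names.

```cpp
#include <cstdint>

// Store a 32-bit start offset directly in the pointer-sized argument.
// The result is never dereferenced; it only carries the number.
void* packStart(std::uint32_t start) {
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(start));
}

// Recover the start offset inside the thread function.
std::uint32_t unpackStart(void* pArg) {
    return static_cast<std::uint32_t>(reinterpret_cast<std::uintptr_t>(pArg));
}
```

Going through a pointer-sized integer type in both directions avoids pointer-truncation warnings on 64-bit builds, where a pointer is wider than the 32-bit value.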
