I'm currently doing a project with CUDA where a pipeline is refreshed with 200-10000 new events every 1 ms. Each time, I want to call one (or two) kernels which compute a small list of outputs, then feed those outputs to the next element of the pipeline.
The theoretical flow is (a rough sketch of one iteration follows the list):
receive data in an std::vector
cudaMemcpy the vector to GPU
processing
generate small list of outputs
cudaMemcpy to the output std::vector
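A hypothetical sketch of one such iteration on the host side (receive_events, compute_outputs and the float output type are placeholder names for illustration, not from the real project):
// Rough sketch of a single 1 ms tick; error checking omitted.
void tick(Event *d_in, float *d_out, std::vector<float> &h_out, dim3 grid, dim3 block) {
    std::vector<Event> h_in = receive_events();            // 200-10000 new events
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(Event), cudaMemcpyHostToDevice);
    compute_outputs<<<grid, block>>>(d_in, (int)h_in.size(), d_out);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost); // blocks until the kernel finishes
}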
But when I call cudaDeviceSynchronize on a 1-block/1-thread empty kernel with no processing, it already takes on average 0.7 to 1.4 ms, which is already higher than my 1 ms timeframe.
I could eventually change the timeframe of the pipeline to receive events every 5 ms, with 5x as many events each time, but that wouldn't be ideal.
What would be the best way to minimize the overhead of cudaDeviceSynchronize? Could streams be helpful in this situation, or is there another way to run the pipeline efficiently?
(Jetson TK1, compute capability 3.2)
Here's an nvprof log of the application:
==8285== NVPROF is profiling process 8285, command: python player.py test.rec
==8285== Profiling application: python player.py test.rec
==8285== Profiling result:
Time(%) Time Calls Avg Min Max Name
94.92% 47.697ms 5005 9.5290us 1.7500us 13.083us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
5.08% 2.5538ms 8 319.23us 99.750us 413.42us [CUDA memset]
==8285== API calls:
Time(%) Time Calls Avg Min Max Name
75.00% 5.03966s 5005 1.0069ms 25.083us 11.143ms cudaDeviceSynchronize
17.44% 1.17181s 5005 234.13us 83.750us 3.1391ms cudaLaunch
4.71% 316.62ms 9 35.180ms 23.083us 314.99ms cudaMalloc
2.30% 154.31ms 50050 3.0830us 1.0000us 2.6866ms cudaSetupArgument
0.52% 34.857ms 5005 6.9640us 2.5000us 464.67us cudaConfigureCall
0.02% 1.2048ms 8 150.60us 71.917us 183.33us cudaMemset
0.01% 643.25us 83 7.7490us 1.3330us 287.42us cuDeviceGetAttribute
0.00% 12.916us 2 6.4580us 2.0000us 10.916us cuDeviceGetCount
0.00% 5.3330us 1 5.3330us 5.3330us 5.3330us cuDeviceTotalMem
0.00% 4.0830us 1 4.0830us 4.0830us 4.0830us cuDeviceGetName
0.00% 3.4160us 2 1.7080us 1.5830us 1.8330us cuDeviceGet
A small reproduction of the program (nvprof log at the end) - for some reason, the average of cudaDeviceSynchronize is 4 times lower here, but it's still really high for an empty 1-thread kernel:
/* Compile with `nvcc test.cu -I.`
* with -I pointing to "helper_cuda.h" and "helper_string.h" from CUDA samples
**/
#include <iostream>
#include <cuda.h>
#include <helper_cuda.h>
#define MAX_INPUT_BUFFER_SIZE 131072
typedef struct {
unsigned short x;
unsigned short y;
short a;
long long b;
} Event;
long long *d_a_[2], *d_b_[2];
float *d_as_, *d_bs_;
bool *d_some_bool_[2];
Event *d_data_;
int width_ = 320;
int height_ = 240;
__global__ void reset_timesurface(long long ts,
long long *d_a_0, long long *d_a_1,
long long *d_b_0, long long *d_b_1,
float *d_as, float *d_bs,
bool *d_some_bool_0, bool *d_some_bool_1, Event *d_data) {
// nothing here
}
void reset_errors(long long ts) {
static const int n = 1024;
static const dim3 grid_size(width_ * height_ / n
+ (width_ * height_ % n != 0), 1, 1);
static const dim3 block_dim(n, 1, 1);
reset_timesurface<<<1, 1>>>(ts, d_a_[0], d_a_[1],
d_b_[0], d_b_[1],
d_as_, d_bs_,
d_some_bool_[0], d_some_bool_[1], d_data_);
cudaDeviceSynchronize();
// static long long *h_holder = (long long*)malloc(sizeof(long long) * 2000);
// cudaMemcpy(h_holder, d_a_[0], 0, cudaMemcpyDeviceToHost);
}
int main(void) {
checkCudaErrors(cudaMalloc(&(d_a_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_a_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_a_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[0]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[0], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_b_[1]), sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMemset(d_b_[1], 0, sizeof(long long)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_as_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_as_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_bs_, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMemset(d_bs_, 0, sizeof(float)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[0]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[0], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&(d_some_bool_[1]), sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMemset(d_some_bool_[1], 0, sizeof(bool)*width_*height_*2));
checkCudaErrors(cudaMalloc(&d_data_, sizeof(Event)*MAX_INPUT_BUFFER_SIZE));
for (int i = 0; i < 5005; ++i)
reset_errors(16487L);
cudaFree(d_a_[0]);
cudaFree(d_a_[1]);
cudaFree(d_b_[0]);
cudaFree(d_b_[1]);
cudaFree(d_as_);
cudaFree(d_bs_);
cudaFree(d_some_bool_[0]);
cudaFree(d_some_bool_[1]);
cudaFree(d_data_);
cudaDeviceReset();
}
/* nvprof ./a.out
==9258== NVPROF is profiling process 9258, command: ./a.out
==9258== Profiling application: ./a.out
==9258== Profiling result:
Time(%) Time Calls Avg Min Max Name
92.64% 48.161ms 5005 9.6220us 6.4160us 13.250us reset_timesurface(__int64, __int64*, __int64*, __int64*, __int64*, float*, float*, bool*, bool*, Event*)
7.36% 3.8239ms 8 477.99us 148.92us 620.17us [CUDA memset]
==9258== API calls:
Time(%) Time Calls Avg Min Max Name
53.12% 1.22036s 5005 243.83us 9.6670us 8.5762ms cudaDeviceSynchronize
25.10% 576.78ms 5005 115.24us 44.250us 11.888ms cudaLaunch
9.13% 209.77ms 9 23.308ms 16.667us 208.54ms cudaMalloc
6.56% 150.65ms 1 150.65ms 150.65ms 150.65ms cudaDeviceReset
5.33% 122.39ms 50050 2.4450us 833ns 6.1167ms cudaSetupArgument
0.60% 13.808ms 5005 2.7580us 1.0830us 104.25us cudaConfigureCall
0.10% 2.3845ms 9 264.94us 22.333us 537.75us cudaFree
0.04% 938.75us 8 117.34us 58.917us 169.08us cudaMemset
0.02% 461.33us 83 5.5580us 1.4160us 197.58us cuDeviceGetAttribute
0.00% 15.500us 2 7.7500us 3.6670us 11.833us cuDeviceGetCount
0.00% 7.6670us 1 7.6670us 7.6670us 7.6670us cuDeviceTotalMem
0.00% 4.8340us 1 4.8340us 4.8340us 4.8340us cuDeviceGetName
0.00% 3.6670us 2 1.8330us 1.6670us 2.0000us cuDeviceGet
*/
As detailed in the comments on the original message, my problem was entirely related to the GPU I'm using (Tegra K1). Here's an answer I found for this particular problem; it might be useful for other GPUs as well. The average cudaDeviceSynchronize time on my Jetson TK1 went from 250 us to 10 us.
The GPU clock rate of the Tegra was 72000 kHz by default; we have to set it to 852000 kHz using these commands:
$ echo 852000000 > /sys/kernel/debug/clock/override.gbus/rate
$ echo 1 > /sys/kernel/debug/clock/override.gbus/state
We can find the list of available frequencies using this command:
$ cat /sys/kernel/debug/clock/gbus/possible_rates
72000 108000 180000 252000 324000 396000 468000 540000 612000 648000 684000 708000 756000 804000 852000 (kHz)
More performance can be obtained (again, in exchange for a higher power draw) on both the CPU and GPU; check this link for more information.
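To verify the improvement, a minimal timing sketch along these lines can be used (my own illustration, not code from the original post); it simply averages the cost of launching an empty kernel plus cudaDeviceSynchronize:
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>
__global__ void empty_kernel() {}
int main() {
    const int iters = 5000;
    empty_kernel<<<1, 1>>>();                 // warm-up launch
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("average launch + cudaDeviceSynchronize: %.1f us\n", us);
    return 0;
}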
I have a Frankenstein ESP32 setup using my dead Wemos Lolin32 board and an external AI-Thinker ESP32 chip.
It has been working all right until recently: it fails to connect to any WiFi while dumping garbage data for anything beyond the WiFi.begin() call. It occasionally connects in roughly 1 of 15 reboots. It also seems to sometimes stall the UART before triggering a reboot and then working.
I have been using it with a GC9A01 1.28-inch TFT display. I may have damaged it while making the connections, but I can't be sure. If I don't use WiFi, most other functionality seems to work OK.
This is a sample of the many sketches I have tried.
#include <Arduino.h>
#include <WiFi.h>
#include "time.h"
const char *ssid = "VUMA FIBER ";
const char *password = "mysecurepassword";
const char *ntpServer = "pool.ntp.org";
const long gmtOffset_sec = 3 * 3600;
const int daylightOffset_sec = 0;
hw_timer_t *My_timer = NULL;
class Timehandler
{
private:
hw_timer_t *My_timer = NULL;
uint8_t counter = 0;
public:
uint8_t hour;
uint8_t minutes;
uint8_t seconds;
unsigned int year;
uint8_t date;
uint8_t day;
struct tm timeinfo;
bool fetchtime(uint8_t gmtOffset = 0, uint8_t daylightOffset_sec = 0, const char *ntpServer = "pool.ntp.org")
{
// configTime(gmtOffset * 3600, daylightOffset_sec, ntpServer);
// struct tm timeinfo;
// if (!getLocalTime(&timeinfo)) {
// return false;
// } else {
Serial.println(&timeinfo, "%A, %B %d %Y %H:%M:%S");
return true;
// }
}
void maintainTime()
{
seconds++;
if (seconds >= 60)
{
minutes++;
seconds = 0;
}
if (minutes >= 60)
{
hour++;
minutes = 0;
}
if (hour >= 24)
{
date++;
hour = 0;
}
Serial.printf("%02d", hour);
Serial.print(":");
Serial.printf("%02d", minutes);
Serial.print(":");
Serial.printf("%02d", seconds);
Serial.println("");
}
void getTime()
{
struct tm timeinfo;
if (!getLocalTime(&timeinfo))
{
Serial.println("Failed to obtain time");
return;
}
hour = timeinfo.tm_hour;
minutes = timeinfo.tm_min;
seconds = timeinfo.tm_sec;
year = timeinfo.tm_year;
date = timeinfo.tm_mday;
day = timeinfo.tm_wday;
}
};
bool connected = false;
Timehandler t;
void setup()
{
Serial.begin(115200);
// connect to WiFi
Serial.print("Connecting to ");
Serial.println(ssid);
WiFi.begin(ssid, password);
t.fetchtime();
t.getTime();
}
void loop()
{
if (!connected)
{
Serial.println("failed...retrying");
if (WiFi.status() == WL_CONNECTED)
{
Serial.println(" CONNECTED");
while (!t.fetchtime(3))
{
Serial.println("failed...retrying");
delay(500);
t.fetchtime(3);
WiFi.disconnect(true);
WiFi.mode(WIFI_OFF);
}
t.getTime();
connected = true;
}
}else{
t.maintainTime();
}
t.maintainTime();
delay(1000);
}
Here is a sample of the serial output.
CURRENT: upload_protocol = esptool
Looking for upload port...
Auto-detected: COM5
Uploading .pio\build\lolin32\firmware.bin
esptool.py v3.1
Serial port COM5
Connecting....
Chip is ESP32-D0WD (revision 1)
Features: WiFi, BT, Dual Core, 240MHz, VRef calibration in efuse, Coding Scheme None
Crystal is 40MHz
MAC: 94:3c:c6:10:4b:30
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 460800
Changed.
Configuring flash size...
Auto-detected Flash size: 4MB
Flash will be erased from 0x00001000 to 0x00005fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Flash will be erased from 0x0000e000 to 0x0000ffff...
Flash will be erased from 0x00010000 to 0x000abfff...
Compressed 17120 bytes to 11164...
Writing at 0x00001000... (100 %)
Wrote 17120 bytes (11164 compressed) at 0x00001000 in 0.6 seconds (effective 213.4 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 128...
Writing at 0x00008000... (100 %)
Wrote 3072 bytes (128 compressed) at 0x00008000 in 0.1 seconds (effective 247.8 kbit/s)...
Hash of data verified.
Compressed 8192 bytes to 47...
Writing at 0x0000e000... (100 %)
Wrote 8192 bytes (47 compressed) at 0x0000e000 in 0.2 seconds (effective 397.7 kbit/s)...
Hash of data verified.
Compressed 638784 bytes to 391514...
Writing at 0x00010000... (4 %)
Writing at 0x0001bce6... (8 %)
Writing at 0x00029453... (12 %)
Writing at 0x00031fdd... (16 %)
Writing at 0x00037279... (20 %)
Writing at 0x0003c526... (25 %)
Writing at 0x000419cd... (29 %)
Writing at 0x00046f5c... (33 %)
Writing at 0x0004c242... (37 %)
Writing at 0x000533e8... (41 %)
Writing at 0x0005ba7c... (45 %)
Writing at 0x00061195... (50 %)
Writing at 0x00066a45... (54 %)
Writing at 0x0006bdb0... (58 %)
Writing at 0x000715e8... (62 %)
Writing at 0x00077550... (66 %)
Writing at 0x0007cdc5... (70 %)
Writing at 0x000830c2... (75 %)
Writing at 0x00088d20... (79 %)
Writing at 0x0008ebd9... (83 %)
Writing at 0x00094b83... (87 %)
Writing at 0x0009ab4a... (91 %)
Writing at 0x000a0842... (95 %)
Writing at 0x000a6686... (100 %)
Wrote 638784 bytes (391514 compressed) at 0x00010000 in 10.5 seconds (effective 486.4 kbit/s)...
Hash of data verified.
Leaving...
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:1044load:0x40078000,len:10124
load:0x40080400,len:5856
entry 0x400806a8
Connecting to VUMA FIBER
���n���b�l�|�␒b␒␂␌␌␌ll␌␌␌␌␌�␌␌�␌␌l`␃␜␒␒nn�␐␂␌�n�np�␒␒nn␌��r��`␃␜␒␒no�␐�bbb|␒�b␒␒on�␎l�␌�␌␌�␌ll`␃␜␒␒oo�␐�bbc|␒�b␓␒nn�␎l�␌�␌␌�␌�l`␃␜␒␒no�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�␌�l`␃␒␒nn�␐�ccc|␒�b␒␒nn�␎l�␌�␌␌�␌␌�␎l␜␒␒on�␐�cbc|␒�b␒␒on�␎l�␌�␌␌�␌l�␎l␜␒␒oo�␐�ccc|␒�b␓␒oo�␎l�␌�␌␌�␌��␏l␜␒␒oo�␐�ccc|␒�b␓␒nn�␎l�␌�␌␌�␌�␎l␜␒␒nn�␐�ccc|␒�b␒␒on�␎l�␌�␌␌�␌␌l`␃␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�␌ll`␂␜␒␒nn�␐�bbb|␒�b␓␒on�␎l�␌�␌␌�l␌l`␂␜␒␒on�␐�cbb|␒�b␓␒oo�␎l�␌�␌␌�lll`␂␜␒␒nn�␐�bcc|␒�b␓␒on�␎l�␌�␌␌�l�l`␂␜␓␒oo�␐�ccc|␒�b␛␒og�␎l�␌�␌␌�l�d`␃␜␒␒on�␐�ccc|␒�b␓␒oo�␎l�␌�␌␌�l␌�␏l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�ll�␎l␜␒␒oo�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�l��␎l␜␒␒on�␐�bbb|␒�b␒␒on�␎l�␌�␌␌�l�␏l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�l␌l`␂␜␒␒oo�␐�ccc|␒�b␓␒no�␎l�␌�␌␌�lll`␂␜␒␒nn�␐�ccb|␒�b␒␒oo�␎l�␌�␌␌��␌l`␃␜␒␒oo�␐�cbb|␒�b␒␒oo�␎l�␌�␌␌��ll`␂␜␒␒on�␐�bbb|␒�b␒␒no�␎l�␌�␌␌���l`␂␜␒␒oo�␐�bcc|␒�b␒␒on�␎l�␌�␌␌���l`␂␜␒␒on�␐�bbc|␒�b␒␒no�␎l�␌�␌␌��␌�␏l␜␒␒nn�␐�ccc|␒�b␒␒nn�␎l�␌�␌␌��l�␏l␜␒␒on�␐�cbb|␒�b␓␒on�␎l�␌�␌␌����␎l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌���␎l␜␒␒on�␐�ccc|␒�b␓␒no�␎l�␌�␌␌��␌l`␃␜␒␒nn�␐�bcb|␒�b␒␒nn�␎l�␌�␌␌��ll`␂␜␒␒no�␐�bcb|␒�b␒␒nn�␎l�␌�␌␌��␌l`␃␜␒␒nn�␐�bbb|␒�b␓␒nn�␎l�␌�␌␌��ll`␂␜␒␒nn�␐�bbb|␒�b␓␒no�␎l�␌�␌␌��l`␃␜␒␒no�␐�bbb|␒�b␒␒no�␎l�␌�␌␌���l`␂␜␒␒nn�␐�bcb|␒�b␒␒nn�␎l�␌�␌␌��␌�␏l␜␒␒oo�␐�ccc|␒�b␓␒go�␎l�␄�␄␌��l�␏l␜␒␒oo�␐�ccc|␒�b␓␒oo�␎l�␌�␌␌�쌎␎l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌���␎l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌��␌l`␂␜␒␒oo�␐�bbb|␒�b␒␒oo�␎l�␌�␌␌��ll`␃␜␒␒no�␐�bcb|␒�b␓␒no�␎l�␌�␌␌�␌�l`␂␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�␌�prl␜␒␒on�␐�ccc|␒�b␓␒no�␎l�␌�␌␌�␌�␜rl␜␒␒no�␐�cbb|␒�b␓␒on�␎l�␌�␌␌�␌�␜rl␜␒␒on�␐�ccb|␒�b␒␒no�␎l�␌�␌␌�␌��␎l␜␒␒no�␐�bcb|␒�b␒␒on�␎l�␌�␌␌�␌�rrl␜␒␒no�␐�ccc|␒�b␒␒no�␎l�␌�␌␌�␌��␎l␜␒␒nn�␐�bbb|␒�b␒␒oo�␎l�␌�␌␌�␌��␎l␜␒␒oo�␐�ccc|␒�b␒␒oo�␎l�␌�␌␌�␌�l`␃␜␒␒on�␐�bcb|␒�b␒␒on�␎l�␌�␌␌�␌�|rl␜␒␒oo�␐�ccc|␒�b␓␒oo�␎l�␌�␌␌�l�l`␃␜␒␒no�␐�bcc|␒�b␓␒oo�␎l�␌�␌␌�l�prl␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌␌�l�␜rl␜␒␒no�␐�bbb|␒�b␓␒on�␎l�␌�␌␌�l�␜rl␜␒␒on�␐�ccc|␒�b␓␒oo�␎l�␌�␌␌�l��␎l␜␒␒nn�␐�ccb|␒�b␒␒nn�␎l�␌�␌␌�l�rrl␜␒␒nn�␐�cbc|␒�b␓␒no�␎l�␌�␌␌�l��␎l␜␒␒no�␐�ccc|␒�b␓␒no�␎l�␌�␌␌�l��␏l␜␒␒on�␐�ccb|␒�b␒␒oo�␎l�␌�␌␌�l�l`␂␜␒␒oo�␐�ccc|␒�b␓␒no�␎l�␌�␌␄�d�|rl␜␒␒no�␐�ccb|␒�b␓␒nn�␎l�␌�␌l�␌␌l`␂␜␒␒on�␐�bcb|␒�b␒␒nn�␎l�␌�␌l�␌ll`␂␜␒␒nn�␐�bbb|␒�b␒␒no�␎l�␌�␌l�␌�l`␂␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�␌�l`␃␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�␌␌�␎l␜␒␒nn�␐�bbc|␒�b␒␒nn�␎l�␌�␌l�␌l�␎l␜␒␒no�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�␌��␏l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�␌�␎l␜␒␒no�␐�ccc|␒�b␓␒on�␎l�␌�␌l�␌␌l`␃␜␒␒on�␐�bcb|␒�b␒␒on�␎l�␌�␌l�␌ll`␃␜␒␒on�␐�bcb|␒�b␓␒on�␎l�␌�␌l�l␌l`␃␜␒␒oo�␐�cbb|␒�b␓␒oo�␎l�␌�␌l�lll`␂␜␒␒no�␐�bcb|␒�b␓␒on�␎l�␌�␌l�l�l`␂␜␒␒on�␐�bcb|␒�b␒␒no�␎l�␌�␌l�l�l`␂␜␒␒on�␐�bcc|␒�b␓␒oo�␎l�␌�␌l�l␌�␎l␜␒␒oo�␐�bcc|␒�b␓␒nn�␎l�␌�␌l�ll�␏l␜␒␒oo�␐�bcc|␒�b␒␒oo�␎l�␌�␌l�l��␎l␜␒␒no�␐�bbc|␒�b␒␒nn�␎l�␌�␌l�l�␎l␜␒␒oo�␐�bbb|␒�b␓␒oo�␎l�␌�␌l�l␌l`␃␜␒␒no�␐�cbc|␒�b␓␒no�␎l�␌�␌l�lll`␃␜␒␒no�␐�bcb|␒�b␓␒nn�␎l�␌�␌l��␌l`␂␜␒␒on�␐�bcb|␒�b␒␒nn�␎l�␌�␌l��ll`␃␜␒␒on�␐�ccc|␒�b␓␒oo�␎l�␌�␌l���l`␂␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌l���l`␂␜␒␒on�␐�bbb|␒�b␓␒oo�␎l�␌�␌l��␌�␎l␜␒␒on�␐�bcb|␒�b␒␒oo�␎l�␌�␌l��l�␏l␜␒␒on�␐�bcc|␒�b␒␒oo�␎l�␌�␌l����␎l␜␒␒oo�␐�ccb|␒�b␓␒oo�␎l�␌�␌l���␏l␜␒␒'g�␐�#c#<␒�b␛␒'o�␏l�␄�␌d��␌l`␃␜␒␒no�␐�ccb|␒�b␓␒oo�␎l�␌�␌l��ll`␃␜␒␒oo�␐�bbc|␒�b␒␒on�␎l�␌�␌l��␌l`␂␜␒␒nn�␐�cbc|␒�b␒␒nn�␎l�␌�␌l��ll`␃␜␒␒oo�␐�ccc|␒�b␓␒oo�␎l�␌�␌l��l`␃␜␒␒nn�␐�cbb|␒�b␒␒nn�␎l�␌�␌l���l`␂␜␒␒on�␐�cbc|␒�b␓␒oo�␎l�␌�␌l��␌�␏l␜␒␒nn�␐�bbc|␒�b␓␒no�␎l�␌�␌l��l�␎l␜␒␒nn�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�쌎␏l␜␓␒''�␐�###<␒�#␛␒''�␇$�␄�␄$���␇l␜␒␒no�␐�bcb|␒�b␒␒oo�␎l�␌�␌l��␌l`␃␜␒␒no�␐�ccb|␒�b␓␒no�␎l�␌�␌l��ll`␃␜␒␒oo�␐�ccc|␒�b␓␒oo�␎l�␌�␌l�␌�l`␃␜␒␒nn�␐�bbb|␒�b␒␒on�␎l�␌�␌l�␌�prl␜␒␒no�␐�bbb|␒�b␒␒no�␎l�␌�␌l�␌�␜rl␜␒␒oo�␐�bcb|␒�b␒␒no�␎l�␌�␌l�␌�␜rl␜␒␒nn�␐�bbc|␒�b␒␒on�␎l�␌�␌l�␌��␎l␜␒␒on�␐�bbb|␒�b␒␒nn�␎l�␌�␌l�␌�rrl␜␒␒on�␐�cbb|␒�b␒␒nn�␎l�␌�␌l�␌��␎l␜␒␒oo�␐�c
bc|␒�b␒␒no�␎l�␌�␌l�␌��␏l␜␒␛''�␐�###<␓�b␛␛''�␎$�␄�␄$�␄�$ ␃␜␒␒oo�␐�cbc|␒�b␓␒on�␎l�␌�␌l�␌�|rl␜␒␒nn�␐�cbc|␒�b␒␒on�␎l�␌�␌l�l�l`␂␜␒␒on�␐�ccb|␒�b␒␒no�␎l�␌�␌l�l�prlets Jun 8 2016 00:22:57
rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:10124
load:0x40080400,len:5856
entry 0x400806a8
ets Jun 8 2016 00:22:57
rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:10124
load:0x40080400,len:5856
entry 0x400806a8
Connecting to VUMA FIBER
Sunday, January 00 1900 00:00:00
Failed to obtain time
failed...retrying
00:00:01
failed...retrying
00:00:02
failed...retrying
00:00:03
failed...retrying
CONNECTED
Sunday, January 00 1900 00:00:00
Failed to obtain time
00:00:04
00:00:05
00:00:06
00:00:07
00:00:08
00:00:09
00:00:10
00:00:11
00:00:12
00:00:13
00:00:14
00:00:15
00:00:16
00:00:17
00:00:18
00:00:19
00:00:20
00:00:21
00:00:22
00:00:23
00:00:24
00:00:25
00:00:26
00:00:27
00:00:28
00:00:29
00:00:30
00:00:31
00:00:32
00:00:33
00:00:34
00:00:35
00:00:36
00:00:37
00:00:38
00:00:39
00:00:40
00:00:41
00:00:42
Any help would be appreciated.
I am trying to parallelize a Monte Carlo simulation using OpenCL. I use MWC64X as a uniform random number generator. The code runs well on different Intel CPUs, since the output of the parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on a GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40, CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, thus the parallel influence only composes of 1 and 0, which is incorrect (in GPU runs) and doesn't happen when parallelizing on a Intel CPU.
When I check the output of the random number generator if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on GPU. Below is the clinfo and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };
void MWC64X_Step(mwc64x_state_t *s)
{
uint X=s->x, C=s->c;
uint Xn=MWC64X_A*X+C;
uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
uint Cn=mad_hi(MWC64X_A,X,carry);
s->x=Xn;
s->c=Cn;
}
//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
MWC64X_Step(s);
return res;
}
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
int flag=get_global_id(0);
int sum=0;
int count=10000;
int assignment[N];
//or try to get newlambda like original version does
if(flag < literals){
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
float rand=randint*1.0/BASE;
//if(flag==0)
// printf("randint=%u",randint);
if(lambdap[j]<rand)
assignment[lambdas[j]]=0;
else
assignment[lambdas[j]]=1;
}
//the true case
assignment[flag]=1;
int valuet=0;
int index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuet=1;
break;
}
}
//the false case
assignment[flag]=0;
int valuef=0;
index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuef=1;
break;
}
}
sum += valuet-valuef;
}
influence[flag] = 1.0*sum/count;
printf("sum%d:%d\t", flag, sum);
}
}
What might be the problem when running the code on the GPU? Is it MWC64X? According to its author, it performs well on NVIDIA GPUs. If so, how can I fix it; if not, where else might the problem be?
(This started out as a comment, it turns out this was the source of the problem so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
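For illustration, a minimal per-work-item initialisation could look like this (my own sketch using only the struct fields from the question; the seed constants are arbitrary, and the MWC64X distribution also ships a dedicated MWC64X_SeedStreams helper if you want properly spaced streams):
mwc64x_state_t rng;
rng.x = 0x12345678u ^ (uint)get_global_id(0);   // distinct, non-zero state per work-item
rng.c = 0x000ABCDEu + (uint)get_global_id(0);   // keep the carry well below MWC64X_A
for (int i = 0; i < count; i++) {
    uint randint = MWC64X_NextUint(&rng);
    ...
}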
All use-cases of pseudo-random numbers are a next-level challenge on truly [PARALLEL] computing platforms (platforms, not languages).
Either there is some source of randomness, which gets us into trouble once massively parallel requests have to be handled fairly in a truly [PARALLEL] fashion (hardware resources may help here, yet at the cost of not being able to reproduce the same behaviour "outside" of this very platform and moment in time, if such a source is not software-operated with some seed-injection feature that can set up the "just"-pseudo-random algorithm producing a purely [SERIAL] sequence of "just"-pseudo-random numbers).
Or there is some "shared" generator of pseudo-random numbers, which enjoys a higher system-wide level of entropy (good for the resulting "quality" of pseudo-randomness), but at the cost of a purely serial dependence (no parallel execution is possible; the sequence is served one number after another) and a close-to-zero chance of repeatable runs (a must for reproducible science) delivering the same sequences, which are needed for testing and for method-validation cases.
RESUME:
The code may employ work-item-"private" pseudo-random generating functions (privacy is a must both for parallel code execution and for the mutual independence, i.e. non-interference, of the generated streams), yet each instance must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code runs, and b) initialised in a repeatably reproducible manner, for the sake of re-running the test at different times, often on different OpenCL target computing platforms.
For __kernel-s that do not rely on hardware-specific sources of randomness, meeting conditions a && b will suffice for receiving repeatably reproducible (same) results for testing "in vitro", while still providing a reasonably random method for generating results during generic production-level code runs "in vivo".
The comparison of net run-times (benchmarked above) seems to show that Amdahl's-law add-on overhead costs, plus a tail-end effect of the atomicity of work, decided the outcome: the net run-time was ~3.6x faster on the XEON than on the GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
|....
index1 = 17 |....
size = 51 |....
dim1_size = 6 |....
sum0: 10000 |....
sum1: 10000 |....
sum2: 10000 |....
sum3: 10000 |....
sum4: 10000 |....
sum5: 0 |....
sum6: 0 |....
sum7: 0 |....
sum8: 10000 |....
sum9: 0 |....
sum10: 0 |....
sum11: 0 |....
sum12: 0 |....
sum13: 0 |....
sum14: 0 |....
sum15: 0 |....
sum16: 0 |....
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL
Background: performing benchmarking/comparison over GPGPU platforms.
Problem: device synchronization when dispatching a DirectX 11 compute shader.
Looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
CUDA and OpenCL functions are clearer about blocking/non-blocking behaviour. DirectCompute, however, is more related to the graphics pipeline (which I am learning and am very unfamiliar with), so I have trouble finding out whether a Dispatch call is blocking, or whether previous memory allocations/transfers have finished.
Code DX_1:
// Setup
...
for (...) {
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
}
// Release
...
Code DX_2:
for (...) {
// Setup
...
startTimer();
context->Dispatch(number_of_groups, 1, 1);
times[i] = stopTimer();
// Release
...
}
Results (average times of 2^2 to 2^11 elements):
DX_1 DX_2 CUDA
1.6 205.5 24.8
1.8 133.4 24.8
29.1 186.5 25.6
18.6 175.0 25.6
11.4 187.5 26.6
85.2 127.7 26.3
166.4 151.1 28.1
98.2 149.5 35.2
26.8 203.5 31.6
Notice: these times are from a desktop GPU with a screen connected, so some erratic timings are expected. Times are not supposed to include host-to-device buffer transfers.
Notice 2: these are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronizing with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (event record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event Host QPC
4,6 30,0
4,8 30,0
5,0 31,0
5,2 32,0
5,6 34,0
6,1 34,0
6,9 31,0
8,3 47,0
9,2 34,0
12,0 39,0
16,7 46,0
20,5 55,0
32,1 69,0
48,5 111,0
86,0 134,0
182,4 237,0
419,0 473,0
In case my question brings someone here hoping to find out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.
Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
...
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:
ID3D11Device* device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
desc.MiscFlags = 0;
device->CreateQuery(&desc, &disjoint_query);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &q_start);
device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;
static void __inline startTimer()
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
}
static double __inline stopTimer()
{
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context and I tried to do some clean-up but errors might be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:
https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/
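If what you actually want is a cudaDeviceSynchronize()-style "wait until the GPU is idle", the closest D3D11 analogue I know of is an event query; a rough sketch, assuming device and context already exist and with error handling omitted:
D3D11_QUERY_DESC desc = {};
desc.Query = D3D11_QUERY_EVENT;
ID3D11Query *done = nullptr;
device->CreateQuery(&desc, &done);
context->Dispatch(number_of_groups, 1, 1);
context->End(done);      // the event becomes signalled once the GPU reaches this point
context->Flush();        // make sure the work is actually submitted
BOOL finished = FALSE;
while (context->GetData(done, &finished, sizeof(finished), 0) != S_OK || !finished) {
    // spin (or sleep) until all previously submitted GPU work has completed
}
done->Release();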
A simple question:
What is the unit of QueryPerformanceFrequency?
Hz (ticks per second)?
Thank you very much,
Bruno
Q: Units of QueryPerformanceFrequency?
A: KILO-HERTZ (NOT Hz)
=========== DETAILS ==============================================
My research indicates that both Counters and Freq are in KILOs, KILO-clock-ticks and KILO-HERTZ!
The counters register KILO-clicks (KLICKS) and the frequency is either in kHz or I am woefully underclocked. When you divide the Clock_Ticks by Clock_Frequency, kclicks/(kclicks*sec^-1), everything cancels except seconds.
Here is an example C program stripped to just the essentials:
#include "stdio.h"
#include <windows.h> // Needed for LARGE_INTEGER
// gcc cpu.freq.test.c -o cft.exe
// cft.exe -> Sleep d_KLICKS=3417790, d_time=0.999182880 sec, CPU_Freq=3420585 KILO-Hz
void main(int argc, char *argv[]) {
// Clock KILO-ticks start, end, CPU_Freq in kHz. KILOs cancel
LARGE_INTEGER sklick, eklick, cpu_khz;
double delta_time; // Expected time in SECONDS. All units above are k.
QueryPerformanceFrequency(&cpu_khz); // Gets clock KILO-tics, Klicks/sec
QueryPerformanceCounter(&sklick); // Capture cpu Start Klicks
Sleep(1000); // Sleep 1000 MILLI-seconds
QueryPerformanceCounter(&eklick); // Capture cpu End Klicks
delta_time = (eklick.QuadPart-sklick.QuadPart) / (double)cpu_khz.QuadPart;
printf("Sleep d_KLICKS=%lld, d_time=%4.9lf sec, CPU_Freq=%lld KILO-Hz\n",
eklick.QuadPart-sklick.QuadPart, delta_time, cpu_khz.QuadPart);
}
It actually compiles! Running...
Sleep d_KLICKS=3418803, d_time=0.999479036 sec, CPU_Freq=3420585 KILO-Hz
The CPU freq reads 3420585 or 3.420585E6 or 3.4 M-Hertz? <- MEGA-HURTS !OUCH!
The actual CPU freq is 3.4 Mega-Kilo-Hz or 3.4 GHz
Microsoft appears to be confused (some things never change):
https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
// Activity to be timed
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
// We now have the elapsed number of ticks, along with the
// number of ticks-per-second.
The number of "elapsed ticks" in 1 second is in the MILLIONS, NOT BILLIONS so they are NOT UNIT-CPU-CLOCK-TICKS but KILO-CPU-CLOCK-TICKS
Same off-by-3-orders-of-magnitude error for FREQ: 3.4 MILLION is not "ticks-per-second" but THOUSAND-ticks-per-second.
As long as you divide one by the other, the k-clicks cancel, leaving a result in seconds. If one were so fatuous as to take MS's documentation at its word and try to use their "ticks-per-second" in some other calculation, you would wind up off by a factor of 1000, or ~1 standard_MS_error!
Perhaps we should call Heinrich in to check HIS units? Oops! 153 years too late. :(
I am trying to make an LCG random number generator run in parallel using CUDA and GPUs. However, I am having trouble actually getting multiple threads running at the same time. Here is a copy of the code:
#include <iostream>
#include <math.h>
__global__ void rng(long *cont)
{
int a=9, c=3, F, X=1;
long M=524288, Y;
printf("\nKernel X is %d\n", X[0]);
F=X;
Y=X;
printf("Kernel F is %d\nKernel Y is %d\n", F, Y);
Y=(a*Y+c)%M;
printf("%ld\t", Y);
while(Y!=F)
{
Y=(a*Y+c)%M;
printf("%ld\t", Y);
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int X[1];
long *dev_cont;
int *dev_X;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
printf("Please give the value of the seed X ");
scanf("%d", &X[0]);
printf("Host X is: %d", *X);
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_X, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_X, X, 1 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,1>>>(dev_cont);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
cudaFree(dev_cont);
cudaFree(dev_X);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
Right now I am only sending one block with one thread. However, if I send 100 threads, the only thing that happens is that it produces the same number 100 times and then proceeds to the next number. In theory this is what is expected, but it defeats the purpose of "random numbers" when a number is repeated.
The idea I want to implement is to have multiple threads. One thread will use the formula Y=(a*Y+c)%M with an initial value of Y=1, another thread will use the same formula with an initial value of Y=1000, and so on. However, once the first thread produces 1000 numbers, it needs to stop making more calculations, because if it continues it will interfere with the second thread, which produces its numbers starting from Y=1000.
If anyone can point me in the right direction, at least on how to create multiple threads with different functions or instructions inside them running in parallel, I will try to figure out the rest.
Thanks!
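For what it's worth, a sketch of that idea (each thread gets its own seed and writes a fixed-length run into its own slice of the output; names and sizes here are placeholders of my own, not from the code above):
__global__ void lcg_per_thread(long *out, const long *seeds, int per_thread)
{
    const long a = 9, c = 3, M = 524288;
    long Y = seeds[threadIdx.x];              // e.g. 1 for thread 0, 1000 for thread 1, ...
    int base = threadIdx.x * per_thread;
    for (int k = 0; k < per_thread; ++k) {
        Y = (a * Y + c) % M;
        out[base + k] = Y;                    // each thread stops after per_thread numbers
    }
}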
UPDATE: July 31, 8:14PM EST
I updated my code to the following. Basically I am trying to produce 256 random numbers. I created the array where those 256 numbers will be stored, as well as an array with 10 different seed values for Y in the threads, and I changed the code to request 10 threads on the device. I am also saving the generated numbers in an array. The code is not working correctly. Please advise on how to fix it or how to make it achieve what I want.
Thanks!
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=N[threadIdx.x];
int a=9, c=3, i;
long M=256;
for(i=0;i<256;i++)
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int i;
int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
long *dev_cont;
int *dev_L, *dev_N;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_L, sizeof(int) );
cudaMalloc( (void**)&dev_N, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,10>>>(dev_cont, dev_L, dev_N);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
printf("Your numbers are:");
for(i=0;i<256;i++)
{
printf("%d\t", N[i]);
}
cudaFree(dev_cont);
cudaFree(dev_L);
cudaFree(dev_N);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
@Bardia - Please let me know how I can change my code to accommodate my needs.
UPDATE: August 1, 5:39PM EST
I edited my code to incorporate @Bardia's modifications to the kernel code. However, a few errors in the number generation are showing up. First, the counter that I created in the kernel to count how many numbers are being created is not working; at the end it only reports that "1" number was generated. The timer that I created to measure the time it takes for the kernel to execute the instructions is also not working, because it keeps displaying 0.00 ms. And based on the parameters that I have set for the formula, the numbers that are generated, copied into the array and then printed on the screen do not reflect the numbers that are meant to appear (or even come close). These all used to work before.
Here is the new code:
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=L[threadIdx.x];
int a=9, c=3, i;
long M=256;
int length=ceil((float)M/10); //256 divided by the number of threads.
for(i=(threadIdx.x*length);i<length;i++)
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++;
}
}
int main()
{
long cont[1]={1};
int i;
int L[10]={1,25,50,75,100,125,150,175,200,225}, N[256];
long *dev_cont;
int *dev_L, *dev_N;
cudaEvent_t beginEvent;
cudaEvent_t endEvent;
cudaEventCreate( &beginEvent );
cudaEventCreate( &endEvent );
cudaEventRecord( beginEvent, 0);
cudaMalloc( (void**)&dev_cont, sizeof(long) );
cudaMalloc( (void**)&dev_L, sizeof(int) );
cudaMalloc( (void**)&dev_N, sizeof(int) );
cudaMemcpy(dev_cont, cont, 1 * sizeof(long), cudaMemcpyHostToDevice);
cudaMemcpy(dev_L, L, 10 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_N, N, 256 * sizeof(int), cudaMemcpyHostToDevice);
rng<<<1,10>>>(dev_cont, dev_L, dev_N);
cudaMemcpy(cont, dev_cont, 1 * sizeof(long), cudaMemcpyDeviceToHost);
cudaMemcpy(N, dev_N, 256 * sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord( endEvent, 0);
cudaEventSynchronize (endEvent );
float timevalue;
cudaEventElapsedTime (&timevalue, beginEvent, endEvent);
printf("\n\nYou generated a total of %ld numbers", cont[0]);
printf("\nCUDA Kernel Time: %.2f ms\n", timevalue);
printf("Your numbers are:");
for(i=0;i<256;i++)
{
printf("%d\t", N[i]);
}
cudaFree(dev_cont);
cudaFree(dev_L);
cudaFree(dev_N);
cudaEventDestroy( endEvent );
cudaEventDestroy( beginEvent );
return 0;
}
This is the output I receive:
[wigberto#client2 CUDA]$ ./RNG8
You generated a total of 1 numbers
CUDA Kernel Time: 0.00 ms
Your numbers are:614350480 32767 1132936976 11079 2 0 10 0 1293351837 0 -161443660 48 0 0 614350336 32767 1293351836 0 -161444681 48 614350760 32767 1132936976 11079 2 0 10 0 1057178751 0 -161443660 48 155289096 49 614350416 32767 1057178750 0 614350816 32767 614350840 32767 155210544 49 0 0 1132937352 11079 1130370784 11079 1130382061 11079 155289096 49 1130376992 11079 0 1 1610 1 1 1 1130370408 11079 614350896 32767 614350816 32767 1057178751 0 614350840 32767 0 0 -161443150 48 0 0 1132937352 11079 1 11079 0 0 1 0 614351008 32767 614351032 32767 0 0 0 0 0 0 1130369536 1 1132937352 11079 1130370400 11079 614350944 32767 1130369536 11079 1130382061 11079 1130370784 11079 1130365792 11079 6143510880 614351008 32767 -920274837 0 614351032 32767 0 0 -161443150 48 0 0 0 0 1 0 128 0-153802168 48 614350896 32767 1132839104 11079 97 0 88 0 1 0 155249184 49 1130370784 11079 0 0-1 0 1130364928 11079 2464624 0 4198536 0 4198536 0 4197546 0 372297808 0 1130373120 11079 -161427611 48 111079 0 0 1 0 -153802272 48 155249184 49 372297840 0 -1 0 -161404446 48 0 0 0 0372298000 0 372297896 0 372297984 0 0 0 0 0 1130369536 11079 84 0 1130471067 11079 6303744 0614351656 32767 0 0 -1 0 4198536 0 4198536 0 4197546 0 1130397880 11079 0 0 0 0 0 0 00 0 0 -161404446 48 0 0 4198536 0 4198536 0 6303744 0 614351280 32767 6303744 0 614351656 32767 614351640 32767 1 0 4197371 0 0 0 0 0 [wigberto#client2 CUDA]$
@Bardia - Please advise on what is the best thing to do here.
Thanks!
You can address threads within a block with the threadIdx variable.
I.e., in your case you should probably set
Y = threadIdx.x and then use Y=(a*Y+c)%M
But in general, implementing a good RNG on CUDA can be really difficult.
So I don't know if you want to implement your own generator just for practice.
Otherwise, there is the cuRAND library available, which provides a number of pseudo- and quasi-random generators, e.g. XORWOW, Mersenne Twister, Sobol, etc.
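For reference, a minimal cuRAND device-API sketch might look roughly like this (my own illustration; check the cuRAND documentation for details):
#include <curand_kernel.h>
__global__ void generate(unsigned int *out, unsigned long long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    curandState state;
    curand_init(seed, id, 0, &state);   // same seed, a different sequence per thread
    out[id] = curand(&state);           // one 32-bit pseudo-random number per thread
}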
It does the same work in all threads because you ask them all to do the same work. You should always distinguish threads from each other by addressing them.
For example, you should say: thread #1, you do this job and save your work here; thread #2, you do that job and save your work there; and then go back to the host and use that data.
For a two-dimensional grid of blocks with two-dimensional thread blocks, I use this code for addressing:
int X = blockIdx.x*blockDim.x+threadIdx.x;
int Y = blockIdx.y*blockDim.y+threadIdx.y;
The X and Y in the code above are the global address of your thread (I think for your case a one-dimensional grid and thread indexing is sufficient).
Also remember that you cannot use the printf function in the kernel; the GPU can't raise any interrupt. For this you can use the cuPrintf function, which is one of the CUDA SDK's samples, but read its instructions to use it correctly.
This answer relates to the edited part of the question.
I didn't notice that it is a recursive algorithm, and unfortunately I don't know how to parallelize a recursive algorithm.
My only idea for generating these 256 numbers is to generate them separately, i.e. generate 26 of them in the first thread, 26 of them in the second thread, and so on.
This code will do this (this is only the kernel part):
#include <iostream>
#include <math.h>
__global__ void rng(long *cont, int *L, int *N)
{
int Y=threadIdx.x;
Y=L[threadIdx.x];
int a=9, c=3, i;
long M=256;
int length=ceil((float)M/10); //256 divided by the number of threads.
for(i=(threadIdx.x*length);i<length;i++)
{
Y=(a*Y+c)%M;
N[i]=Y;
cont[0]++;
}
}
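One detail worth flagging in the loop above: as written, for(i=(threadIdx.x*length);i<length;i++) only ever iterates for thread 0, because the upper bound does not move with the thread index. A bound along these lines (my own correction, reusing the kernel's variable names) matches the intent of giving each thread its own chunk:
int start = threadIdx.x * length;
int end = min(start + length, (int)M);   // clamp the last thread's slice to M
for (i = start; i < end; i++) {
    Y = (a*Y+c) % M;
    N[i] = Y;
    cont[0]++;
}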