OpenCL Kernel with image transfer running slow - performance

I have a pretty simple OpenCL kernel, basically doing nothing more than defining its input:
__kernel void test(__read_only image3d_t d_multitest){}
My host side code is basic pyopencl to transfer an image to my device and run the kernel:
import pyopencl as cl
import numpy as np
platform = cl.get_platforms()[0]
devs = platform.get_devices()
device1 = devs[1]
h_test = np.zeros((64,512,512)).astype(np.float32,order='F')
mf = cl.mem_flags
ctx = cl.Context([device1])
Queue1 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
Queue2 = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
event_list=[]
fi = open('Minimal.cl', 'r')
fstr = "".join(fi.readlines())
prg = cl.Program(ctx, fstr).build()
knl = prg.test
d_test = cl.Image(ctx,mf.READ_ONLY, cl.ImageFormat(cl.channel_order.INTENSITY, cl.channel_type.FLOAT), h_test.shape)
e1 = cl.enqueue_copy(Queue1, d_test, h_test, is_blocking = False, origin = (0,0,0), region = h_test.shape)
knl.set_args(d_test)
cl.enqueue_nd_range_kernel(Queue2,knl,(512,512,64),None,wait_for=[e1,])
I am profiling this code on different devices and see that the transfer time basically scales with the memory bandwidth of the device, which is expected. The kernel execution time, on the other hand, varies wildly.
On NVIDIA, the kernel execution takes well under 1 ms.
However, the exact same kernel takes more than 20 ms on AMD's implementation.
My question is whether this kind of overhead is tolerable, or whether I am doing something fundamentally wrong.
Hardware:
NVIDIA GeForce GTX TITAN X
AMD Radeon R9 290X
Host:
Ubuntu 16.04
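To see where the 20 ms goes, it may help to split the kernel event's profiling timestamps into phases rather than looking at one duration. The following is a minimal sketch, assuming the event returned by cl.enqueue_nd_range_kernel is kept and waited on; pyopencl exposes the timestamps in nanoseconds as event.profile.queued, .submit, .start and .end. The helper name event_timings_ms and the stand-in event with made-up timestamps are hypothetical, used only so the sketch runs without a GPU.

```python
from types import SimpleNamespace


def event_timings_ms(event):
    """Split an OpenCL event's profiling timestamps (nanoseconds) into phases, in ms."""
    p = event.profile
    return {
        "queued_to_submit": (p.submit - p.queued) / 1e6,
        "submit_to_start": (p.start - p.submit) / 1e6,
        "execution": (p.end - p.start) / 1e6,
    }


# Stand-in for a real pyopencl event; the timestamps are invented for illustration.
fake_event = SimpleNamespace(
    profile=SimpleNamespace(queued=0, submit=1_000_000, start=3_000_000, end=23_000_000)
)

timings = event_timings_ms(fake_event)
for phase, ms in timings.items():
    print(f"{phase}: {ms:.3f} ms")
```

With the real code, this would be called as event_timings_ms(e2) after e2 = cl.enqueue_nd_range_kernel(...) and e2.wait(). A large submit_to_start value would point at launch or scheduling overhead rather than the kernel body itself.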

Add i2s Audio in device tree for SAM9x60 board

Our team has a SAM9x60 board and recently added an external audio board (UDA1334A, link: Documents). Unfortunately, that document only has a guide for the Raspberry Pi, whose device tree is very different from our board's. So I have tried to add it to the device tree myself, mostly based on the SAM9x60 tutorial for another board, but that differs a lot as well.
As I understand it, the audio board uses the UDA1334 codec, and I have to add a sound node to the device tree, as in the SAM9x60 tutorial:
sound {
    compatible = "mikroe,mikroe-proto";
    model = "wm8731 # sam9x60ek";
    i2s-controller = <&i2s>;
    audio-codec = <&wm8731>;
    dai-format = "i2s";
};
But I haven't found any driver for this card. After looking around, I tried simple-audio-card:
sound {
    compatible = "simple-audio-card";
    simple-audio-card,name = "1334 Card";
    simple-audio-card,format = "i2s";
    simple-audio-card,widgets = "Speaker", "Speakers";
    simple-audio-card,routing = "Speakers", "Speaker";
    simple-audio-card,bitclock-master = <&codec_dai>;
    simple-audio-card,frame-master = <&codec_dai>;
    simple-audio-card,cpu {
        #sound-dai-cells = <0>;
        sound-dai = <&i2s>;
    };
    codec_dai: simple-audio-card,codec {
        #sound-dai-cells = <1>;
        sound-dai = <&uda1334>;
    };
};
uda1334: codec#1a {
    compatible = "nxp,uda1334";
    nxp,mute-gpios = <&pioA 8 GPIO_ACTIVE_LOW>;
    nxp,deemph-gpios = <&pioC 3 GPIO_ACTIVE_LOW>;
    status = "okay";
};
When booting, I received these messages:
OF: /sound/simple-audio-card,codec: could not get #sound-dai-cells for /codec#1a
asoc-simple-card sound: parse error -22
asoc-simple-card: probe of sound failed with error -22
So am I using simple-audio-card the right way, or is there another way? Normally, ALSA registers a classD sound card, but I think that is just an amplifier. Apologies: I'm an Android software developer who had to take over the hardware work from someone who quit.
Additional question: I have looked into the Raspberry Pi device tree based on the UDA1334 document, and it is very different. As I understand it, the Raspberry Pi already uses the HiFiBerry DAC overlay, but how can that work with an external DAC like the UDA1334? I haven't seen any external node in the device tree. It looks like they just enable dtoverlay=hifiberry-dac and dtoverlay=i2s-mmap and it works.
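The -22 in the boot log above is a negated errno value, so it can be decoded with the standard errno table. This is a minimal sketch, assuming it runs on a Linux host whose errno numbering matches the kernel's; the helper name decode_kernel_error is hypothetical.

```python
import errno
import os


def decode_kernel_error(code):
    """Map a negative kernel return code such as -22 to its errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


name, text = decode_kernel_error(-22)
print(name, "-", text)
```

The code itself only says that something in the node was rejected as invalid. The preceding OF message is the more specific hint, since it names #sound-dai-cells on /codec#1a as the property that could not be read.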

Pytorch embedding too big for GPU but fits in CPU

I am using PyTorch Lightning, so Lightning controls the GPU/CPU assignments, and in return I get easy multi-GPU support for training.
I would like to create an embedding that does not fit in the GPU memory.
fit_in_cpu = torch.nn.Embedding(too_big_for_GPU, embedding_dim)
Then, when I select the subset for a batch, I want to send only that subset to the GPU:
GPU_tensor = fit_in_cpu(idx)
How do I do this in PyTorch Lightning?
Lightning will send anything that is registered as a model parameter to the GPU, i.e. the weights of layers (anything in torch.nn.*) and variables registered using torch.nn.parameter.Parameter.
However, if you want to declare something on the CPU and then move it to the GPU at runtime, there are two ways to go:
Create too_big_for_GPU inside __init__ without registering it as a model parameter (using torch.zeros, torch.randn, or any other init function), then move it to the GPU in the forward pass:
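import torch
import pytorch_lightning as pl

class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Plain attribute, not a Parameter or buffer, so it stays on the CPU
        self.too_big_for_GPU = torch.zeros(4, 1000, 1000, 1000)

    def forward(self, x):
        # Move tensor to same GPU as x and operate with it
        y = self.too_big_for_GPU.to(x.device) * x**2
        return y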
Alternatively, create too_big_for_GPU inside forward. It is created on the CPU by default, so you then need to move it to the GPU:
class MyModule(pl.LightningModule):
    def forward(self, x):
        # Create the tensor on the fly and move it to x GPU
        too_big_for_GPU = torch.zeros(4, 1000, 1000, 1000).to(x.device)
        # Operate with it
        y = too_big_for_GPU * x**2
        return y
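To get a feel for the sizes involved, the memory footprint of the example tensor above can be worked out by hand. This is a minimal sketch, assuming a dense tensor with torch's default float32 dtype (4 bytes per element); the helper name tensor_bytes is hypothetical.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor; 4 bytes per element matches float32."""
    return math.prod(shape) * bytes_per_element


size = tensor_bytes((4, 1000, 1000, 1000))
print(f"{size} bytes = {size / 2**30:.1f} GiB")
```

Note that both snippets above call .to(x.device) on the whole tensor, so a tensor of this size would still have to fit on the device at that moment. For the embedding case in the question, only the rows selected for the batch would need to be moved.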

HAL_GetTick() crash mcu

I created a simple project using STM32CubeMX for my Nucleo-F446ZE (STM32F446ZET6).
The project should be a USB HID device, but it fails to start. After messing around with the debugger, I discovered that the MCU's PC register goes to 0x00000000, 0xFFFFFFFF, or sometimes a random invalid value.
I didn't modify any code. I compiled the code with MDK-ARM (modified GCC, µVision IDE) and with GCC (OpenSTM32), and the same thing happens.
Call stack:
main
SystemClock_Config
HAL_RCC_ClockConfig (632)
HAL_GetTick
PS: RAM got corrupted after 0x080149A, and that is why the program does weird things.
Solution
CubeMX didn't set up the clocks correctly. Here is the setup I used to make USB work:
//RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSI;
RCC_OscInitStruct.OscillatorType = RCC_OSCILLATORTYPE_HSE;
//RCC_OscInitStruct.HSIState = RCC_HSI_ON;
//RCC_OscInitStruct.HSICalibrationValue = 16;
RCC_OscInitStruct.HSEState = RCC_HSE_ON;
RCC_OscInitStruct.PLL.PLLState = RCC_PLL_ON;
RCC_OscInitStruct.PLL.PLLSource = RCC_PLLSOURCE_HSE;
RCC_OscInitStruct.PLL.PLLM = 8;
RCC_OscInitStruct.PLL.PLLN = 192;
RCC_OscInitStruct.PLL.PLLP = RCC_PLLP_DIV4;
RCC_OscInitStruct.PLL.PLLQ = 4;
RCC_OscInitStruct.PLL.PLLR = 2;
The RCC_ClkInitStruct is probably not initialized properly (or at all)
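The PLL values in the solution above can be checked with a little arithmetic. On the STM32F4 main PLL, the VCO runs at f_in / PLLM * PLLN, PLLP divides it down to the system clock, and PLLQ divides it down to the 48 MHz domain that USB needs. This is a minimal sketch, assuming an 8 MHz HSE input (an assumption; the post does not state the HSE frequency); the helper name pll_outputs_mhz is hypothetical.

```python
def pll_outputs_mhz(f_in_mhz, pllm, plln, pllp, pllq):
    """Main PLL frequencies in MHz: VCO = f_in / PLLM * PLLN, then divided by PLLP and PLLQ."""
    vco_in = f_in_mhz / pllm
    vco_out = vco_in * plln
    return {
        "vco_in": vco_in,
        "vco_out": vco_out,
        "pllp_out": vco_out / pllp,  # candidate system clock
        "pllq_out": vco_out / pllq,  # USB / 48 MHz domain
    }


HSE_MHZ = 8  # assumption: HSE input frequency, not stated in the post
clocks = pll_outputs_mhz(HSE_MHZ, pllm=8, plln=192, pllp=4, pllq=4)
for name, mhz in clocks.items():
    print(f"{name}: {mhz:g} MHz")
```

Under that assumption, PLLQ = 4 yields exactly 48 MHz, which is what full-speed USB requires. With a different HSE frequency, PLLM would have to change to keep the same outputs.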

Arduino not writing to file

I'd like the Arduino to write to a file whenever an AJAX call is made. The AJAX call works, but nothing is written to the file. All other code inside the AJAX handler does execute.
void handle_ajax(){
  int startUrlIndex = HTTP_req.indexOf("button");
  int endUrlIndex = HTTP_req.indexOf(" HTTP");
  String url = HTTP_req.substring(startUrlIndex, endUrlIndex);
  int startButtonIndex = url.indexOf("device-") + 7; // 7 is length of device-, I really just want the number.
  int endButtonIndex = url.indexOf("&");
  String button = url.substring(startButtonIndex, endButtonIndex);
  int startStateIndex = url.indexOf("state=") + 6; // 6 is length of state=, I really just want the number.
  String state = url.substring(startStateIndex);
  int device = button.toInt();
  int newState = state.toInt();
  dim_light(device, newState * 12);
  write_config("", "text");
}
bool write_config(String line, String text){
  configFile = SD.open("config.ini", FILE_WRITE);
  if(configFile){
    configFile.write("Dipshit");
  }
  configFile.close();
  Serial.println("Works.");
  return true;
}
I don't see anything wrong with the code provided.
Check the basics first:
SD card is either a standard SD card or a SDHC card.
SD card is formatted with a FAT16 or FAT32 file system.
The correct pin has been used for the CS pin in the SD.begin() command. This depends on the hardware used. http://www.arduino.cc/en/Reference/SDCardNotes
The SPI is wired up correctly (pins 11, 12, and 13 on most Arduino boards).
The hardware SS pin is set as an output (even if it isn't used as the CS pin).
I know from past experience that these little Arduinos can run out of SRAM quite quickly when reading from and writing to an SD card. The ReadWrite example program alone uses about 50% of the UNO's SRAM!
To test whether this is your problem, run the SD card ReadWrite example program (with the correct CS pin in the SD.begin() command). If this works, then the problem is likely that you have run out of SRAM. Try using an Arduino MEGA 2560 instead, which has 4x the amount of SRAM.
Edit: The latest Arduino IDE (v1.6.8) actually calculates how much SRAM is used by global variables. It does not take local variables into account.
Found the problem: RAM.
The Arduino had insufficient RAM at the point of opening the SD card, which caused the failure.
If anyone else ever encounters the same issue: you need 300 or more bytes of free RAM. Check this by printing FreeRam() to serial.

Python OpenCV VideoWriter slows down at around 3000 write()

I am creating a video from 5100 full-size images (each about 5000 x 3000 px) using OpenCV and Python. It works great; however, at around image 3000 (with the AVI file at about 160 MB), the write() method slows down significantly.
I'm cropping each image and then resizing it before I write. I've put a lot of timing code below.
CPU isn't the issue, and there is lots of disk space. Has anyone seen this before? (I've searched high and low.) I've been looking for memory leaks, but the code seems too simple for that.
import numpy as np
import cv2, glob, time
files = glob.glob("./pics/*.JPG")
video = cv2.VideoWriter('test.avi',cv2.cv.CV_FOURCC('m', 'p', '4', 'v'),120,(1920,1080))
print "INDEX, IMREAD, CROP_IMG, RESIZE, WRITE, TOTAL"
for idx, file in enumerate(files):
    offset=idx/500 #just an offset to make sure that it reflects my real world.
    start = time.time()
    img = cv2.imread(file)
    imread_dur = time.time() - start
    start = time.time()
    crop_img = img[1000+offset:2050+offset, 500-offset:2370-offset]
    crop_dur = time.time() - start
    start = time.time()
    resized_image = cv2.resize(crop_img, (1920, 1080))
    resize_dur = time.time() - start
    start = time.time()
    video.write(resized_image)
    write_dur = time.time() - start
    print "%5d, %3.6f, %3.6f, %3.6f, %3.6f, %3.6f " % (idx,imread_dur, crop_dur, resize_dur, write_dur, imread_dur+crop_dur+resize_dur+write_dur)
video.release()
At around the 2900th image, video.write() goes from about 200 ms to around 2000 ms or more per image. It feels like a buffer problem on the output video file, but nothing seems configurable. (The CPU goes from almost a full core to ~3% as write() slows down.) By around image 3000, a single write() takes 13 s.
INDEX, IMREAD, CROP_IMG, RESIZE, WRITE, TOTAL
...
2847, 0.232032, 0.004127, 0.015886, 0.214999, 0.467044
2848, 0.233260, 0.003745, 0.013973, 0.214703, 0.465681
2849, 0.228818, 0.003882, 0.016408, 0.251602, 0.500710
...
3096, 0.238894, 0.003780, 0.013936, 13.183812, 13.440422
3097, 0.253249, 0.004052, 0.015534, 13.668831, 13.941666
I've looked under the covers and nothing is popping up. Anyone seen this?
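The per-stage timing in the loop above could be collected with a small helper that also flags the frames where a stage suddenly slows down. This is a minimal sketch in Python 3 (the question's code is Python 2); the StageTimer class, its method names, and the 5x-median threshold are hypothetical choices, and time.sleep stands in for the real video.write() call.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named stage."""

    def __init__(self):
        self.durations = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations[name].append(time.perf_counter() - start)

    def slow_indices(self, name, factor=5.0):
        """Indices whose duration exceeds factor times the median for that stage."""
        values = self.durations[name]
        median = statistics.median(values)
        return [i for i, v in enumerate(values) if v > factor * median]


timer = StageTimer()
for _ in range(3):
    with timer.stage("write"):
        time.sleep(0.001)  # stand-in for video.write(resized_image)

print(len(timer.durations["write"]), "write timings recorded")
```

Logging the first index returned by slow_indices("write") alongside the output file size at that point might help confirm whether the slowdown is tied to a size threshold in the container or to something else.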
