I'm a newbie in OpenCL programming.
My very first program is giving me a hard time. I wanted to query the device name and vendor name of every device on each platform. My system has two platforms: the first is an AMD platform and the second is the NVIDIA CUDA platform. I've written the following code to get the desired info.
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

int main(int argc, char **argv) {
    try {
        vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        cl_context_properties properties[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
        cl::Context context(CL_DEVICE_TYPE_ALL, properties);
        vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
        string dName(devices[0].getInfo<CL_DEVICE_NAME>());
        string vendor(devices[0].getInfo<CL_DEVICE_VENDOR>());
        cout << "\tDevice Name: " << dName << endl;
        cout << "\tDevice Vendor: " << vendor << endl;
    } catch (cl::Error err) {
        // printErrorString() is my own helper that maps error codes to strings
        cerr << err.what() << " error: " << printErrorString(err.err()) << endl;
        return 0;
    }
}
When I change the platform index from 0 to 1 in
cl_context_properties properties[] = {CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
(i.e., using platforms[1] instead of platforms[0]), my program crashes with a segmentation fault.
I really appreciate your help.
Thanks!
I suspect that you are using the cl.hpp header file from the AMD APP SDK? If that is the case, then the problem is that the header calls an OpenCL 1.2 function (I can't remember which one) that is supplied by the AMD devices in your system but not by the Nvidia GPU. Your Nvidia GPU only supports OpenCL 1.1. The best solution I know of is to use the OpenCL 1.1 header files from the Khronos website.
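As a side note, while you sort out the headers you can query device names per platform without creating a context at all, which also sidesteps the platform-index crash. A minimal sketch, assuming the Khronos OpenCL 1.1 cl.hpp (only entry points that exist in 1.1 are used):
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (size_t p = 0; p < platforms.size(); ++p) {
        std::vector<cl::Device> devices;
        // getDevices() queries one platform at a time, no context needed
        platforms[p].getDevices(CL_DEVICE_TYPE_ALL, &devices);
        for (size_t d = 0; d < devices.size(); ++d) {
            std::cout << "Platform " << p
                      << "\tDevice Name: " << devices[d].getInfo<CL_DEVICE_NAME>()
                      << "\tDevice Vendor: " << devices[d].getInfo<CL_DEVICE_VENDOR>()
                      << std::endl;
        }
    }
    return 0;
}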
I develop software that usually includes both OpenGL and the Nvidia CUDA SDK. Recently, I also started looking for ways to optimize the run-time memory footprint. I noticed the following (Debug and Release builds differ only by 4-7 MB):
Application startup - less than 1 MB total
OpenGL 4.5 context creation (+ GLEW loader init) - 45 MB total
CUDA 8.0 context (Driver API) creation - 114 MB total
If I create the OpenGL context in "headless" mode, the GL context uses 3 MB less, which probably would have gone to the default framebuffer allocation. That makes sense, as the window size is 640x360.
So after the OpenGL and CUDA contexts are up, the process already consumes 114 MB.
Now, I don't have deep knowledge of the OS-specific work that happens under the hood during GL and CUDA context creation, but 45 MB for GL and 68 MB for CUDA seems like a whole lot to me. I know that usually several megabytes go to system frame buffers and function pointers (and probably the bulk of the allocations happens on the driver side). But exceeding 100 MB with just "empty" contexts looks like too much.
I would like to know:
Why does GL/CUDA context creation consume such a considerable amount of memory?
Are there ways to optimize that?
The system setup under test:
Windows 10 64-bit. NVIDIA GTX 960 GPU (driver version 388.31). 8 GB RAM. Visual Studio 2015, 64-bit C++ console project.
I measure memory consumption using the Visual Studio built-in Diagnostic Tools -> Process Memory section.
UPDATE
I tried Process Explorer, as suggested by datenwolf. Here is a screenshot of what I got (my process is at the bottom, highlighted in yellow):
I would appreciate some explanation of that info. I was always looking at "Private Bytes" in the "VS Diagnostic Tools" window, but here I also see "Working Set", "WS Private", etc. Which one correctly shows how much memory my process currently uses? 281,320K looks like way too much, because, as I said above, the process at startup does nothing but create the CUDA and OpenGL contexts.
Partial answer: this is an OS-specific issue; on Linux, a CUDA context takes about 9.3 MB.
I'm using CUDA (not OpenGL) on GNU/Linux:
CUDA version: 10.2.89
OS distribution: Devuan GNU/Linux Beowulf (~= Debian Buster without systemd)
Kernel: Linux 5.2.0
Processor: Intel x86_64
To check how much memory gets used by CUDA when creating a context, I ran the following C program (which also checks what happens after context destruction):
#include <stdio.h>
#include <cuda.h>
#include <malloc.h>
#include <stdlib.h>

static void print_allocation_stats(const char* s)
{
    printf("%s:\n", s);
    printf("--------------------------------------------------\n");
    malloc_stats();
    printf("--------------------------------------------------\n\n");
}

int main()
{
    print_allocation_stats("Initially");
    int status = cuInit(0);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After CUDA driver initialization");
    int device_id = 0;
    unsigned flags = 0;
    CUcontext context_id;
    status = cuCtxCreate(&context_id, flags, device_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context creation");
    status = cuCtxDestroy(context_id);
    if (status != CUDA_SUCCESS) { return EXIT_FAILURE; }
    print_allocation_stats("After context destruction");
    return EXIT_SUCCESS;
}
(Note that malloc_stats() is a glibc-specific function, not part of the standard library.)
Summarizing the results and snipping irrelevant parts:
Point in program                     Total bytes    In-use     Max MMAP regions    Max MMAP bytes
Initially                                 135168       1632                   0                 0
After CUDA driver initialization          552960     439120                   2            307200
After context creation                   9314304    6858208                   8           6643712
After context destruction                7016448     580688                   8           6643712
So CUDA starts off with 0.5 MB and, after creating a context, takes up 9.3 MB (going back down to 7.0 MB after destroying the context). 9 MB is still a lot of memory for not having done anything; but maybe some of it is all-zeros, uninitialized, or copy-on-write, in which case it doesn't really take up that much memory.
It's possible that memory use improved dramatically over the two years between the driver releases for CUDA 8 and CUDA 10, but I doubt it. So it looks like your problem is Windows-specific.
Also, I should mention I did not create an OpenGL context - which is another part of the OP's question - so I haven't estimated how much memory that takes. The OP raises the question of whether the sum is greater than its parts, i.e. whether a CUDA context would take more memory if an OpenGL context existed as well; I believe this should not be the case, but readers are welcome to try and report...
I have to use a Microchip PIC for a new project (I needed a high pin count in a 64-pin TQFP package with 5 V operation).
I have a huge problem, and I might be missing something (sorry for that in advance).
IDE: MPLAB X 3.51
Compiler: XC8 1.41
The issue is that if I initialize an object to anything other than 0, it does not get initialized and is always zero by the time I reach main().
In the simulator it works, and the object has the proper value.
Simple example:
#include <xc.h>

static int x = 0x78;

void main(void) {
    while (x) {
        x++;
    }
    return;
}
In the simulator, x is 0x78 and while(x) is true.
BUT when I load the code onto the PIC18F67K40 using a PICkit 3, x is 0.
This happens even if I do a simple sprintf; it does nothing because the formatting string (char array) is full of zeros.
sprintf(buf, "Number is %u", x);
I cannot initialize any object to anything other than zero.
What is going on? Any help appreciated!
Found the problem. The chip has errata issues, and I got one of the affected silicon revisions; strange that Farnell sells it. Even stranger that the compiler is not prepared for it and does not even give a warning to be careful!
Errata note:
Module: PIC18 Core
3.1 TBLRD requires NVMREG value to point to appropriate memory
The affected silicon revisions of the PIC18FXXK40 devices improperly require the NVMREG<1:0> bits in the NVMCON register to be set for TBLRD access of the various memory regions. The issue is most apparent in compiled C programs when the user defines a const type and the compiler uses TBLRD instructions to retrieve the data from program Flash memory (PFM). The issue is also apparent when the user defines an array in RAM for which the compiler creates start-up code, executed before main(), that uses TBLRD instructions to initialize RAM from PFM.
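For reference, a rough sketch of the kind of manual workaround the errata points at for the const-read case: select the Program Flash Memory region before the compiler-generated table read. This is only an illustration; the bitfield spelling (NVMCON1bits.NVMREG) and the value 0b10 for PFM are assumptions based on the K40-family datasheet and the usual XC8 device-header naming, so check your own device header and the errata document before relying on it. The start-up initialization case (RAM initializers copied before main()) cannot be patched this way from user code; there the practical options are compiler-side errata support or unaffected silicon.
#include <xc.h>

const unsigned char table[] = {1, 2, 3, 4};   /* stored in PFM, read via TBLRD */

unsigned char read_table(unsigned char i)
{
    /* Errata workaround sketch (hypothetical, verify bitfield name and value
       against your device header): point NVMREG<1:0> at Program Flash Memory
       before the TBLRD sequence the compiler emits for the const access. */
    NVMCON1bits.NVMREG = 2;   /* assumed: 0b10 = Program Flash Memory */
    return table[i];
}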
My custom development board is based on x86, and one of the electronic components connected to it (mainly through SPI) cannot easily be controlled without using the vendor's kernel driver (and the vendor won't help if I don't use it). This module requires some configuration parameters that it gets from the device tree. I believe this module is mostly used on ARM platforms, where device trees are common.
On x86 the device tree is generally not needed, so it is disabled by default in the Linux kernel configuration. I changed the configuration to enable it, but I cannot find a way to put the device tree blob (DTB) into the boot image. There is only one DTS file for the x86 architecture in the kernel sources, but it doesn't seem to be used at all, so it doesn't help.
From the kernel documentation, I understand I need to put it in the setup_data field of the x86 real-mode kernel header, but I don't understand how to do that and when (at kernel build time? when building the bootloader?). Am I supposed to hack the arch/x86/boot/header.S file directly?
Right now, I've replaced the module configuration with hard-coded values, but using the device tree would be better.
On x86, the boot loader adds the Device Tree binary data (DTB) to the linked list of setup_data structures before calling the kernel entry point. The DTB can be loaded from a storage device or embedded into the boot loader image.
The following code shows how it's implemented in U-Boot.
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/lib/zimage.c:
static int setup_device_tree(struct setup_header *hdr, const void *fdt_blob)
{
    int bootproto = get_boot_protocol(hdr);
    struct setup_data *sd;
    int size;

    if (bootproto < 0x0209)
        return -ENOTSUPP;

    if (!fdt_blob)
        return 0;

    size = fdt_totalsize(fdt_blob);
    if (size < 0)
        return -EINVAL;

    size += sizeof(struct setup_data);
    sd = (struct setup_data *)malloc(size);
    if (!sd) {
        printf("Not enough memory for DTB setup data\n");
        return -ENOMEM;
    }

    sd->next = hdr->setup_data;
    sd->type = SETUP_DTB;
    sd->len = fdt_totalsize(fdt_blob);
    memcpy(sd->data, fdt_blob, sd->len);
    hdr->setup_data = (unsigned long)sd;

    return 0;
}
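On the kernel side (with device tree support enabled), the early x86 boot code walks this same setup_data list and picks up the SETUP_DTB entry. The following is only a sketch of that lookup, based on the setup_data layout defined in the x86 boot protocol documentation (asm/bootparam.h), not the actual kernel code:
/* struct setup_data layout (from asm/bootparam.h):
 *   __u64 next;   physical address of the next node, 0 terminates the list
 *   __u32 type;   e.g. SETUP_DTB for a flattened device tree blob
 *   __u32 len;    length of data[]
 *   __u8  data[]; payload - here, the DTB itself
 */
static void __init find_dtb_in_setup_data(void)
{
    u64 pa = boot_params.hdr.setup_data;

    while (pa) {
        struct setup_data *sd = early_memremap(pa, sizeof(*sd));

        if (sd->type == SETUP_DTB) {
            /* sd->data holds the DTB; the kernel unflattens it later in boot */
        }
        pa = sd->next;
        early_memunmap(sd, sizeof(*sd));
    }
}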
I'm trying to send some hand-crafted Ethernet packets via the WinPcap API pcap_sendpacket(), but I get two identical packets after invoking the API once. The two packets can be captured in Wireshark for debugging purposes, with identical data and consecutive frame numbers.
The environment is Win7 64-bit. And it is weird that the same code base running on another Win7 64-bit machine shows only one packet in Wireshark.
Edit:
[2016.1.24 19:30]
I'm sorry, I can only post the pcap-related parts of the code due to confidentiality.
// first, enumerate the device list
pcap_if_t *m_alldevs;
char errbuf[PCAP_ERRBUF_SIZE];
if (pcap_findalldevs(&m_alldevs, errbuf) == -1)
{
    // log error ...
}

// second, open the interface (selection of the desired adapter is omitted)
// use flag PCAP_OPENFLAG_MAX_RESPONSIVENESS to get responses quickly
// set the read timeout to 1000 ms
pcap_t* fp = NULL;
for (pcap_if_t *d = m_alldevs; d != NULL; d = d->next)
{
    fp = pcap_open(d->name, 65536,
                   PCAP_OPENFLAG_PROMISCUOUS | PCAP_OPENFLAG_MAX_RESPONSIVENESS,
                   1000, NULL, errbuf);
}

// third, we have the interface handle, so release the device list
pcap_freealldevs(m_alldevs);

// 4th, send data
// unsigned char* buf;
// int size;
pcap_sendpacket(fp, buf, size);
As for the packet: it is hand-crafted, with a size between 64 and 1500 bytes, an IEEE 802.3-type frame header, and customized MAC address fields.
On the machine that has the error, the WinPcap version is "4.1.0.2980" and Wireshark is "64-bit 1.12.3"; I will check the other machine, which does not have the error, tomorrow.
Edit:
[2016.1.26 10:30]
The WinPcap version is "4.1.0.2980", the same as on the machine with the error. The Wireshark version is "64-bit 1.12.8". Both operating systems are Win7 Enterprise 64-bit.
I had the same problem.
My steps to resolve it:
Uninstall WinPcap and Npcap (I had both on my local machine).
Install only Npcap.
Use delayed DLL loading according to https://nmap.org/npcap/guide/npcap-devguide.html, section "For software that want to use Npcap first when Npcap and WinPcap coexist".
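For the delayed-loading step, the approach described in the Npcap developer guide is roughly: link wpcap.dll with the /DELAYLOAD linker option, and before the first pcap call prepend the Npcap install folder to the DLL search path so Npcap's wpcap.dll is found instead of WinPcap's. A minimal sketch along those lines (error handling trimmed):
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

// Call this before any pcap_* function. wpcap.dll must be linked with the
// /DELAYLOAD:wpcap.dll (and delayimp.lib) options so it is not resolved
// until after the search path has been adjusted.
BOOL LoadNpcapDlls(void)
{
    _TCHAR npcap_dir[512];
    UINT len = GetSystemDirectory(npcap_dir, 480);
    if (!len) {
        fprintf(stderr, "Error in GetSystemDirectory: %x\n", GetLastError());
        return FALSE;
    }
    _tcscat_s(npcap_dir, 512, _T("\\Npcap"));   // Npcap installs its DLLs here
    if (SetDllDirectory(npcap_dir) == 0) {
        fprintf(stderr, "Error in SetDllDirectory: %x\n", GetLastError());
        return FALSE;
    }
    return TRUE;
}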
I have been working on image processing with OpenCV 2.2.0 for a year.
I get a memory allocation error ONLY if I try to allocate a >2 GB IplImage, whereas the same allocation with CvMat works. I can allocate whatever I want using CvMat; I also tried >10 GB.
OpenCV was compiled as 64-bit, and so is this simple application. Furthermore, I'm sure the application runs in 64-bit mode, as I can see from the Task Manager. The OS (Windows 7) is 64-bit too.
#include <cstdio>
#include <cstdlib>
#include <opencv2/core/core_c.h>

int main(int argc, char* argv[])
{
    printf("trying to allocate >2GB matrix...\n");
    CvMat *huge_matrix = cvCreateMat(40000, 30000, CV_16UC1);
    cvSet(huge_matrix, cvScalar(5));
    printf("...done!\n\n");
    system("PAUSE");

    printf("trying to allocate >2GB image...\n");
    IplImage *huge_img = cvCreateImage(cvSize(40000, 30000), IPL_DEPTH_16U, 1);
    cvSet(huge_img, cvScalar(5));
    printf("...done!\n\n");
    system("PAUSE");

    cvReleaseMat(&huge_matrix);
    cvReleaseImage(&huge_img);
    return 0;
}
The error message is "Insufficient memory: in unknown function ...". Can it be a bug?
The IplImage structure does not support images bigger than 2 GB because it stores the total image size in a field of type int. Even if you allocate an IplImage bigger than 2 GB with some hack, other methods will not be able to process it correctly. OpenCV inherited the IplImage structure from the Intel Image Processing Library, so there is no chance the format will be changed.
You should use the newer structures (CvMat in the C interface or cv::Mat in the C++ interface) to operate on huge images.
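For illustration, here is the same >2 GB allocation with the C++ interface; a minimal sketch, assuming a 64-bit build of OpenCV (the 40000x30000 CV_16UC1 size is taken from your example):
#include <opencv2/core/core.hpp>

int main()
{
    // 40000 x 30000 x 16-bit = roughly 2.4 GB of pixel data; this works on
    // 64-bit builds because the total allocation is computed with size_t
    cv::Mat huge(40000, 30000, CV_16UC1);
    huge.setTo(cv::Scalar(5));
    return 0;
}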