cudaGetDeviceCount returns 1 instead of 2 - Windows

I have a GPU cluster composed of two Tesla M2050 cards, and when I execute my code, cudaGetDeviceCount returns only 1. If I try to select device 1 with cudaSetDevice, it gives me this error: invalid device ordinal. In the Windows Device Manager, both devices are listed. If needed, this is my source code:
cutilSafeCall(cudaGetDeviceCount(&num_devices));
for (device = 0; device < num_devices; device++) {
    cudaDeviceProp properties;
    cudaGetDeviceProperties(&properties, device);
    printf("Device ID:\t%d\n", device);
    printf("Device Name:\t%s\n", properties.name);
    /* totalGlobalMem and totalConstMem are size_t; cast for a portable format */
    printf("Global memory:\t%llu\n", (unsigned long long)properties.totalGlobalMem);
    printf("Constant memory:\t%llu\n", (unsigned long long)properties.totalConstMem);
    printf("Warp size:\t%d\n", properties.warpSize);
}
devs = 0;
ParseArguments(argc, argv);
cutilSafeCall(cudaSetDevice(devs));
Any help would be appreciated.
Edit: output of deviceQuery.exe:
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "Tesla M2050"
CUDA Driver Version: 5.50
CUDA Runtime Version: 4.20
CUDA Capability Major/Minor version number: 2.0
...
...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.50, CUDA Runtime Version = 4.20, NumDevs = 1, Device = Tesla M2050
PASSED
Press <Enter> to Quit...
-----------------------------------------------------------

If you have two CUDA GPUs in a single node and deviceQuery reports only one, consider the following possibilities:
Check that both GPUs are functioning correctly by running nvidia-smi; if only one is shown, check that the other is seated correctly.
Check that the environment variable CUDA_VISIBLE_DEVICES is not set.
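Independent of the SDK's cutilSafeCall helpers, it can also help to enumerate devices with explicit error checking, so a failure in the runtime itself is not silently swallowed. A minimal standalone sketch using only the plain CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int num_devices = 0;
    cudaError_t err = cudaGetDeviceCount(&num_devices);
    if (err != cudaSuccess) {
        /* If the count query itself fails, the value of num_devices is meaningless. */
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Runtime sees %d CUDA device(s)\n", num_devices);
    for (int device = 0; device < num_devices; ++device) {
        cudaDeviceProp properties;
        if (cudaGetDeviceProperties(&properties, device) == cudaSuccess) {
            printf("Device %d: %s\n", device, properties.name);
        }
    }
    return 0;
}

If this also reports one device while nvidia-smi shows two, the problem lies below the application layer (driver or device visibility), not in your code.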

Related

Rust debugging doesn't stop at the breakpoints when debugging stm32f407 via openocd and gdb

I have a problem debugging an stm32f407vet6 board with Rust code.
The crux of the problem is that GDB ignores breakpoints.
After setting breakpoints and executing the "continue" command in GDB, the program runs past all breakpoints.
The only way to stop the program is to trigger an interrupt with Ctrl+C.
After this command, the board halts on whatever line is currently being executed.
I have tried to set breakpoints on every line where breakpoints can be set, but all attempts are unsuccessful.
$ openocd
Open On-Chip Debugger 0.10.0 (2020-07-01) [https://github.com/sysprogs/openocd]
Licensed under GNU GPL v2
libusb1 09e75e98b4d9ea7909e8837b7a3f00dda4589dc3
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : auto-selecting first available session transport "hla_swd". To override use 'transport select <transport>'.
Info : The selected transport took over low-level target control. The results might differ compared to plain JTAG/SWD
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : clock speed 2000 kHz
Error: libusb_open() failed with LIBUSB_ERROR_NOT_SUPPORTED
Info : STLINK V2J35S7 (API v2) VID:PID 0483:3748
Info : Target voltage: 6.436364
Info : stm32f4x.cpu: hardware has 6 breakpoints, 4 watchpoints
Info : starting gdb server for stm32f4x.cpu on 3333
Info : Listening on port 3333 for gdb connections
$ arm-none-eabi-gdb -q target\thumbv7em-none-eabihf\debug\test_blink
Reading symbols from target\thumbv7em-none-eabihf\debug\test_blink...
(gdb) target remote :3333
Remote debugging using :3333
0x00004070 in core::ptr::read_volatile (src=0xe000e010) at C:\Users\User\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src/libcore/ptr/mod.rs:1005
1005 pub unsafe fn read_volatile<T>(src: *const T) -> T {
(gdb) load
Loading section .vector_table, size 0x1a8 lma 0x0
Loading section .text, size 0x47bc lma 0x1a8
Loading section .rodata, size 0xbf0 lma 0x4970
Start address 0x47a2, load size 21844
Transfer rate: 100 KB/sec, 5461 bytes/write.
(gdb) b main
Breakpoint 1 at 0x1f2: file src\main.rs, line 15.
(gdb) continue
Continuing.
Program received signal SIGINT, Interrupt.
0x00001530 in cortex_m::peripheral::syst::<impl cortex_m::peripheral::SYST>::has_wrapped (self=0x1000fc6c)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.3\src\peripheral/syst.rs:135
135 pub fn has_wrapped(&mut self) -> bool {
(gdb) bt
#0 0x00001530 in cortex_m::peripheral::syst::<impl cortex_m::peripheral::SYST>::has_wrapped (self=0x1000fc6c)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\cortex-m-0.6.3\src\peripheral/syst.rs:135
#1 0x00003450 in <stm32f4xx_hal::delay::Delay as embedded_hal::blocking::delay::DelayUs<u32>>::delay_us (self=0x1000fc6c, us=500000)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\stm32f4xx-hal-0.8.3\src/delay.rs:69
#2 0x0000339e in <stm32f4xx_hal::delay::Delay as embedded_hal::blocking::delay::DelayMs<u32>>::delay_ms (self=0x1000fc6c, ms=500)
at C:\Users\User\.cargo\registry\src\github.com-1ecc6299db9ec823\stm32f4xx-hal-0.8.3\src/delay.rs:32
#3 0x00000318 in test_blink::__cortex_m_rt_main () at src\main.rs:40
#4 0x000001f6 in main () at src\main.rs:15
memory.x file:
MEMORY
{
  /* NOTE 1 K = 1 KiBi = 1024 bytes */
  /* TODO Adjust these memory regions to match your device memory layout */
  /* These values correspond to the LM3S6965, one of the few devices QEMU can emulate */
  CCMRAM : ORIGIN = 0x10000000, LENGTH = 64K
  RAM    : ORIGIN = 0x20000000, LENGTH = 128K
  FLASH  : ORIGIN = 0x00000000, LENGTH = 512K
}
/* This is where the call stack will be allocated. */
/* The stack is of the full descending type. */
/* You may want to use this variable to locate the call stack and static
variables in different memory regions. Below is shown the default value */
_stack_start = ORIGIN(CCMRAM) + LENGTH(CCMRAM);
/* You can use this symbol to customize the location of the .text section */
/* If omitted the .text section will be placed right after the .vector_table
section */
/* This is required only on microcontrollers that store some configuration right
after the vector table */
/* _stext = ORIGIN(FLASH) + 0x400; */
/* Example of putting non-initialized variables into custom RAM locations. */
/* This assumes you have defined a region RAM2 above, and in the Rust
sources added the attribute `#[link_section = ".ram2bss"]` to the data
you want to place there. */
/* Note that the section will not be zero-initialized by the runtime! */
/* SECTIONS {
     .ram2bss (NOLOAD) : ALIGN(4) {
       *(.ram2bss);
       . = ALIGN(4);
     } > RAM2
   } INSERT AFTER .bss;
*/
openocd.cfg file:
# Sample OpenOCD configuration for the STM32F3DISCOVERY development board
# Depending on the hardware revision you got you'll have to pick ONE of these
# interfaces. At any time only one interface should be commented out.
# Revision C (newer revision)
source [find interface/stlink.cfg]
# Revision A and B (older revisions)
# source [find interface/stlink-v2.cfg]
source [find target/stm32f4x.cfg]
# use hardware reset, connect under reset
# reset_config none separate
main.rs file:
#![no_main]
#![no_std]
#![allow(unsafe_code)]

// Halt on panic
#[allow(unused_extern_crates)] // NOTE(allow) bug rust-lang/rust#53964
extern crate panic_halt; // panic handler

use cortex_m;
use cortex_m_rt::entry;
use stm32f4xx_hal as hal;

use crate::hal::{prelude::*, stm32};

#[entry]
fn main() -> ! {
    if let (Some(dp), Some(cp)) = (
        stm32::Peripherals::take(),
        cortex_m::peripheral::Peripherals::take(),
    ) {
        let rcc = dp.RCC.constrain();
        let clocks = rcc
            .cfgr
            .sysclk(168.mhz())
            .freeze();
        let mut delay = hal::delay::Delay::new(cp.SYST, clocks);
        let gpioa = dp.GPIOA.split();
        let mut l1 = gpioa.pa6.into_push_pull_output();
        let mut l2 = gpioa.pa7.into_push_pull_output();
        loop {
            l1.set_low().unwrap();
            l2.set_high().unwrap();
            delay.delay_ms(500u32);
            l1.set_high().unwrap();
            l2.set_low().unwrap();
            delay.delay_ms(500u32);
        }
    }
    loop {}
}
Cargo.toml file:
[package]
name = "test_blink"
version = "0.1.0"
authors = ["Alex"]
edition = "2018"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
embedded-hal = "0.2"
nb = "0.1.2"
cortex-m = "0.6"
cortex-m-rt = "0.6"
# Panic behaviour, see https://crates.io/keywords/panic-impl for alternatives
panic-halt = "0.2"
cortex-m-log = "0.6.2"
[dependencies.stm32f4xx-hal]
version = "0.8.3"
features = ["rt", "stm32f407"]
I am new to embedded Rust and maybe I have done something wrong, but I have already tried all the options I could find on the Internet.
At first I thought it was a problem with the cortex-debug plugin for VS Code and even created an issue, but the maintainers couldn't help me because the problem is evidently not on their side.
Debugging C code in CubeIDE works, so I dare to assume that the problem is somewhere in the rust/gdb/openocd chain. Perhaps I am missing something, but unfortunately I cannot find it myself yet.
I would appreciate any resources or ideas to solve this problem.
I hope you have checked out these resources:
Discovery - debug
From your screen-grab of arm-none-eabi-gdb, it does indeed look like it did not hit the breakpoint.
You should have seen this message afterwards:
Note: automatically using hardware breakpoints for read-only addresses.
Breakpoint 1, main () at ...
Did you compile your source with symbols, and unoptimised?
Your config all looks right to me otherwise.
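To make the "symbols, unoptimised" check concrete, here is a minimal sketch of the relevant Cargo.toml profile section. These values are already the defaults for a plain cargo build (no --release), so this only makes them explicit:

[profile.dev]
debug = true      # emit full debug info so GDB can map addresses to source lines
opt-level = 0     # no optimisation, so breakpoints land where you set them

If you were actually flashing a --release build, adding debug = true under [profile.release] at least restores symbols, though optimised code can still step unpredictably.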

What is the meaning of this code line, and what is the solution for the error? I have this issue in Sniper Simulator version 7.2

What is the meaning of this code line, and what is the solution for the error? I have this issue in Sniper Simulator version 7.2 with Pin 3.5 on Linux Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux. My gcc version is 8.3.0.
Besides, this is not my code...
The code is:
IALARM* ALARM_MANAGER::GenAddress(){
    string hex = "0x";
    BOOL ctxt = _control_chain->NeedContext();
    if (_alarm_value.compare(0, 2, hex) == 0){
        //this is a raw address
        return new ALARM_ADDRESS(_alarm_value,_tid,_count,ctxt,this);
    }
    if (_alarm_value.find("+",0) == string::npos){
        //this is a symbol
        return new ALARM_SYMBOL(_alarm_value,_tid,_count,ctxt,this);
    }
    else{
        vector<string> tokens;
        PARSER::SplitArgs("+",_alarm_value,tokens);
        return new ALARM_IMAGE(tokens[0],tokens[1],_tid,_count,ctxt,this);
    }
}
The error is:
alarm_manager.cpp:137:67: error: ‘new’ of type ‘CONTROLLER::ALARM_SYMBOL’ with extended alignment 64 [-Werror=aligned-new=]
return new ALARM_SYMBOL(_alarm_value,_tid,_count,ctxt,this);
^
alarm_manager.cpp:157:64: note: uses ‘void* operator new(size_t)’, which does not have an alignment parameter
alarm_manager.cpp:157:64: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
As Kamil said, it gets solved by adding -faligned-new to the relevant makefile:
use ‘-faligned-new’ to enable C++17 over-aligned new support
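A hedged sketch of that fix, assuming the build honours a CXXFLAGS-style variable (the exact variable name in the Sniper/Pin build files may differ):

# in the makefile that compiles alarm_manager.cpp
CXXFLAGS += -faligned-new

The flag tells pre-C++17 GCC to accept operator new for types with alignment greater than the default (here the 64-byte-aligned CONTROLLER::ALARM_SYMBOL); compiling with -std=c++17 might achieve the same, since aligned new is part of C++17.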

Can't set GPU as Theano device on Windows

My PC has an NVIDIA 1050 Ti GPU and runs Windows 10.
I installed the CUDA drivers (I had to install Visual Studio with the C++ tools) and ran the tests, then created a new conda environment with theano and pygpu (Python 3.6.3) and ran this script:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

print(config.device)
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print('Looping %d times took' % iters, t1 - t0, 'seconds')
print('Result is', r)
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
Everything worked (running on the CPU).
Then I created a .theanorc.txt file and added to it:
#!sh
[global]
device = gpu
floatX = float32
I tried to run the script again, and the result was about 5000 lines of output followed by:
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
['nvcc', '-shared', '-O3', '-Xlinker', '/DEBUG', '-D HAVE_ROUND', '-m64', '-Xcompiler', '-DCUDA_NDARRAY_CUH=mc72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,/Zi,/MD', '-I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\theano\\sandbox\\cuda"', '-I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\numpy\\core\\include"', '-I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\include"', '-I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\theano\\gof"', '-L"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\libs"', '-L"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets"', '-o', 'C:\\Users\\JoaquimFerrer\\AppData\\Local\\Theano\\compiledir_Windows-10-10.0.16299-SP0-Intel64_Family_6_Model_158_Stepping_9_GenuineIntel-3.6.3-64\\cuda_ndarray\\cuda_ndarray.pyd', 'mod.cu', '-lcublas', '-lpython36', '-lcudart']
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 1, 'for cmd', 'nvcc -shared -O3 -Xlinker /DEBUG -D HAVE_ROUND -m64 -Xcompiler -DCUDA_NDARRAY_CUH=mc72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,/Zi,/MD -I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\theano\\sandbox\\cuda" -I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\numpy\\core\\include" -I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\include" -I"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\lib\\site-packages\\theano\\gof" -L"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets\\libs" -L"C:\\Users\\JoaquimFerrer\\Anaconda3\\envs\\neural_nets" -o C:\\Users\\JoaquimFerrer\\AppData\\Local\\Theano\\compiledir_Windows-10-10.0.16299-SP0-Intel64_Family_6_Model_158_Stepping_9_GenuineIntel-3.6.3-64\\cuda_ndarray\\cuda_ndarray.pyd mod.cu -lcublas -lpython36 -lcudart')
WARNING (theano.sandbox.cuda): The cuda backend is deprecated and will be removed in the next release (v0.10). Please switch to the gpuarray backend. You can get more information about how to switch at this URL:
https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: cuda unavailable)
gpu
and then the output of the script running on the CPU.
I added cl to the PATH, and now I can run it in the console, but the output didn't change.
Can someone help? Even if it means reinstalling everything in a totally new way.
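One hedged thing to try: the nvcc fatal : Cannot find compiler 'cl.exe' in PATH failure can persist when the process that launches Theano does not inherit the console PATH. The old CUDA backend lets you point nvcc at the directory containing cl.exe directly in .theanorc via the [nvcc] section; the path below is an assumption, adjust it to your Visual Studio installation:

[nvcc]
# hypothetical path - point this at the folder that actually contains cl.exe
compiler_bindir = C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin

Note also the deprecation warning in the log: the sandbox.cuda backend is deprecated, so switching to the gpuarray backend (device = cuda) as the warning suggests may sidestep this nvcc path handling altogether.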

theano fails to compile cuda but the python code runs using GPU

I am trying to run a simple Theano example on Ubuntu 16.04 with CUDA 8.0 on an NVIDIA 1060 GPU, within a Python virtual environment created by Anaconda. The following is my theanorc file:
[global]
floatX = float32
device = cuda
The code I am trying to run is a short sample from the Theano website:
from theano import function, config, shared, tensor
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], tensor.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
And when I run the code I get a bunch of warnings and the following error:
ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: ('nvcc return status', 1, 'for cmd', 'nvcc -shared -O3 -m64 -Xcompiler -DCUDA_NDARRAY_CUH=c72d035fdf91890f3b36710688069b2e,-DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC,-fvisibility=hidden -Xlinker -rpath,/home/eb/.theano/compiledir_Linux-4.8--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/cuda_ndarray -I/home/eb/anaconda2/envs/deep/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/eb/anaconda2/envs/deep/lib/python2.7/site-packages/numpy/core/include -I/home/eb/anaconda2/envs/deep/include/python2.7 -I/home/eb/anaconda2/envs/deep/lib/python2.7/site-packages/theano/gof -L/home/eb/anaconda2/envs/deep/lib -o /home/eb/.theano/compiledir_Linux-4.8--generic-x86_64-with-debian-stretch-sid-x86_64-2.7.13-64/cuda_ndarray/cuda_ndarray.so mod.cu -lcublas -lpython2.7 -lcudart')
Can not use cuDNN on context None: cannot compile with cuDNN. We got this error:
/tmp/try_flags_M8OZOh.c:4:19: fatal error: cudnn.h: No such file or directory
compilation terminated.
Mapped name None to device cuda: GeForce GTX 1060 6GB (0000:01:00.0)
Surprisingly, the code RUNS and prints the desired output as follows:
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.365814 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
I was wondering if I am missing a Theano config option or something? Any idea what's going wrong?
P.S. All libraries have been installed in my Python virtual env, except for the CUDA library, which is installed at the system level.
Thanks
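A hedged observation and sketch: the two messages come from different backends. With device = cuda, Theano uses the new gpuarray backend, which is what ultimately runs your function (per Mapped name None to device cuda), while the failing cuda_ndarray.cu compile belongs to the deprecated sandbox.cuda backend, so that ERROR is largely noise. The Can not use cuDNN line just means cudnn.h is not on the compiler's search path; assuming a system-level CUDA 8.0 install with cuDNN unpacked alongside it (the paths below are assumptions), pointing Theano at it in .theanorc usually resolves it:

[dnn]
# hypothetical locations - use wherever cudnn.h and libcudnn.so actually live
include_path = /usr/local/cuda-8.0/include
library_path = /usr/local/cuda-8.0/lib64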

linux systemtap register error

I use SystemTap to probe slab memory allocation activity.
#! /usr/bin/env stap
global slabs

probe vm.kmem_cache_alloc {
    slabs[execname(), bytes_req] <<< 1
}

probe timer.ms(10000)
{
    dummy = "";
    foreach ([name, bytes] in slabs) {
        if (dummy != name)
            printf("\nProcess:%s\n", name);
        /* aggregate extraction uses @count; '#' starts a comment in stap */
        printf("Slab_size:%d\tCount:%d\n", bytes, @count(slabs[name, bytes]));
        dummy = name;
    }
    delete slabs
    printf("\n-------------------------------------------------------\n\n")
}
but stap produces the following errors:
[root@svr_test5 ~]# stap -v -u vm.tracepoints.stp
Pass 1: parsed user script and 85 library script(s) using 146832virt/23712res/3012shr/21396data kb, in 140usr/10sys/152real ms.
Pass 2: analyzed script: 3 probe(s), 111 function(s), 3 embed(s), 13 global(s) using 228472virt/45000res/4760shr/41696data kb, in 300usr/150sys/488real ms.
Pass 3: translated to C into "/tmp/stap7FrdOq/stap_1d0a8db65ecd4c9f56be318001d197c0_39617_src.c" using 226240virt/47000res/6800shr/41696data kb, in 10usr/0sys/36real ms.
Pass 4: compiled C into "stap_1d0a8db65ecd4c9f56be318001d197c0_39617.ko" in 1360usr/160sys/1546real ms.
Pass 5: starting run.
WARNING: probe kernel.function("kmem_cache_alloc#mm/slab.c:3269").call (address 0xffffffff8000ac24) registration error (rc -84)
WARNING: probe kernel.function("kmem_cache_alloc#mm/slab.c:3269").return (address 0xffffffff8000ac24) registration error (rc -84)
I guess this means the probe kernel module was not registered, and so the probes have no effect.
My OS:
CentOS release 5.8 (Final)
Kernel:
Linux svr_test5 2.6.18-308.el5 #1 SMP Tue Feb 21 20:06:06 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
So, what does this WARNING mean, and how do I fix it?
WARNING: probe [...] registration error (rc -84)
This is an indication of a kernel kprobe error EILSEQ, which is issued when the kernel is unable to decode/confirm the binary instruction sequence at the requested address.
For SystemTap 1.8 (the last version officially updated for RHEL5) against a RHEL5.11 kernel (2.6.18-400), it happens to work; perhaps kprobes improvements did the job.
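Before or after upgrading, a hedged way to check how the vm.kmem_cache_alloc alias resolves on a given kernel is to list the matching probe points rather than run the full script:

# show which kernel.function(...) probes the alias expands to on this kernel
stap -l 'vm.kmem_cache_alloc'

If the resolved address cannot be instrumented (as the EILSEQ warnings suggest here), the module still loads and the script still runs; the affected probes simply never fire, which matches the silent behaviour you are seeing.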
