Querying the set of active CUDA kernels on a GPU

Is there a way to ask the GPU (or driver) to list the set of active (or dispatched or issued) CUDA kernels on a GPU, without attaching cuda-gdb to the owning CPU process and suspending it?
I'm imagining something like pstack, where the interface might look like:
> list-cuda-kernels $pid
gpu 0: kernel_foo
gpu 0: kernel_bar
gpu 1: kernel_baz

There is no tool or API that lists the currently running kernels other than cuda-gdb (or any other CUDA debugger, for that matter).

Related

Does calling local_irq_disable() from the kernel also disable local interrupts in userspace?

From the kernel I can call local_irq_disable(). To my understanding it will disable the interrupts of the current CPU. And interrupts will remain disabled until I call local_irq_enable(). Please correct me if my understanding is incorrect.
If my understanding is correct, does it mean that after calling local_irq_disable(), interrupts are also disabled for a user space process that is running on that same CPU?
More details:
I have a process running in user space that I want to run without being affected by interrupts and context switches. As that is not possible from user space, I thought that disabling interrupts and kernel preemption for a particular CPU from the kernel would help. Therefore, I wrote a simple device driver that disables kernel preemption and local interrupts using the following code:
int i = irqs_disabled();
pr_info("before interrupt disable: %d\n", i);
pr_info("module is loaded on processor: %d\n", smp_processor_id());
id = get_cpu();
message[1] = smp_processor_id() + '0';
local_irq_disable();
printk(KERN_INFO " Current CPU id is %c\n", message[1]);
printk(KERN_INFO " local_irq_disable() called, Disable local interrupts\n");
pr_info("After interrupt disable: %d\n", irqs_disabled());
output: $dmesg
[22690.997561] before interrupt disable: 0
[22690.997564] Current CPU id is 1
[22690.997565] local_irq_disable() called, Disable local interrupts
[22690.997566] After interrupt disable: 1
I think the output confirms that local_irq_disable() does disable local interrupts.
After disabling kernel preemption and interrupts, I use CPU_SET() in user space to pin my process to that particular CPU. But after doing all this I am still not getting the desired outcome. So it seems that disabling interrupts on a particular CPU from the kernel does not also disable them for a user space process running on that CPU. I'm confused.
I was looking for an answer to the above question but could not get any suitable answer.

The time a CPU spends with interrupts disabled should be short, because it affects the whole OS. For that reason, running user space code with interrupts disabled is considered bad practice and is not supported by the Linux kernel.
It is the responsibility of the kernel module to wrap only kernel code in local_irq_disable / local_irq_enable. Sometimes the kernel itself can "fix" incorrect usage of these functions, but that shouldn't be relied upon when writing a module.
"I have a process running in the userspace which I want to run without affected by interrupts and context switch."
Protection from context switches can be achieved by properly setting the scheduling policy, affinity and priority of the process. That way, the scheduler should have no reason to reschedule your process. There are several questions on Stack Overflow about making a CPU exclusive to a selected process.
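As a minimal sketch of the affinity and scheduling-policy part, assuming Linux and Python's standard os module (the function name pin_and_prioritize and the priority value 50 are illustrative choices, not taken from the answer): the process pins itself to one CPU and then asks for the SCHED_FIFO real-time policy. Requesting SCHED_FIFO normally needs root or CAP_SYS_NICE, so the sketch keeps the default policy when permission is denied. Neither step stops interrupts from being delivered to that CPU.

```python
import os

def pin_and_prioritize(cpu, priority=50):
    """Pin the calling process to one CPU and request SCHED_FIFO.

    Returns (affinity_set, got_realtime). Both the name and the default
    priority are hypothetical choices made for this sketch.
    """
    # pid 0 means "the calling process"
    os.sched_setaffinity(0, {cpu})
    try:
        # SCHED_FIFO priorities range from 1 (lowest) to 99 (highest) on Linux
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        got_realtime = True
    except PermissionError:
        # Without root or CAP_SYS_NICE the process keeps its current policy
        got_realtime = False
    return os.sched_getaffinity(0), got_realtime
```

Calling pin_and_prioritize(min(os.sched_getaffinity(0))) pins the process to the lowest-numbered CPU it is currently allowed to run on. For the CPU to be truly exclusive, other tasks still have to be kept off it, for example with the boot-time options mentioned in this answer.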
As for interrupts, they shouldn't be disabled for user code.
If user code accesses some hardware that requires interrupts to be disabled, then consider moving that code into kernel space.
If even rare interrupts badly affect the performance or timing of your process, then try to reconfigure the Linux kernel to be "more real time". There are also some boot-time configuration options that could help to further reduce the number of interrupts on specific cores. See, e.g., this question: Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?.
Note that the Linux kernel is not a base for a real-time OS and was never intended to be. So, if no configuration or boot settings help, consider choosing another OS for your application, one that is real-time.

GPU usage shows zero when using CUDA with PyTorch on Windows

I have a PyTorch script.
import torch
torch.cuda.is_available()
# True
device=torch.device('cuda:0')
# I moved my tensors to device
But Windows Task Manager shows zero GPU (NVIDIA GTX 1050 Ti) usage while the PyTorch script is running.
The speed of my script is fine, and if I change torch.device to the CPU instead of the GPU it becomes slower, so CUDA (the GPU) is working. Why doesn't Windows Task Manager show the GPU usage?
Sample of my code:
import torch
from PIL import Image
from torchvision import transforms

device = torch.device("cuda:0")
# Load the saved model directly onto the GPU
model = torch.load('mymodel.pth', map_location=device)
image = Image.open('picture.png').convert('RGB')
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
input = transform(image)
input = torch.unsqueeze(input, 0)  # add a batch dimension
input = input.to(device)           # move the input to the GPU
output = model(input)

The overall utilization figure in Windows Task Manager does not seem to include CUDA usage. Make sure you select the Cuda option in one of the graphs.
For details see: https://medium.com/@michaelceber/gpu-monitoring-on-windows-10-for-machine-learning-cuda-41088de86d65

Just calling torch.device('cuda:0') doesn't actually use the GPU. It's just an identifier for a device.
Instead, following the documentation, you should move your tensors and models to the GPU.
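import torch
# Create the tensor directly on the GPU
torch.randn((2,3), device=torch.device('cuda:0'))
# Or create it on the CPU and move it; .to() returns a new tensor
tensor = torch.randn((2,3))
cuda0 = torch.device('cuda:0')
tensor = tensor.to(cuda0)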

Please install GPU-Z and then you will be able to see the correct GPU load in Windows.
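The load can also be cross-checked by asking the driver directly. The sketch below assumes that nvidia-smi, the command-line tool installed with the NVIDIA driver, is on the PATH; the helper names are made up for illustration. It reads the utilization percentage that the driver reports for each GPU.

```python
import subprocess

# Ask nvidia-smi for "index, utilization" rows as plain CSV without header or units
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]

def parse_gpu_utilization(csv_text):
    """Turn rows such as '0, 37' into (gpu_index, percent) tuples.

    percent is None when the driver reports a non-numeric value
    (for example '[N/A]') for that GPU.
    """
    readings = []
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        readings.append((int(index), int(util) if util.isdigit() else None))
    return readings

def read_gpu_utilization():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_utilization(result.stdout)
```

Calling read_gpu_utilization() from a second terminal while the PyTorch script is running should show a non-zero percentage if the model really executes on the GPU. A single short inference may finish between two samples, so polling in a loop is more reliable than one reading.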

Inter-processor interrupts in ARM Cortex-A9 (How to write a handler for a software generated interrupt (ARM) in Linux?)

I read that the software generated interrupts (SGIs) in ARM are used as inter-processor interrupts. I can also see that 5 of those interrupts are already in use. I also know that ARM provides 16 software generated interrupts.
In my application I am running a bare-metal application on one of the ARM Cortex cores and Linux on the other. I want to communicate some data from the core running the bare-metal application to the core running Linux. I plan to copy the data to the on-chip memory (which is shared) and then trigger an SGI on the core running Linux to indicate that some data is available for it to process. I am already able to generate the SGI from the core running the bare-metal application. But for handling the interrupt on the Linux side, I am not sure which SGI IRQ numbers are free, and I am also not sure whether I can use the IRQ number directly (in general SGIs are numbered 0-15). Does anyone have an idea how to write a handler for an SGI in Linux?
Edit: This is a re-wording of the above text, because the question was closed for SSCE reasons. The Cortex-A CPUs are used in multi-CPU systems. An ARM generic interrupt controller (GIC) monitors all global interrupts and dispatches them to a particular CPU. In order for individual CPUs to signal each other, a software generated interrupt (SGI) is sent from one core to the other; this uses peripheral private interrupts (PPI). This question is,
How to implement a Linux kernel driver that can receive an SGI as a PPI?

"Does any one have an idea how to write a handler for SGI in Linux?"
As you didn't give the Linux version, I will assume you work with the latest (or at least recent). The ARM GIC has device tree bindings. Typically, you need to specify the SGI interrupt number in a device tree node,
ipc: ipc@address {
    compatible = "company,board-ipc"; /* Your driver */
    reg = <address range>;
    interrupts = <1 SGI 0x02>; /* SGI is your CPU interrupt. */
    status = "okay";
};
The first number in the interrupt stanza denotes a PPI. The SGI will probably be between 0-15 as this is where the SGI interrupts are routed (at least on a Cortex-A5).
Then you can just use the platform_get_irq() in your driver to get the PPI (peripheral private interrupt). I guess that address is the shared memory (physical) where you wish to do the communications; maybe reg is not appropriate, but I think it will work. This area will be remapped by the Linux MMU and you can use it with,
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
mem = devm_ioremap_resource(&pdev->dev, res); /* returns an ERR_PTR() on failure */
The address in the device tree above is a hex value of the physical address. The platform_get_irq() should return an irq number which you can use with the request_irq() family of functions. Just connect this to your routine.
Edit: Unfortunately, interrupts below 16 are forbidden by the Linux irq-gic.c. For example, gic_handle_irq() limits handlers to interrupts between 16 and 1020. If SMP is enabled, then handle_IPI() is called for the interrupts of interest. gic_raise_softirq() can be used to signal an interrupt. To handle the SGI with the current Linux, smp.c needs additional enum ipi_msg_type values and code to handle these in handle_IPI(). It looks like newer kernels (3.14+ perhaps?) may add a set_ipi_handler() to smp.c to make such a modification unnecessary.

I would like to add that an example of such inter-core communication can be found in TI multicore SoCs (e.g. OMAP3530). Some time ago, when I was using such a mechanism, the means were provided by TI. Specifically, it was the DSPLink Linux device driver that provided this functionality. At that time, unfortunately, it wasn't an open source solution, but maybe there is some technical paper from TI describing how it works ... Just a direction you could investigate further :)
EDIT: In the meantime, it seems that they've made it open source. So, if that's what you are looking for, you can have a look: DSPLink and SysLink (successor of DSPLink)

Who is running the kernel if the CPU is running processes?

Suppose that in a two-process environment, one process is scheduled for execution by the kernel and requests some data that is not available in RAM. The CPU indicates to the kernel that something is not available, and the process is suspended. The kernel then loads the second process for execution on the CPU, starts looking for the data in secondary storage (say, virtual memory), gets it, puts it into main memory by swapping out memory that is currently inactive, and puts the first process back in the ready queue for execution.
We know that everything in a computer system is manipulated by the CPU only. If the CPU is busy continuously executing process code, then who is executing the kernel code to perform the kernel's tasks?
Please let me know if I have explained the scenario clearly.

At any point in time, the CPU (or CPUs) will be doing one of the following:
- Running a process in user mode.
- Running on behalf of a process in kernel mode to execute privileged instructions or access hardware (for example, when a read / write system call is issued).
- Running in response to a hardware interrupt, i.e. running in interrupt context. This is not associated with any process in particular and, yes, it is in kernel mode.
- Running kernel threads to serve deferred work (tasklets / softirqs).
- Running the CPU idle thread if there is nothing else to execute.
If you are asking about scheduling in particular, then:
- Suppose a process is running and it issues a read call to retrieve data from the hard disk. The read system call switches the CPU from user mode to kernel mode. The kernel, running on behalf of the process, prepares the hard disk read operation and then calls the schedule() function, which removes the process from the CPU.
- Suppose a hardware interrupt arrives. The currently running process is suspended, and the interrupt service handler for that interrupt begins to execute in kernel mode (obviously).
Basically, the kernel runs in between user processes!
Clear now?
Shash

The kernel runs either as a result of a hardware interrupt, or as a result of being invoked by a process to do something. In both cases the code which was executing at that moment stops running until the kernel finishes its job.
It is similar to a function call: when function A calls function B, function A has to wait until function B is done doing what it does, and returns control to function A. You do not need multiple CPUs, or any kind of magic to accomplish this.
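The same idea can be shown with a toy model, written here as a rough sketch in Python rather than real kernel code. Everything in it is made up for illustration: each "process" is a generator, each yield stands for a trap into the kernel (a system call or an interrupt), and a single loop plays the kernel. There is only one thread of execution throughout, so at any moment either the "kernel" loop or one "process" is running, never both.

```python
from collections import deque

def process(name, steps):
    """A toy user process; each yield stands for a trap into the kernel."""
    for i in range(steps):
        # User-mode work would happen here, then the process asks for a service
        yield f"{name}: read request {i}"

def run_kernel(processes):
    """Toy kernel loop: one CPU alternates between kernel and process code."""
    ready = deque(processes)   # the ready queue
    trace = []
    while ready:
        proc = ready.popleft()        # kernel code: pick the next process
        try:
            request = next(proc)      # the CPU runs process code until it traps
        except StopIteration:
            continue                  # the process has exited; drop it
        trace.append(request)         # kernel code: handle the request
        ready.append(proc)            # kernel code: back onto the ready queue
    return trace
```

For example, run_kernel([process("A", 2), process("B", 1)]) returns ["A: read request 0", "B: read request 0", "A: read request 1"]: the loop only gets to run when a process gives up the CPU, which is the sense in which the kernel runs in between processes.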

The CPU is not continuously executing process code. The CPU is interrupted to perform various operations. Interrupts can occur for various reasons: a resource becomes available, a previous action completes, or simply a timer goes off.
I recommend this series of videos for more in-depth information: http://academicearth.org/courses/operating-systems-and-system-programming

Windows: how to spawn threads from (NDIS) kernel driver?

Which function is recommended to spawn a new thread within NDIS5/6 context? Looking for something that is guaranteed to work at IRQL=PASSIVE (e.g. no bsods out of nothing); by a quick examination of ndis.h contents, found nothing.
Also, I plan to use the newly spawned thread to call the NdisFreeMemory* family of functions. Will it cause any problems to free allocated but unused memory from a different thread?

Threading is outside the scope of NDIS. If you need to start a new thread, use the standard kernel routines (like PsCreateSystemThread). Note that timers and work items are usually sufficient for most miniport needs. It is unusual for an NDIS miniport to create its own thread, although I suppose there are valid cases where it might be a fair design.
It is ok to allocate memory on one thread and free it on another.
