Can we assign a block to a specific SM programmatically? Can we get runtime information (number of blocks or warps, a block's or warp's execution time, etc.) about an SM?
No.
No.
Nothing of this sort is exposed by the standard CUDA runtime and driver APIs.
I have a general question about Linux device drivers. I often get confused about which actions are and are not allowed in a Linux device driver.
Are there any rules, or some kind of lookup list, to follow?
For instance, with the following examples, which are not allowable?
msleep(1000);
val = kmalloc(sizeof(val), GFP_KERNEL);
printk(KERN_ALERT "failed to print\n");
ret = adc_get_val() * 0.001;
In Linux device driver programming it depends on which context you are in. There are two contexts that need to be distinguished:
process context
IRQ context.
Sleeping can only be done in process context; otherwise you schedule the work for later execution (there are several mechanisms available to do that). This is a complex topic that cannot be covered in a paragraph.
Allocating memory can sleep; it depends on which parameters/flags kmalloc is invoked with.
printk can always be called (once the kernel is up); before that, use early_printk.
I don't know what the function adc_get_val does; it is not part of the Linux kernel. And, as has already been commented, floating-point values cannot easily be used in the kernel.
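To illustrate the context distinction, here is a minimal sketch (the function names and the interrupt handler are hypothetical, not taken from the question) of how the allocation flag and the sleeping rules differ between process context and IRQ context:

#include <linux/slab.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/printk.h>

/* Process context (e.g. an open() or ioctl() handler): sleeping is allowed,
 * so GFP_KERNEL and msleep() are fine here. */
static int example_process_context(size_t len)
{
    char *buf = kmalloc(len, GFP_KERNEL);   /* may sleep while memory is reclaimed */
    if (!buf)
        return -ENOMEM;
    msleep(1000);                           /* allowed: we may be scheduled out */
    kfree(buf);
    return 0;
}

/* IRQ context (an interrupt handler): sleeping is forbidden, so use
 * GFP_ATOMIC and never call msleep() here. */
static irqreturn_t example_irq_handler(int irq, void *dev_id)
{
    char *buf = kmalloc(32, GFP_ATOMIC);    /* must not sleep */
    if (!buf) {
        printk(KERN_ALERT "allocation failed in IRQ context\n");
        return IRQ_HANDLED;
    }
    kfree(buf);
    return IRQ_HANDLED;
}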
I'm loading RISC-V onto a Zedboard and I'm running a benchmark (provided in riscv-tools) without booting riscv-linux, in this case:
./fesvr-zynq median.riscv
It finishes without errors, reporting the number of cycles and instret as its result.
My problem is that I want more information: I would like to know the processor state after the execution (register values and memory contents) as well as the result produced by the algorithm. Is there any way to obtain this from the FPGA execution? I know it can be done with the simulator, but I need to run it on the FPGA.
Thank you.
Do it the same way it gives you the cycles and instret data. Check out riscv-tests/benchmarks/common/*. The code runs bare metal, so you can write whatever code you want and access any of the CSRs, registers, or memory, and then use the basic version of printf provided there to display the information.
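As a minimal sketch of that idea (assuming an RV64 core running in machine mode, as the bare-metal riscv-tests benchmarks do, and assuming the minimal printf from riscv-tests/benchmarks/common is linked in; the function names are mine):

#include <stdint.h>

/* The bare-metal printf from riscv-tests/benchmarks/common (syscalls.c),
 * which supports only basic format specifiers. */
extern int printf(const char *fmt, ...);

static inline uint64_t read_mcycle(void)
{
    uint64_t x;
    asm volatile ("csrr %0, mcycle" : "=r"(x));
    return x;
}

static inline uint64_t read_minstret(void)
{
    uint64_t x;
    asm volatile ("csrr %0, minstret" : "=r"(x));
    return x;
}

/* Call this after the benchmark kernel to dump counters and results. */
void report_state(const int *results, int n)
{
    int i;
    printf("cycles = %ld, instret = %ld\n",
           (long)read_mcycle(), (long)read_minstret());
    for (i = 0; i < n; i++)
        printf("results[%d] = %d\n", i, results[i]);
}

The same pattern works for any other CSR or memory region you want to inspect after the run.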
Is there any way to get the system time in VxWorks besides tickGet() and tickAnnounce()? I want to measure the time between the task switches of a specified task, but I think the precision of tickGet() is not good enough, because the two tickGet() values at the beginning and the end of the task switch hook (installed with taskSwitchHookAdd) are always the same.
If you are looking to time task switches, I would assume you need a timer with at least microsecond (us) resolution.
Usually, timers/clocks this fine-grained are only provided by the platform you are running on. If you are working on an embedded system, you can try reading through the manuals for your board support package (if there is one) to see if it provides any functions to access the timers on the board.
A more low-level solution would be to figure out which processor your system runs on and then write some simple assembly code to poll the processor's internal time base register (TBR). This might require a bit of research on the processor you are using, but can be done easily.
If you are running on a PPC based processor, you can use the code below to read the TBR:
loop: mftbu rx #load most significant half from TBU
mftbl ry #load least significant half from TBL
mftbu rz #load from TBU again
cmpw rz,rx #see if 'old' = 'new'
bne loop #repeat if two values read from TBU are unequal
On an x86-based processor, you might consider using the RDTSC instruction to read the Time Stamp Counter (TSC). On VxWorks, pentiumALib provides library functions (pentiumTscGet64() and pentiumTscGet32()) that make reading the TSC easier from C.
source: http://www-inteng.fnal.gov/Integrated_Eng/GoodwinDocs/pdf/Sys%20docs/PowerPC/PowerPC%20Elapsed%20Time.pdf
Good luck!
It depends on what platform you are on, but if it is x86 then you can use:
pentiumTscGet64();
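Building on both answers above, here is a minimal sketch of a task switch hook that timestamps switches with pentiumTscGet64(). The hook and function names are mine, the pentiumLib.h header name is an assumption, and it assumes the VxWorks signature void pentiumTscGet64 (long long int *pTsc):

#include <vxWorks.h>
#include <taskLib.h>
#include <taskHookLib.h>
#include <logLib.h>
#include <pentiumLib.h>   /* header name assumed; declares pentiumTscGet64() */

static long long lastTsc = 0;

/* Called on every task switch: read the TSC and log the cycles elapsed
 * since the previous switch. logMsg truncates the delta to an int here,
 * which is acceptable for a quick measurement sketch. */
static void tscSwitchHook (WIND_TCB *pOldTcb, WIND_TCB *pNewTcb)
    {
    long long now;

    pentiumTscGet64 (&now);              /* read the 64-bit Time Stamp Counter */
    if (lastTsc != 0)
        logMsg ("cycles since last switch: %d\n",
                (int)(now - lastTsc), 0, 0, 0, 0, 0);
    lastTsc = now;
    }

STATUS installTscSwitchHook (void)
    {
    return taskSwitchHookAdd ((FUNCPTR) tscSwitchHook);
    }

Dividing the cycle delta by the CPU clock frequency gives the elapsed time in seconds.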
What is the best way to check for successful allocation of memory when using new in a kernel call with CUDA? Is there anything similar to (nothrow)? If there isn't, is there a way to continue execution of the kernel even in the event of a memory allocation failure?
Thanks!
I don't think that new is officially supported on the device side. Moreover, to my knowledge there is no support for exceptions on the device side, so annotations like nothrow have no effect.
What you can do in the kernel is call malloc. Upon failure the function simply returns NULL, and you can check for that normally.
Do note that
device-side malloc is supported only on devices of compute capability 2.0 (Fermi) and higher.
By default you have only 8 MB of heap memory. If you want more, you need to raise the limit through cudaDeviceSetLimit.
Further reading: CUDA C Programming Guide, v.5.0, chapter B.17 - Dynamic Global Memory Allocation
Update: Tests have shown that new seems to be supported and seems to work the same way, i.e. returning NULL upon failure.
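To make the NULL check concrete, here is a minimal sketch (kernel name and sizes are arbitrary). It uses device-side malloc; per the update above, the same check should apply to new:

#include <cstdio>

// Each thread attempts a device-heap allocation and checks for NULL instead of
// relying on exceptions (which do not exist in device code). Threads whose
// allocation fails skip their work, and the kernel keeps running.
__global__ void allocCheckKernel(size_t bytesPerThread)
{
    char *buf = (char *)malloc(bytesPerThread);
    if (buf == NULL) {
        printf("thread %d: allocation failed\n", threadIdx.x);
        return;
    }
    // ... use buf ...
    free(buf);
}

int main()
{
    // Raise the default 8 MB device heap before the first kernel launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    allocCheckKernel<<<1, 128>>>(1024);
    cudaDeviceSynchronize();
    return 0;
}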
I want to know which core of a multicore processor initializes first when the CPU boots (I mean at the bootloader level). Is it the first core, or a random core?
You want to read the local APIC, which you can read about in Volume 3A of the Intel Software Developer's Manuals:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Each processor has a corresponding local APIC, and each local APIC has an APIC ID register, which gets assigned a unique value at system init time.
The initial core that comes online is called the bootstrap processor (BSP), and it can indeed be any physical core on the die. More information is in Volume 3A, where the bootstrap processor selection process is described.
Here is an excerpt from vol3a:
8.4.1 BSP and AP Processors
The MP initialization protocol defines two classes of processors: the bootstrap processor (BSP) and the application processors (APs). Following a power-up or RESET of an MP system, system hardware dynamically selects one of the processors on the system bus as the BSP. The remaining processors are designated as APs.
As part of the BSP selection mechanism, the BSP flag is set in the IA32_APIC_BASE MSR (see Figure 10-5) of the BSP, indicating that it is the BSP. This flag is cleared for all other processors.
The BSP executes the BIOS’s boot-strap code to configure the APIC environment, sets up system-wide data structures, and starts and initializes the APs. When the BSP and APs are initialized, the BSP then begins executing the operating-system initialization code.
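As a minimal bare-metal sketch of checking the BSP flag mentioned in the excerpt (ring 0 bootloader/kernel code, GCC-style inline assembly; IA32_APIC_BASE is MSR 0x1B and the BSP flag is bit 8):

#include <stdint.h>

#define IA32_APIC_BASE_MSR 0x1B
#define APIC_BASE_BSP_FLAG (1u << 8)    /* bit 8: this processor is the BSP */

/* rdmsr must run at privilege level 0 (bootloader/kernel code only). */
static inline uint64_t rdmsr(uint32_t msr)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdmsr" : "=a"(lo), "=d"(hi) : "c"(msr));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns nonzero if the core executing this code was selected as the BSP. */
int running_on_bsp(void)
{
    return (rdmsr(IA32_APIC_BASE_MSR) & APIC_BASE_BSP_FLAG) != 0;
}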
This depends on the architecture of the processor itself; there is not really a standard for something like this. For instance, the PS3's processor has nine cores, one of which schedules tasks for the other eight. In that case it is fair to think of that one core as processing instructions before the other eight. For other processors this is more difficult to discern. It would be sensible to assume that the bootloader sends its instructions to the set of cores, at which point whatever logic assigns instructions to cores does so in whatever manner it always does. In most cases I know of, there is not really a difference between task scheduling at boot and at any other time. The most basic task-scheduling hardware will just select the first available core, which is usually whichever core the machine considers the "first" one. But, as I keep saying, different machines do it differently, so I would suggest finding out which processor you are using and checking what it does. Good luck.
Each processor has its own local APIC with an associated local APIC ID, which can be read from the local APIC ID register (the same register returns a different ID on each processor).