CUDA kernel launch failure - visual-studio-2010

I am trying to call two kernels as shown below
for (t=0; t<=time_total; t++)
{
//kernel calls
kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code runs, producing some output, but Visual Studio gives the following exception
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check in Nsight, it says kernel 2 hit the following error
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that the var array in kernel 2 comes back with some rows correct, some rows that are copies of other rows' values, and some garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3

A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that Nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
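For reference, a checking pattern that catches both launch errors and execution errors might look like the sketch below (your checkCudaError isn't shown in the question, so this definition is an assumption about what it should do):

// Sketch of an error-checking helper; your actual checkCudaError may differ.
#include <cstdio>
#include <cstdlib>

#define checkCudaError(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

// A launch failure such as out-of-resources is reported by the next CUDA
// call, so check the launch itself before synchronizing:
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(/* ... */);
checkCudaError(cudaPeekAtLastError());   // catches launch errors
checkCudaError(cudaDeviceSynchronize()); // catches errors during execution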
Now, regarding your issue, out of resources is frequently due to a kernel requesting too many registers (registers per thread times the number of threads per threadblock exceeding the per-block limit). Try re-compiling your code specifying -Xptxas -v to get verbose register-usage output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
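As a concrete sketch (the file and program names here are placeholders):

nvcc -Xptxas -v -o myapp mykernels.cu
nvcc -Xptxas -v -maxrregcount 20 -o myapp mykernels.cu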
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.
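If capping registers does help, a per-kernel alternative to the global -maxrregcount flag is the __launch_bounds__ qualifier, which bounds register usage only for the kernel it annotates. A minimal sketch (the bounds and the parameter are example values, since kernel2's real signature isn't shown in the question):

// Sketch: promise at most 256 threads per block and ask the compiler for
// at least 2 resident blocks per SM, which caps the registers per thread.
__global__ void __launch_bounds__(256, 2)
kernel2(float *var /* example parameter */)
{
    // ... kernel body ...
}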

Related

Trace32 Function RunTime does not work as expected

I'm trying to debug the execution flow of a piece of code from a point A to function call B.
For that purpose I'm activating some Trace graphics using a cmm script
SYStem.RESetTarget
Break.Delete
Break EcuM_Prv_StartOS
Go
WAIT !STATE.RUN() 5.s
Trace.Init
Trace.METHOD SNOOPer
Trace.Mode PC
Trace.Arm
Break RE_CS_S_SquibDrv_Reset_func
Go
WAIT !STATE.RUN() 5.s
Trace.CHART.FUNC
What I expected in the Chart graph was to see all the function calls and time spent for any function from A (EcuM_Prv_StartOS) to B (RE_CS_S_SquibDrv_Reset_func).
But instead I only see some of the functions in between. As proof of which functions have been executed, I also attach in the graph the window with the stack frame, which effectively shows all the calls up to my breakpoint in B.
So I wonder whether I'm doing something wrong or this graph simply does not work the way I expected, i.e. showing the whole execution flow of the code.
Note: the uC is an Infineon TriCore TC27x, and this core actually does not have internal trace capabilities. But this functionality is under the Perf tab, not the Trace tab, and the PowerView GUI does not block the use of these charts, so I guess it is usable, unlike the other trace functionalities.
You have selected Trace.METHOD SNOOPer. That method means that some items (in your case the PC) are periodically sampled. That is not the suitable trace method for complex run-time analysis.
For a complex run-time analysis you need to use one of the following:
Trace.METHOD Analyzer (requires a PowerTrace and a CPU supporting offchip-trace (parallel or serial))
Trace.METHOD CAnalyzer (requires a CombiProbe and a CPU supporting offchip-trace via a tiny 4-bit trace port)
Trace.METHOD Onchip (requires a CPU supporting onchip-trace)
Since you write that your core has internal trace capabilities (so you probably do have a so-called "TriCore Emulation Device"), I think Trace.METHOD Onchip is what you need.
For timing measurements with an onchip trace you have to ensure that your core's onchip trace actually provides some timing information with the program flow information. For a TriCore check TimeSTamp and TImeMode in the MCDS window.
For using samples of the program counter to get just a rough clue about which parts of your target software execute the most, I recommend the PERF command group, which is very similar to the SNOOPer.
For measuring the time between A and B where the core stops in both A and B the RunTime command might also help.
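Put together, a reworked version of the script above using the onchip trace might look like this (a sketch; it assumes a TriCore Emulation Device with a working MCDS setup, and drops the SNOOPer-specific Trace.Mode PC line):

; Same flow as the original script, with the SNOOPer swapped
; for the onchip flow trace
SYStem.RESetTarget
Break.Delete
Break EcuM_Prv_StartOS
Go
WAIT !STATE.RUN() 5.s
Trace.Init
Trace.METHOD Onchip
Trace.Arm
Break RE_CS_S_SquibDrv_Reset_func
Go
WAIT !STATE.RUN() 5.s
Trace.CHART.FUNC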

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test, and it failed 3 times out of 10. In those 3 failing runs, the crash dump hangs at 0% while taking a Complete dump, Kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf=(struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned,pcmdqSignalSize,((ULONG)'TA1'));
I searched the web regarding this; however, all of the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
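That "crash dump mode only" allocator can be as simple as a bump allocator over the preallocated region. A minimal sketch, assuming you saved the region's base and size during the dump-mode HwFindAdapter call (all names below are hypothetical, not Storport APIs):

/* Hypothetical bump allocator over the CrashDumpRegion buffer. */
#include <ntddk.h>

typedef struct _DUMP_ALLOC_STATE {
    PUCHAR Base;     /* start of the preallocated dump buffer */
    SIZE_T Size;     /* total bytes in the region */
    SIZE_T Offset;   /* next free byte */
} DUMP_ALLOC_STATE, *PDUMP_ALLOC_STATE;

PVOID DumpAlloc(PDUMP_ALLOC_STATE State, SIZE_T Bytes)
{
    /* Round up to cache-line size, mirroring NonPagedPoolCacheAligned. */
    SIZE_T aligned = (Bytes + 63) & ~(SIZE_T)63;

    if (State->Offset + aligned > State->Size) {
        return NULL;             /* region exhausted */
    }
    PVOID p = State->Base + State->Offset;
    State->Offset += aligned;
    return p;
}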
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (i.e. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry configurable slush. The only benefit is that the environment is stripped down so you don't need to be in your all singing, all dancing configuration.

getting system time in VxWorks

Is there any way to get the system time in VxWorks besides tickGet() and tickAnnounce()? I want to measure the time between task switches of a specified task, but I think the precision of tickGet() is not good enough, because the two tickGet() values at the beginning and the end of the taskSwitchHookAdd hook are always the same!
If you are looking to try and time task switches, I would assume you need a timer at least at the microsecond (us) level.
Usually, timers/clocks this fine-grained are only provided by the platform you are running on. If you are working on an embedded system, you can try reading through the manuals for your board support package (if there is one) to see if there are any functions provided to access the various timers on the board.
A lower-level solution would be to figure out which processor your system runs on and then write some simple assembly code to poll the processor's internal timebase register (TBR). This might require a bit of research on the processor you are using, but could be easily done.
If you are running on a PPC based processor, you can use the code below to read the TBR:
loop: mftbu rx #load most significant half from TBU
mftbl ry #load least significant half from TBL
mftbu rz #load from TBU again
cmpw rz,rx #see if 'old' = 'new'
bne loop #repeat if two values read from TBU are unequal
On an x86 based processor, you might consider using the RDTSC assembly instruction to read the Time Stamp Counter (TSC). On vxWorks, pentiumALib has some library functions (pentiumTscGet64() and pentiumTscGet32()) that will make reading the TSC easier using C.
source: http://www-inteng.fnal.gov/Integrated_Eng/GoodwinDocs/pdf/Sys%20docs/PowerPC/PowerPC%20Elapsed%20Time.pdf
Good luck!
It depends on what platform you are on, but if it is x86 then you can use:
pentiumTscGet64();
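For example, a rough sketch of timing a code section with it (the extern declaration mirrors the pentiumALib prototype as an assumption; in a real project pull it in from the proper VxWorks header instead):

/* Sketch: count elapsed TSC ticks around a section on an x86 target. */
#include <vxWorks.h>
#include <stdio.h>

extern void pentiumTscGet64 (long long int * pTsc);  /* from pentiumALib */

void timeSection (void)
    {
    long long int start, end;

    pentiumTscGet64 (&start);
    /* ... code under measurement ... */
    pentiumTscGet64 (&end);

    /* Ticks divided by the CPU clock frequency give seconds. */
    printf ("elapsed TSC ticks: %lld\n", end - start);
    }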

VLD does not detect memory leak whereas CRT Debug shows memory leak

Mine is a legacy code base, and somewhere a long time back a memory leak was introduced.
I am using
_CrtSetDbgFlag ( _CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF );
which implicitly calls _CrtDumpMemoryLeaks() upon termination/exit from anywhere in the code. I am aware that if we call _CrtDumpMemoryLeaks() explicitly while some global object is still alive, a false memory leak will be reported. That is why I am not calling _CrtDumpMemoryLeaks() directly.
I get memory leak reports, but since I am using "new" throughout the code, the exact line numbers and file names do not come through (in spite of declaring #define _CRTDBG_MAP_ALLOC, which only maps the malloc family, not new). I have 3 options:
1. Use VLD: but it does not report any leaks.
2. Override operator new so that the CRT reports file and line. But I get an error that new is redefined; since it is a huge code base, some other place has already overridden it and conflicts arise.
3. Use the allocation number in curly braces with _CrtSetBreakAlloc, but that number is not stable across runs, so I cannot use this strategy either.
Please help me resolve this issue. Are there any better memory leak detection tools? I used Valgrind on Linux as well; it also does not report any memory leak. Only CRT debug reports one.
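For reference, the setup Microsoft documents for making the CRT report file/line for new is essentially option 2 above; a sketch of it follows (and, as described, the global redefinition is exactly what clashes when another part of a large code base already redefines new):

// Sketch of the documented CRT debug setup. _CRTDBG_MAP_ALLOC must be
// defined before the CRT headers; it only maps the malloc family, so
// operator new needs its own mapping.
#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

#ifdef _DEBUG
#define DBG_NEW new ( _NORMAL_BLOCK , __FILE__ , __LINE__ )
#define new DBG_NEW   // the global redefinition that conflicts in option 2
#endif

int main()
{
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    int *leak = new int[10];  // now reported with file and line at exit
    return 0;
}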

need explanation on why EXCEPTION_ACCESS_VIOLATION occurs

Hi, I know that the error I'm going to show can't be fixed through code. I just want to know why and how it is caused. I also know it's due to the JVM trying to access the address space of another program.
A fatal error has been detected by the Java Runtime Environment:
EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x6dcd422a, pid=4024, tid=3900
JRE version: 6.0_14-b08
Java VM: Java HotSpot(TM) Server VM (14.0-b16 mixed mode windows-x86 )
Problematic frame:
V [jvm.dll+0x17422a]
An error report file with more information is saved as:
C:\PServer\server\bin\hs_err_pid4024.log
If you would like to submit a bug report, please visit:
http://java.sun.com/webapps/bugreport/crash.jsp
The book "modern operating systems" from Tanenbaum, which is available online here:
http://lovingod.host.sk/tanenbaum/Unix-Linux-Windows.html
Covers the topic in depth. (Chapter 4 is in Memory Management and Chapter 4.8 is on Memory segmentation). The short version:
It would be very bad if several programs on your PC could access each other's memory. Actually, even within one program, even in one thread, you have multiple areas of memory that must not influence one another. Usually a process has at least one memory area called the "stack" and one area called the "heap" (commonly every process has one heap plus one stack per thread; there MAY be more segments, but this is implementation dependent and does not matter for the explanation here). On the stack, things like your function's arguments and your local variables are saved. On the heap are stored variables whose size and lifetime cannot be determined by the compiler at compile time (in Java, that would be everything you use the "new" operator on). Example:
public void bar(String hi, int myInt)
{
String foo = new String("foobar");
}
In this example there are two String objects (referenced by "foo" and "hi"). Both objects are on the heap (you know this because at some point both Strings were allocated using "new"). And in this example three values are on the stack: the values of "myInt", "hi" and "foo". It is important to realize that "hi" and "foo" don't really contain the Strings directly; instead they contain a reference that tells them where on the heap the String can be found. (This is not as easy to explain using Java because Java abstracts a lot. In C, "hi" and "foo" would be pointers, which are really just integers representing the address in the heap where the actual value is stored.)
You might ask yourself why there is a stack and a heap anyway. Why not put everything in the same place? Explaining that unfortunately exceeds the scope of this answer. Read the book I linked ;-). The short version is that stack and heap are managed differently and the separation is done for reasons of optimization.
The size of stack and heap are limited. (On Linux execute ulimit -a and you'll get a list including "data seg size" (heap) and "stack size" (yeah... stack :-)).).
The stack is something that just grows, like an array that gets bigger and bigger as you append more and more data. Eventually you run out of space. In that case you may end up writing into a memory area that does not belong to you anymore, and that would be extremely bad. So the operating system notices this and stops the program when it happens. On Linux you get a "Segmentation fault" and on Windows you get an "Access violation".
In other languages, like C, you need to manage your memory manually. A tiny error can easily cause you to accidentally write into some space that does not belong to you. In Java you have "automatic memory management", which means the JVM does all of this for you. You don't need to care, and that takes a load off your shoulders as a developer (it usually does; I bet there are people out there who would disagree about the "load" part ;-)). This means it /should/ be impossible to produce segmentation faults with Java. Unfortunately the JVM is not perfect. Sometimes it has bugs and screws up. And then you get what you got.