I'm trying to profile our application by:
Compiling with no optimizations
Linking the c++ code with /profile and debug information.
Doing the command line profiling dance:
vsperfcmd /start:sample /output:profile
vsperfcmd /globalon
vsperfcmd /launch:application.exe /timer:50000
The profiling works, but for some reason, about 50% of the samples are not identified:
Function Name          Inclusive Samples  Exclusive Samples
Unknown Frame(s)       55.01%             47.51%   <-- WHAT IS THIS?
_wWinMainCRTStartup    54.79%             0.00%
[mfc100u.dll]          47.95%             1.56%
__tmainCRTStartup      42.75%             0.00%
I'm guessing that this is not a single function it can't identify, but that it groups all unidentified functions into one "function". That makes it hard to reason about, since it is called from many functions and in turn calls many functions, most of them unrelated.
One would think that it should at least be able to figure out which module each sample was taken from?
I'm trying to debug the execution flow of a piece of code from a point A to function call B.
For that purpose I'm activating some Trace graphics using a cmm script
SYStem.RESetTarget
Break.Delete
Break EcuM_Prv_StartOS
Go
WAIT !STATE.RUN() 5.s
Trace.Init
Trace.METHOD SNOOPer
Trace.Mode PC
Trace.Arm
Break RE_CS_S_SquibDrv_Reset_func
Go
WAIT !STATE.RUN() 5.s
Trace.CHART.FUNC
What I expected in the Chart graph was to see all the function calls and time spent for any function from A (EcuM_Prv_StartOS) to B (RE_CS_S_SquibDrv_Reset_func).
But instead I only see some of the functions in between. As proof of which functions were executed, I also attach the stack frame window next to the chart; it shows all the calls up to my breakpoint in B.
So I wonder whether I'm doing something wrong, or whether this chart simply does not work as I expected, that is, it does not show the complete execution flow of the code.
Note: The uC is an Infineon TriCore TC27X, and this core actually does not have internal TRACE capabilities. But this functionality is under the Perf tab, not the Trace tab, and the PowerView GUI does not block the use of these charts, so I guess it is usable, unlike other TRACE functionality.
You have selected Trace.METHOD SNOOPer. With that method, some items (in your case the PC) are sampled periodically. That is not a suitable trace method for complex run-time analysis.
For a complex run-time analysis you need to use one of the following:
Trace.METHOD Analyzer (requires a PowerTrace and a CPU supporting offchip-trace (parallel or serial))
Trace.METHOD CAnalyzer (requires a CombiProbe and a CPU supporting offchip-trace via a tiny 4-bit trace port)
Trace.METHOD Onchip (requires a CPU supporting onchip-trace)
Since you write that your core has internal trace capabilities (so you probably have a so-called "TriCore Emulation Device"), I think Trace.METHOD Onchip is what you need.
For timing measurements with an onchip trace you have to ensure that your core's onchip trace actually provides some timing information with the program flow information. For a TriCore check TimeSTamp and TImeMode in the MCDS window.
If you only want to use samples of the program counter to get a rough idea of which part of your target software is executed the most, I recommend the PERF command group, which is very similar to the SNOOPer.
For measuring the time between A and B, where the core stops in both A and B, the RunTime command might also help.
I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test. 3 times out of 10 the test failed. In those 3 failing tests, the crash dump hangs at 0% while taking a complete dump, kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf=(struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned,pcmdqSignalSize,((ULONG)'TA1'));
I searched the web for this. However, all the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (i.e. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry configurable slush. The only benefit is that the environment is stripped down so you don't need to be in your all singing, all dancing configuration.
I'm having an issue where my application is failing a debug assertion (_CrtIsValidHeapPointer) before anything is even executed. I know this because I added a breakpoint on the first statement of my main function, and it fails the assertion before the breakpoint is reached.
Is there a way to somehow "step through" everything that happens before my main function is called? Things like static member initializations, etc.
I should note that my program is written in C++/CLI. I recently upgraded to VS2015 and am targeting the v140 toolset. The C++ libraries I'm using (ImageMagick, libsquish, and one of my own C++ libraries) have been tested individually, and I do not receive the assertion failure with these libraries, so it has to be my main application.
I haven't changed any of the code since I upgraded from VS2013, so I'm a little stumped on what is going on.
EDIT:
Here is the call stack. This is after I click "Retry" in the assertion failed window. I then get a multitude of other exceptions being thrown, but they are different each time I run the program.
> ucrtbased.dll!527a6853()
[Frames below may be incorrect and/or missing, no symbols loaded for ucrtbased.dll]
ucrtbased.dll!527a7130()
ucrtbased.dll!527a69cb()
ucrtbased.dll!527c8116()
ucrtbased.dll!527c7eb3()
ucrtbased.dll!527c7fb3()
ucrtbased.dll!527c84b0()
PathCreator.exe!_onexit(int (void)* const function) Line 268 + 0xe bytes C++
PathCreator.exe!atexit(void (void)* const function) Line 276 + 0x9 bytes C++
PathCreator.exe!std::`dynamic initializer for '_Fac_tidy_reg''() Line 65 + 0xd bytes C++
[External Code]
mscoreei.dll!7401cd87()
mscoree.dll!741fdd05()
kernel32.dll!76c33744()
ntdll.dll!7720a064()
ntdll.dll!7720a02f()
You have to debug the C runtime initialization code. Not intuitive to do because the debugger tries hard to avoid it and get you into the main() entrypoint instead. But still possible, use Debug > New Breakpoint > Function Breakpoint.
Enter _initterm for the function name, Language = C.
Press F5 and the breakpoint will hit. You should see the C runtime source code. You can now single-step through the initialization functions of your program one-by-one, every call to (**it)() executes one.
That's exactly what you asked for. But not very likely what you actually want. The odds that your code produces this error are very low. Much more likely is that one of these libraries causes this problem. They are likely to be built targeting another version of the C runtime library. And therefore have their own _initterm() function.
Having more than one copy of the C runtime library in a process is generally very unhealthy. And highly likely to generate heap corruption. If you can't locate it from the stack trace (be sure to change the Debugger Type from Auto to Mixed, always post the stack trace in an SO question) then the next thing you should strongly consider is rebuilding those libraries with the VS version you use.
I am trying to call two kernels as shown below
for (t=0; t<=time_total; t++)
{
//kernel calls
kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code runs and gives some output, but Visual Studio reports the following exception
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check on Nsight it says kernel 2 is having the following error
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that the var array written by kernel 2 has some rows that are correct, some that are copies of other rows' values, and some that are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that Nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken.
Now, regarding your issue, out of resources is frequently due to code requesting too many registers (too many registers per thread times the number of threads per threadblock requested). Try re-compiling your code specifying -Xptxas -v to get verbose output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.
Free Pascal heaptrc keepreleased is described as "useful if you suspect that the same memory block is released twice" but is it possible to detect usage of previously freed memory (object method call of freed object) with it? If it is impossible - can it be detected with other tools?
Yes, it should do that. The idea is the following:
A used allocation has a .sig that differs from $AAAAAAAA or $DEADBEEF. On freemem the sig is checked (see around line 593 in trunk) against sig $AAAAAAAA if useCRC is false.
The keepreleased option prevents blocks from being reused, which would change the signature to something other than $AAAAAAAA. It will print something like:
Marked memory at $12345678 released
to the file descriptor ptext. The error standard files can be set and redirected using various other variables. It looks fairly complicated, but that is probably to deal with consoleless GUI applications.
Some other variables (like haltonerror) govern whether the application is halted on such corruption.
An alternate (but very slow) way is using valgrind (fpc option -gv), but I only have run valgrind on *nix, and as said it is extremely slow, so not for very heavy processing apps.