Trace32 Function RunTime does not work as expected - runtime

I'm trying to debug the execution flow of a piece of code from a point A to function call B.
For that purpose I'm activating some Trace graphics using a cmm script
SYStem.RESetTarget
Break.Delete
Break EcuM_Prv_StartOS
Go
WAIT !STATE.RUN() 5.s
Trace.Init
Trace.METHOD SNOOPer
Trace.Mode PC
Trace.Arm
Break RE_CS_S_SquibDrv_Reset_func
Go
WAIT !STATE.RUN() 5.s
Trace.CHART.FUNC
What I expected in the Chart graph was to see all the function calls and time spent for any function from A (EcuM_Prv_StartOS) to B (RE_CS_S_SquibDrv_Reset_func).
But instead I only see some of the functions in between. As proof of which functions have actually been executed, I also attach in the graph the window with the stack frame, which effectively shows all the calls up to my breakpoint in B.
So I wonder whether I'm doing something wrong or whether this graph simply does not work as I expected, i.e., showing the whole execution flow of the code.
Note: The uC is an Infineon TriCore TC27x, and this core actually does not have internal TRACE capabilities. But this functionality is under the Perf tab, not the Trace tab, and the PowerView GUI does not block the use of these charts, so I guess it is usable, unlike other TRACE functionalities.

You have selected Trace.METHOD SNOOPer. That method means that some items (in your case the PC) are sampled periodically. That is not a suitable trace method for complex run-time analysis.
For a complex run-time analysis you need to use one of the following:
Trace.METHOD Analyzer (requires a PowerTrace and a CPU supporting off-chip trace (parallel or serial))
Trace.METHOD CAnalyzer (requires a CombiProbe and a CPU supporting off-chip trace via a tiny 4-bit trace port)
Trace.METHOD Onchip (requires a CPU supporting on-chip trace)
If your core has internal trace capabilities (which is the case if you have a so-called "TriCore Emulation Device"), I think Trace.METHOD Onchip is what you need.
For timing measurements with an on-chip trace you have to ensure that your core's on-chip trace actually provides some timing information along with the program flow information. For a TriCore, check TimeSTamp and TImeMode in the MCDS window.
To use samples of the program counter just to get a rough idea of which part of your target software is executed the most, I recommend the PERF command group, which is very similar to the SNOOPer.
For measuring the time between A and B, where the core stops at both A and B, the RunTime command might also help.
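For example, your script could switch from the SNOOPer to the on-chip trace roughly like this (a minimal sketch, assuming an emulation device with MCDS; the breakpoint and Go/WAIT lines are unchanged from your script):
Trace.METHOD Onchip      ; use the MCDS on-chip trace instead of the SNOOPer
Trace.Init
Trace.Arm
; check that timestamps are enabled in the MCDS window (TimeSTamp / TImeMode),
; otherwise Trace.CHART.FUNC has no timing information to display
Break RE_CS_S_SquibDrv_Reset_func
Go
WAIT !STATE.RUN() 5.s
Trace.CHART.FUNC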

Related

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test; the test passed only 3 times out of 10. In the failing runs, the crash dump hangs at 0% while taking a Complete dump, Kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf=(struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned,pcmdqSignalSize,((ULONG)'TA1'));
I searched the web regarding this; however, all of the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled, so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (e.g., 1 channel, 8 I/O requests at a time, etc.) and then add in a registry-configurable slush. The only benefit is that the environment is stripped down, so you don't need to be in your all-singing, all-dancing configuration.
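A rough sketch of what such a "crash dump mode only" allocator could look like (the structure and function names here are illustrative, not part of any Storport API; it is meant to be dropped into miniport source that already pulls in the kernel headers, and the only point is handing out pieces of the pre-reserved dump region instead of calling the pool allocator):
/* Illustrative bump allocator over the dump region that was reserved at
 * normal HwFindAdapter time (via RequestedDumpBufferSize) and handed back
 * to the miniport in crash dump mode. */
typedef struct _DUMP_REGION_ALLOCATOR {
    PUCHAR Base;     /* start of the pre-reserved dump region */
    SIZE_T Size;     /* total size of the region */
    SIZE_T Offset;   /* next free byte */
} DUMP_REGION_ALLOCATOR;

PVOID DumpRegionAlloc(DUMP_REGION_ALLOCATOR *Allocator, SIZE_T NumberOfBytes)
{
    PVOID Buffer;
    /* keep allocations cache-aligned, mirroring NonPagedPoolCacheAligned */
    SIZE_T Aligned = (NumberOfBytes + 63) & ~((SIZE_T)63);

    if (Allocator->Offset + Aligned > Allocator->Size) {
        return NULL;    /* region exhausted; size it generously up front */
    }

    Buffer = Allocator->Base + Allocator->Offset;
    Allocator->Offset += Aligned;
    return Buffer;
}
In crash dump mode the driver would initialize Base and Size from the dump region it is given and route every buffer request through this instead of ExAllocatePoolWithTag.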

Retrieve RISC-V processor context after execution in FPGA

I'm loading RISC-V onto a Zedboard and running a benchmark (provided in riscv-tools) without booting riscv-linux, in this case:
./fesvr-zynq median.riscv
It finishes without errors, reporting the number of cycles and instret as the result.
My problem is that I want more information: I would like to know the processor context after the execution (register bank values and memory) as well as the result produced by the algorithm. Is there any way to obtain this from the FPGA execution? I know it can be done with the simulator, but I need to run it on the FPGA.
Thank you.
Do it the same way it gives you the cycles and instret data. Check out riscv-tests/benchmarks/common/*. The code runs bare metal, so you can write whatever code you want and access any of the CSRs, registers, or memory, and then use a basic version of printf to display the information; a rough sketch follows.
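For instance, something along these lines could be called at the end of the benchmark's main() (a minimal sketch assuming the riscv-tests bare-metal build environment: read_csr() comes from its encoding.h and the minimal printf is the one in benchmarks/common/syscalls.c; the result array stands in for whatever your algorithm produces):
#include <stdint.h>
#include <stdio.h>       /* printf prototype; the implementation is the minimal one in common/syscalls.c */
#include "encoding.h"    /* read_csr() macro for CSR access */

/* Print a bit of processor context plus the algorithm output over the
 * host-target interface, the same channel that reports cycles/instret. */
static void dump_context(const int *result, int n)
{
    int i;
    printf("mcycle:   %ld\n", (long)read_csr(mcycle));
    printf("minstret: %ld\n", (long)read_csr(minstret));
    for (i = 0; i < n; i++)
        printf("result[%d] = %d\n", i, result[i]);
}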

What is channel event system?

I am working on a project where I have to deal with the uC ATxmega128A1. Being a beginner with microcontrollers, I want to know what this event channel system is.
I have looked at http://www.atmel.com/Images/doc8071.pdf but I am not getting it.
The traditional way to do the things the event channel system can do is to use interrupts.
In the interrupt model, the CPU runs the code starting with main(), and usually continues with some loop. When a particular event occurs, such as a button being pressed, the CPU is "interrupted". The current processing is stopped, some registers are saved, and execution jumps to some code pointed to by an interrupt vector, called an interrupt handler. This code usually has instructions to save register values, which are added automatically by the compiler.
When the interrupting code is finished, the CPU restores the values that the registers previously had, and execution jumps back to the point in the main code where it was interrupted.
But this approach takes valuable CPU cycles. And some interrupt handlers don't do very much except trigger some peripheral to take an action. Wouldn't it be great if these kinds of interrupt handlers could be avoided and the uC could have the peripherals talk directly to each other without pausing the CPU?
This is what the event channel system does. It allows peripherals to trigger each other directly without involving the CPU. The CPU continues to execute instructions while the channel system operates in parallel. This doesn't mean you can replace all interrupt handlers, though. If complicated processing is involved, you still need a handler to act. But the channel system does allow you to avoid using very simple interrupt handlers.
The paper you reference describes this in a little more detail (but assumes a lot of knowledge on the reader's part). You have to read the actual datasheet of your uC to find the exact details.
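As a concrete illustration, on an ATxmega128A1 routing a timer overflow to the ADC might look roughly like this (a minimal sketch using the register and constant names from avr/io.h; treat the choice of event source and action as an example and check the datasheet for your application):
#include <avr/io.h>

/* Route Timer/Counter C0 overflow onto event channel 0 and let that event
 * start a conversion on ADC A, channel 0 - no interrupt handler and no CPU
 * involvement once it is set up. */
void route_timer_overflow_to_adc(void)
{
    /* event channel 0 carries the TCC0 overflow event */
    EVSYS.CH0MUX = EVSYS_CHMUX_TCC0_OVF_gc;

    /* an event on channel 0 triggers a conversion on ADC channel 0 */
    ADCA.EVCTRL = ADC_EVSEL_0123_gc | ADC_EVACT_CH0_gc;
}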

Cuda kernel launch failure

I am trying to call two kernels as shown below
for (t=0; t<=time_total; t++)
{
//kernel calls
kernel1<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
kernel2<<<noOfBlocks,noOfThreadsPerBlock>>>(** SOME PARAMETERS **);
checkCudaError(cudaThreadSynchronize());
}
And the structure of the second kernel is
var[index+0]=**SOME CALCULATION**
var[index+1]=**SOME CALCULATION**
var[index+2]=**SOME CALCULATION**
Now when I execute this code, checkCudaError does not report anything and the code executes, giving some output, but Visual Studio gives the following exception:
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
First-chance exception at 0x7640c41f in **.exe: Microsoft C++ exception: cudaError_enum at memory location 0x0039f9c4..
And when I check in Nsight, it says kernel 2 has the following error:
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Now the problem is that, in the var array from kernel 2, some of the rows are correct, some are copies of other rows' values, and some are garbage.
Also when I do this
var[index+0]=3
var[index+1]=3
var[index+2]=3
All the values of var are set to 3
A few side notes:
cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().
The fact that Nsight is reporting an error on the 2nd kernel launch, but your error checking code is not, leads me to believe your error checking code is broken; a sketch of more robust checking is shown just below.
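A minimal sketch of such checking (the macro name is illustrative; the key point is that cudaGetLastError() immediately after the launch catches launch failures such as out-of-resources, while cudaDeviceSynchronize() catches errors raised while the kernel ran):
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
and around each launch:
kernel2<<<noOfBlocks, noOfThreadsPerBlock>>>(/* SOME PARAMETERS */);
CUDA_CHECK(cudaGetLastError());        // launch error (e.g. out of resources)
CUDA_CHECK(cudaDeviceSynchronize());   // errors during kernel execution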
Now, regarding your issue, out of resources is frequently due to the code requesting too many registers (registers per thread times the number of threads per block requested). Try re-compiling your code specifying -Xptxas -v to get verbose output, and then recompiling again with -maxrregcount 20 (or something like that) to try to work around this for test purposes.
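For reference, the two compile steps described above might look like this (the file name is a placeholder); the first reports the registers used per thread for each kernel, the second caps them for the test:
nvcc -Xptxas -v -c mykernels.cu
nvcc -maxrregcount=20 -c mykernels.cu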
If this "fixes" your problem, you may then want to consider the following:
See if there is a way you can re-order or restructure your code to reduce the register pressure
If not, then adjust your maxrregcount value upwards to approximately the highest value that will allow your code to compile and run according to the launch configurations (number of threads per block) that you care about. You may also want to benchmark your code at different levels of this setting, as it can affect occupancy. Usually if you have it set to the highest value that will compile and run, then you are limiting yourself to one threadblock per SM at execution time. This may be OK, or there may be a lower setting that is better, allowing two threadblocks per SM residency, and possibly higher performance. Only benchmarking your code will tell.

TI MSP430 Interrupt Problems After UART Code Port

I am using the MSP430F2013 processor for an application; the MSP430F2013 doesn't have a hardware UART, but I need one, so I used TI's sample code "msp430x20x3_ta_uart2400.c" to emulate one using the Timer module. This all worked fine (compiled with IAR Embedded Workbench); I tested it using PuTTY to transmit characters to a development board and a loopback to echo them back to the terminal.
That was a de-risking exercise, and now I've come to port that code into my application's state machine. Having done this, I'm having issues surrounding the timer interrupts and low power sleep modes. Here's the snippet of my code around the entry into the low power (sleep) mode:
// Prepare the UART to receive one byte.
prepare_receiver();
// Enter low power mode 1.
__bis_SR_register(LPM1_bits + GIE);
// Check whether the full message has been received.
if(true == get_message_complete())
{
process_event(e_euart_message_received, NULL);
}
What I'm seeing in the debugger (C-SPY) is that sometimes it will execute the __bis_SR_register() line on first entry and then go straight to the if statement, i.e., ignoring the fact that I've asked it to go to sleep. On other occasions, when it does go to sleep when it should, the ISR triggers correctly and eventually brings me back to the if statement to continue program execution (as I'm expecting). However, if I try to step to the next statement, the application freezes on that first line, i.e., I can't advance.
I can't think of anything functionally different from TI's example that I'm doing, so I figure my problem must be something to do with how I've ported it. For example, my Timer ISR and the code I've posted here are in different compilation units - would this sort of decision have any bearing on things? I'm aware my question might be a little vague but unfortunately I can't post all of my code, so instead I'm looking for someone with MSP experience who might be able to suggest some things to look at or some potential pitfalls that I may have fallen into.
Debugging interrupts with C-SPY in low power mode is going to be tricky. According to Section A.3 Debugging (C-SPY) of the IAR User's Guide:
5) C-SPY can debug applications that utilize interrupts and low power modes
But there are some "gotchas" that you should be aware of that may be causing your headaches.
In particular:
14) When C-SPY has control of the device, the CPU is ON (that is, it is not in low-power mode) regardless of the settings of the low-power mode bits in the status register. Any low-power mode conditions are restored prior to Step or Go. Consequently, do not measure the power consumed by the device while C-SPY has control of the device. Instead, run your application using Go with JTAG released.
19) C-SPY utilizes the system clock to control the device during debugging. Therefore, device counters, etc., that are clocked by the Main System Clock (MCLK) are affected when C-SPY has control of the device. Special precautions are taken to minimize the effect upon the Watchdog Timer. The CPU core registers are preserved. All other clock sources (SMCLK, ACLK) and peripherals continue to operate normally during emulation. In other words, the Flash Emulation Tool is a partially intrusive tool.
Devices that support clock control (Emulator → Advanced → Clock Control) can further minimize these effects by selecting to stop the clock(s) during debugging.
24) Peripheral bits that are cleared when read during normal program execution (that is, interrupt flags) are cleared when read while being debugged (that is, memory dump, peripheral registers).
When using certain MSP430 devices (such as MSP430F15x, MSP430F16x, MSP430F43x, and MSP430F44x devices), bits do not behave this way (that is, the bits are not cleared by C-SPY read operations).
26) While single stepping with active and enabled interrupts, it can appear that only the interrupt service routine (ISR) is active (that is, the non-ISR code never appears to execute, and the single step operation always stops on the first line of the ISR). However, this behavior is correct because the device always processes an active and enabled interrupt before processing non-ISR (that is, mainline) code. A workaround for this behavior is, while within the ISR, to disable the GIE bit on the stack so that interrupts are disabled after exiting the ISR. This permits the non-ISR code to be debugged (but without interrupts). Interrupts can later be reenabled by setting GIE in the status register in the Register window.
On devices with the clock control emulation feature, it may be possible to suspend a clock between single steps and delay an interrupt request (Emulator → Advanced → Clock Control).
One thing to try is commenting out all the low power code and seeing if your UART code works like that. Then go back and try re-enabling the low power mode.
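For reference, the wake-up that the questioner's snippet relies on comes from the receive ISR clearing the low-power bits in the saved status register, roughly like this (a minimal sketch following the pattern in TI's example; the vector choice, flag, and ISR body are illustrative and depend on how the port is structured):
#include <msp430.h>
#include <stdbool.h>

static volatile bool byte_complete;   /* set once a full byte has been sampled */

/* Timer_A ISR of the software UART (sketch). Once the last bit of a byte
 * has been received, clear the LPM1 bits in the stacked SR so that
 * execution resumes at the line after __bis_SR_register(LPM1_bits + GIE). */
#pragma vector = TIMERA0_VECTOR
__interrupt void timer_a0_isr(void)
{
    /* ... sample the RX line and shift the bit into the receive buffer ... */

    if (byte_complete)
    {
        __bic_SR_register_on_exit(LPM1_bits);   /* wake the main state machine */
    }
}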
The answer to this question lies in the debugging setup, and more specifically in what types of breakpoints are being used. I had quite a complex series of macros running on program upload, which set various hooks into memory for testing purposes. These hooks relied on software breakpoints being created, which would then call functions outside of the application. I have seen no problem using these breakpoints in normal use; however, their existence means that the debugging session doesn't run in real time (i.e., the device is under control of the host PC). This, for a reason not yet completely known to me, caused problems when trying to debug interrupts and low power modes. (I suspect that if I were to look a bit deeper, I would see the need to use clock control whilst debugging, but I'll save that for another day.)
So, to solve this problem and allow me to debug my interrupt and low power mode heavy code, which I'd ported into my larger application state machine, I had to do the following:
Disable software breakpoints within IAR. They're not actually enabled by default, but if you've been doing clever things with macros like I had, you probably would've needed to enable them, since there just aren't enough hardware breakpoints available in most MSP430s (for instance, I only have two in the MSP430F2013, and C-SPY more often than not hogs one of those!). The obvious downside is that debugging becomes a bit more laborious, but at least it's reliable.
Remove links to .mac macro files. In other words, if you're using macros, don't. In my case, this meant that I had to hack some state machine logic in order to force myself down a certain route (which previously the macro had been doing for me). This clearly isn't ideal, but it will allow you to debug the interrupt/low power mode code. The macros can then be re-enabled afterwards.
So it turned out that there wasn't a problem with my port after all. I'm not particularly happy with this hacky solution, but at least it's a step forward. If I have the time, I'll investigate to see if I can work out a way of using software breakpoints and add to this answer.
