System panic when set multiple '1' in GICv2 GICD_ITARGETSRn in ARMv7 - multiprocessing

I am working on a AliOS-Things project with an ARMv7 SOC (2cores).
And now I want to change GICD_ITARGETSRn to distrubite one interrupt to multiple cores by setting 011b, but the system will crash when interrupt is enabled.
I think there should be something wrong in the RTOS kernel to deal with this case or I missed something in GICv2 configuration or I did something wrong in this case.
So what is the key point to distrubite 1 IRQ to multiple cores and handle it in only in one CPU, without introducing race conditions.
I am not sure if it is correct to set multiple 1s in GICD_ITARGETSRn for one interrupt.

Related

What is the difference within the compiler between debugging and running the code? (STM32)

somehow when i am running my code, it seems like one GPIO Port isn't being initialized, meanwhile if i am debugging, it is.
I am initializing two sensors:
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
Sensor Heater 1 is not getting any Information, Sensor Heater 2 is getting Informations. Now if i swap the Name of the Heaters:
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
and run the code in the debugger, Sensor Heater 1 and 2 are getting Informations.
How can this happen? I was thinking about a timing problem, but since it is working in the debugger, i don't really know what to do.
Provided that you are debugging and/or running the same binary. Debugging is mostly the same as running except if you halt the processor (es breakpoints).
In that case...
some peripherals could continue to run or be halted togheder with the cpu, the behaviour is some cases can be configured. (timers, watchdog...)
some interrupts can be lost.
some hardware buffers can overflow and data can be lost (if you don't use any flow control in your IO)
How do you run the code in debug mode? Do you have breakpoints somewhere?
You (OP) are right about it being most likely a timing problem, and probably related to physical SPI transmission. Because your line of code to send/receive something over SPI has already executed in the MCU, but physically the bits and bytes are still being transmitted on the line, while MCU is already calling the next SPI function, so one of the transmissions will fail. Try adding some delay after SPI transmission code. If things work after that, then it's the timing of SPI peripheral, and you need to add a check that there is no SPI transmission already in place before you call a functions to send/receive something.
You can do while(transmission) (pseudocode, replace with actual check if SPI transmission is going on) to wait until the previous transmission ends to call the next one.

Calling local_irq_disable() from the kernel also disable local interrupts in userspace?

From the kernel I can call local_irq_disable(). To my understanding it will disable the interrupts of the current CPU. And interrupts will remain disabled until I call local_irq_enable(). Please correct me if my understanding is incorrect.
If my understanding is correct, does it mean upon calling local_irq_disable() interrupt is also disabled for a process in the user space that is running on that same CPU?
More details:
I have a process running in the user space which I want to run without affected by interrupts and context switch. As it is not possible from the user space, I thought disabling interrupt and kernel preemption for a particular CPU from kernel will help in this case. Therefore, I wrote a simple device driver to disable kernel preemption and local interrupt by using the following code,
int i = irqs_disabled();
pr_info("before interrupt disable: %d\n", i);
pr_info("module is loaded on processor: %d\n", smp_processor_id());
id = get_cpu();
message[1] = smp_processor_id() + '0';
local_irq_disable();
printk(KERN_INFO " Current CPU id is %c\n", message[1]);
printk(KERN_INFO " local_irq_disable() called, Disable local interrupts\n");
pr_info("After interrupt disable: %d\n", irqs_disabled());
output: $dmesg
[22690.997561] before interrupt disable: 0
[22690.997564] Current CPU id is 1
[22690.997565] local_irq_disable() called, Disable local interrupts
[22690.997566] After interrupt disable: 1
I think the output confirms that local_irq_disable() does disable local interrupts.
After I disable the kernel preemption and interrupts, In the userspace I use CPU_SET() to pin my process into that particular CPU. But after doing all these I'm still not getting the desired outcome. So, it seems like disabling interrupt of a particular CPU from kernel also disable interrupts for a user space process running on that CPU is not true. I'm confused.
I was looking for an answer to the above question but could not get any suitable answer.
Duration of the CPU state with disabled interrupts should be short, because it affects the whole OS. For that reason allowing user space code to be run with disabled interrupts is considered as bad practice and is not supported by the Linux kernel.
It is responsibility of the kernel module to wrap by local_irq_disable / local_irq_enable only the kernel code. Sometimes the kernel itself could "fix" incorrect usage of these functions, but that fact shouldn't be relied upon when write a module.
I have a process running in the userspace which I want to run without affected by interrupts and context switch.
Protection from the context switch could be achieved by proper setting of scheduling policy, affinity and priority of the process. That way, the scheduler will never attempt to reschedule your process. There are several questions on Stack Overflow about making a CPU to be exclusive for a selected process.
As for interrupts, they shouldn't be disabled for a user code.
If user code accesses some hardware which should have interrupts disabled, then consider moving your code into the kernel space.
If even rare interrupts badly affect on the performance of your process or its timing, then try to reconfigure Linux kernel to be "more real time". There are also some boot-time configuration options, which could help in further reducing number of interrupts on a specific core(s). See e.g. that question: Why does using taskset to run a multi-threaded Linux program on a set of isolated cores cause all threads to run on one core?.
Note, that Linux kernel is not a base for real-time OS and never intended to be. So, if no configuration and boot settings could help you, consider to choose for your application another OS, which is real time.

What is a TRAMPOLINE_ADDR for ARM and ARM64(aarch64)?

I am writing a basic check-pointing mechanism for ARM64 using PTrace in order to do so I am using some code from cryopid and I found a TRAMPOLINE_ADDR macro like the following:
#define TRAMPOLINE_ADDR 0x00800000 /* 8MB mark */ for x86
#define TRAMPOLINE_ADDR 0x00300000 /* 3MB mark */ for x86_64
So when I read about trampolines it is something related to jump statements. But my questions is from where the above values came and what would the corresponding values for the ARM and ARM64 platform.
Thank you
Just read the wikipedia page.
There is nothing magic about a trampoline or certainly a particular address, any address where you can have code that executes can hold a trampoline. there are many use cases for them...for example
say you are booting off of a flash, a spi flash, running at some safe rate so that the chip boots for all users. But you want to increase the rate of the spi flash and the spi peripheral does not allow you to change while executing code. So you would copy some code to ram, that code boosts the spi flash rate to a faster rate so you can use and/or run the flash faster, then you bounce back to running from the flash. you have bounced or trampolined off of that little bit of code in ram.
you have a chip that boots from flash, but has the ability to re-map that address space to ram for example, so you copy some code to some other ram, branch to it that little bit of trampoline code remaps the address space, then bounces you back or bounces you to where the flash is now mapped to or whatever.
you will see the gnu linker sometimes add a small trampoline, say you compile some modules as thumb and some others for arm, you no longer have to use that interwork thing, the linker takes care of cleaning this up, it may add an instruction or two to trampoline you between modes, sometimes it modifies the code to just go where it needs to sometimes it modifies the code to branch link somewhere close and that somewhere close is a trampoline.
I assume there may be a need to do the same thing for aarch64 if/when switching to that mode.
so there should be no magic. your specific application might have one or many trampolines, and the one you are interested might not even be called that, but is probably application specific, absolutely no reason why there would be one address for everyone, unless it is some very rigid operating specific (again "application specific") thing and one specific trampoline for that operating system is at some DEFINEd address.

Kernel freeze : How to debug it?

I have an embedded board with a kernel module of thousands of lines which freeze on random and complexe use case with random time. What are the solution for me to try to debug it ?
I have already try magic System Request but it does not work. I guess that the explanation is that I am in a loop or a deadlock in a code where hardware interrupt is disable ?
Thanks,
Eva.
Typically, embedded boards have a watch dog. You should enable this timer and use the watchdog user process to kick the watch dog hard ware. Use nice on the watchdog process so that higher priority tasks must relinquish the CPU. This gives clues as to the issue. If the device does not reset with a watch dog active, then it maybe that only the network or serial port has stopped communicating. Ie, the kernel has not locked up. The issue is that there is no user visible activity. The watch dog is also useful if/when this type of issue occurs in the field.
For a kernel lockup case, the lockup watchdogs kernel features maybe useful. This will work if you have an infinite loop/deadlock as speculated. However, if this is custom hardware, it is also possible that SDRAM or a peripheral device latches up and causes abnormal bus activity. This will stop the CPU from fetching proper code; obviously, it is tough for Linux to recover from this.
You can combine the watchdog with some fallow memory that is used as a trace buffer. memmap= and mem= can limit the memory used by the kernel. A driver/device using this memory can be written that saves trace points that survive a reboot. The fallow memory's ring buffer is dumped when a watchdog reset is detected on kernel boot.
It is also useful to register thread notifiers that can do a printk on context switches, if the issue is repeatable or to discover how to make the event repeatable. Once you determine a sequence of events that leads to the lockup, you can use the scope or logic analyzer to do some final diagnosis. Or, it maybe evident which peripheral is the issue at this point.
You may also set panic=-1 and reboot=... on the kernel command line. The kdump facilities are useful, if you only have a code problem.
Related: kernel trap (at web archive). This link may no longer be available, but aren't important to this answer.

TI MSP430 Interrupt Problems After UART Code Port

I am using the MSP430F2013 processor for an application, which doesn't have a UART. I need a UART, and so I used the TI's sample code "msp430x20x3_ta_uart2400.c" to emulate one using the Timer module. This all worked fine (compiled with IAR Embedded Workbench), having tested it using PuTTY to transmit characters to a development board and a loopback to echo them to the terminal.
That was a de-risking exercise, and now I've come to port that code into my application's state machine. Having done this, I'm having issues surrounding the timer interrupts and low power sleep modes. Here's the snippet of my code around the entry into the low power (sleep) mode:
// Prepare the UART to receive one byte.
prepare_receiver();
// Enter low power mode 1.
__bis_SR_register(LPM1_bits + GIE);
// Check whether the full message has been received.
if(true == get_message_complete())
{
process_event(e_euart_message_received, NULL);
}
What I'm seeing on the debugger (C-Spy) is that sometimes it will execute the bis_SR_register() line on first entry and then go to the if statement, i.e., ignoring the fact that I've asked it to go to sleep. On other occasions, when it does go to sleep when it should, the ISR triggers correctly and eventually brings me back to the if statement to continue program execution (as I'm expecting). However, if I try to step to the next statement, the application freezes on that first line, i.e., I can't advance.
I can't think of anything functionally different from TI's example that I'm doing, so I figure my problem must be something to do with how I've ported it. For example, my Timer ISR and the code I've posted here are in different compilation units - would this sort of decision have any bearing on things? I'm aware my question might be a little vague but unfortunately I can't post all of my code, so instead I'm looking for someone with MSP experience who might be able to suggest some things to look at or some potential pitfalls that I may have fallen into.
Debugging interrupts with C-Spy in Low Power Mode is going to be tricky. According to Section A.3 Debugging (C-Spy) - IAR User's Guide:
5) C-SPY can debug applications that utilize interrupts and low power modes
But there are some "gotchas" that you should be aware of that may be causing your headaches.
In particular:
14) When C-SPY has control of the device, the CPU is ON (that is, it is not in low-power mode) regardless of the settings of the low-power
mode bits in the status register. Any low-power mode conditions are
restored prior to Step or Go. Consequently, do not measure the power
consumed by the device while C-SPY has control of the device. Instead,
run your application using Go with JTAG released
19) C-SPY utilizes the system clock to control the device during
debugging. Therefore, device counters, etc., that are clocked by the
Main System Clock (MCLK) are affected when C-SPY has control of the
device. Special precautions are taken to minimize the effect upon the
Watchdog Timer. The CPU core registers are preserved. All other clock
sources (SMCLK, ACLK) and peripherals continue to operate normally
during emulation. In other words, the Flash Emulation Tool is a
partially intrusive tool.
Devices that support clock control (Emulator
→ Advanced → Clock Control) can further minimize these
effects by selecting to stop the clock(s) during debugging
24) Peripheral bits that are cleared when read during normal program
execution (that is, interrupt flags) are cleared when read while being
debugged (that is, memory dump, peripheral registers).
When using certain MSP430 devices (such as MSP430F15x, MSP430F16x,
MSP430F43x, and MSP430F44x devices), bits do not behave this way
(that is, the bits are not cleared by C-SPY read operations).
26) While single stepping with active and enabled interrupts, it can
appear that only the interrupt service routine (ISR) is active (that
is, the non-ISR code never appears to execute, and the single step
operation always stops on the first line of the ISR). However, this
behavior is correct because the device always processes an active and
enabled interrupt before processing non-ISR (that is, mainline) code.
A workaround for this behavior is, while within the ISR, to disable
the GIE bit on the stack so that interrupts are disabled after exiting
the ISR. This permits the non-ISR code to be debugged (but without
interrupts). Interrupts can later be reenabled by setting GIE in the
status register in the Register window.
On devices with the clock control emulation feature, it may be possible
to suspend a clock between single steps and delay an interrupt request
(Emulator → Advanced → Clock Control).
One thing to try is commenting out all the low power code and seeing if your UART code works like that. Then go back and try re-enabling the low power mode.
The answer to this question lies in the debugging setup and more specifically what types of breakpoints are being used. I had quite a complex series of macros that were running on program upload, which set various hooks into memory for testing purposes. These hooks relied on software breakpoints being created, which would then call functions outside of the application. I have seen no problem in using these breakpoints in normal use, however their existence means that the debugging session doesn't run in real-time (i.e., the device is under control of the host PC). This, for a reason yet not completely known to me, caused problems when trying to debug interrupts and low power modes. (I suspect that if I was to look a bit deeper, I would see the need to use clock control whilst debugging, but I'll save that for another day).
So, to solve this problem and allow me to debug my interrupt and low power mode heavy code, which I'd ported into my larger application state machine, I had to do the following:
Disable software breakpoints within IAR.They're not actually enabled by default, but if you've been doing clever things with macros like I had, you probably would've needed to enable them, since there just aren't enough hardware breakpoints available in most MSP430s (for instance, I only have two in the MSP430F2013, and C-SPY more often than not hogs one of those!). The obvious downside to this is that debugging becomes a bit more laborious, but at least it's reliable.
Remove links to .mac Macro files.In other words, if you're using macros, don't. In my case, this meant that I had to hack some state machine logic in order to force myself down a certain route (that previously the macro had been doing for me). This clearly isn't ideal, but it will allow you to debug the interrupt/low power mode code. The macros can then be re-enabled afterwards.
So it turned out that there wasn't a problem with my port after all. I'm not particularly happy with this hacky solution, but at least it's a step forward. If I have the time, I'll investigate to see if I can work out a way of using software breakpoints and add to this answer.

Resources