ESP32 Stack canary watchpoint triggered. Why?

I have a program that can encrypt and decrypt a text with Boneh-Franklin encryption. This works great on a PC, but for some reason causes a constant reboot on ESP32 with the following error message:
setup2
setup2.2
setup2.3
Guru Meditation Error: Core 1 panic'ed (Unhandled debug exception)
Debug exception reason: Stack canary watchpoint triggered (loopTask)
Core 1 register dump:
PC : 0x40083774 PS : 0x00060b36 A0 : 0x3ffb0120 A1 : 0x3ffb0060
A2 : 0x68efa751 A3 : 0x3ffb0938 A4 : 0x3ffb0720 A5 : 0xfb879c5c
A6 : 0x61b36b71 A7 : 0x0006970f A8 : 0x01709af4 A9 : 0x01709af4
A10 : 0xfaa5dfed A11 : 0x01a3ff3b A12 : 0x76651dec A13 : 0x00000001
A14 : 0x00000000 A15 : 0x04adbe74 SAR : 0x0000001e EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000 LBEG : 0x400f1cc5 LEND : 0x400f1cc9 LCOUNT : 0x00000000
ELF file SHA256: 0000000000000000
I use the Arduino ESP32 environment. CONFIG_ARDUINO_LOOP_STACK_SIZE is set to 8192 in main.cpp, and an 8 KB stack should be enough to run it. It works perfectly on a PC, so it's a mystery to me why it fails on the ESP32. Can anyone help? I have completely run out of ideas.
For the Boneh-Franklin implementation I used this library: https://github.com/miracl/core
My own code is ~200 lines, I have uploaded it to Google Drive: https://drive.google.com/file/d/1EY0mGC2UiVNhE68b5Q0VB9JIY2Owbpxg/view?usp=sharing

Looking at the code, it isn't unreasonable that the stack will require more than 8192 bytes as a lot of big objects are allocated on the stack, both in your code and in the library, e.g.:
loop()
    csprng RNG – 128 bytes
    ECP2 pPublic – 264 bytes
    ECP2 cipherPointU – 264 bytes
    ECP privateKey – 132 bytes
encrypt(...)
    ECP pointQId – 132 bytes
    char[] dst – 256 bytes
    BIG l – 40 bytes
    FP12 theta – 532 bytes
PAIR_fexp()
    FP2 X – 88 bytes
    BIG x – 40 bytes
    FP a, b – 2 * 44 = 88 bytes
    FP12 t0, y0, y1, y2, y3 – 5 * 532 = 2660 bytes
Increase your stack size. It will likely help.
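To see how quickly this adds up, here is a small host-side C sketch that just sums the sizes listed above. The struct and function names are made up for illustration, the byte counts are copied from the list rather than measured on the device, and it assumes all three functions are on the stack at the same time.

```c
#include <stddef.h>

/* Large locals of one function, using the sizes quoted in the list above. */
struct frame_estimate {
    const char *function;
    size_t bytes;
};

/* Hypothetical table: values copied from the answer, not measured. */
static const struct frame_estimate frames[] = {
    { "loop()",      128 + 264 + 264 + 132 },
    { "encrypt()",   132 + 256 + 40 + 532 },
    { "PAIR_fexp()", 88 + 40 + 2 * 44 + 5 * 532 },
};

/* Sum of the listed locals, assuming all three frames are live at once. */
size_t listed_stack_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < sizeof frames / sizeof frames[0]; i++) {
        total += frames[i].bytes;
    }
    return total;
}
```

listed_stack_bytes() returns 4624, so under that assumption the listed locals alone take more than half of the 8192-byte loop stack. That is before return addresses, saved registers, and whatever the other library functions in the call chain put on the stack.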

I got to this page by googling my own troubles, so I thought I would add my solution here.
I ran into this error recently and, using a series of Serial.println calls, traced it to an infinite loop in my code.

Related

FreeRTOS watchdog timeout

Guru Meditation Error: Core 1 panic'ed (Interrupt wdt timeout on CPU1).
Core 1 register dump:
PC : 0x4008c936 PS : 0x00060735 A0 : 0x8008b8ae A1 : 0x3ffbf25c
A2 : 0x3ffba74c A3 : 0x3ffb97b8 A4 : 0x00000004 A5 : 0x00060723
A6 : 0x00060723 A7 : 0x00000001 A8 : 0x3ffb97b8 A9 : 0x00000019
A10 : 0x3ffb97b8 A11 : 0x00000019 A12 : 0x3ffc2f24 A13 : 0x00060723
A14 : 0x007bf418 A15 : 0x003fffff SAR : 0x00000010 EXCCAUSE: 0x00000006
EXCVADDR: 0x00000000 LBEG : 0x4008491d LEND : 0x40084925 LCOUNT : 0x00000027
Core 1 was running in ISR context:
EPC1 : 0x400e2af7 EPC2 : 0x00000000 EPC3 : 0x00000000 EPC4 : 0x00000000
Backtrace: 0x4008c933:0x3ffbf25c |<-CORRUPTED
#0 0x4008c933:0x3ffbf25c in vListInsert at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/list.c:183
Core 0 register dump:
PC : 0x4008cad3 PS : 0x00060035 A0 : 0x8008b4d7 A1 : 0x3ffbeb3c
A2 : 0x3ffbf418 A3 : 0xb33fffff A4 : 0x0000abab A5 : 0x00060023
A6 : 0x00060021 A7 : 0x0000cdcd A8 : 0x0000abab A9 : 0xffffffff
A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3ffc2d34 A13 : 0x00000007
A14 : 0x007bf418 A15 : 0x003fffff SAR : 0x0000001a EXCCAUSE: 0x00000006
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace: 0x4008cad0:0x3ffbeb3c |<-CORRUPTED
#0 0x4008cad0:0x3ffbeb3c in compare_and_set_native at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_hw_support/include/soc/compare_set.h:25
(inlined by) spinlock_acquire at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_hw_support/include/soc/spinlock.h:103
(inlined by) xPortEnterCriticalTimeout at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/freertos/port/xtensa/port.c:288
ELF file SHA256: 90689eca1e9c1ace
Does anyone know what might be happening on the ESP32 to cause this error? If anyone is willing to help, I can provide the code.
I have already checked, and it is not any kind of hardware fault.
I came across this issue a couple of times. Sorry, my Portuguese is not perfect but I'll try to help in English.
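This happens when a function running in a task uses more stack than the task was given when it was created.
For example:
#include <string.h>
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

void task1(void *params)
{
    char buffer[3000];                    // 3000 bytes of locals on a 2048-byte stack
    memset(buffer, 'm', sizeof(buffer));  // writes past the end of the task's stack
    vTaskDelete(NULL);                    // a FreeRTOS task must not return
}

void app_main(void)
{
    xTaskCreate(task1, "task1", 2048, NULL, 1, NULL);  // 2048 = stack size in bytes
}
Notice that xTaskCreate gives the task only 2048 bytes of stack (on ESP-IDF the size is in bytes, unlike vanilla FreeRTOS, which uses words as the unit), but the function tries to use 3000 bytes.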
I hope this helps, if you can share a portion of the code that works with the memory, I can look more into it.

Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled

I am trying to run 6 tasks in parallel, each of which runs forever. But when the tasks start running, this error appears:
Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC : 0x400868b4 PS : 0x00060033 A0 : 0x80085442 A1 : 0x3ffb0b50
0x400868b4: xTaskIncrementTick at C:/Users/preet/esp/esp-idf/components/freertos/tasks.c:3157
A2 : 0x00000001 A3 : 0x80059301 A4 : 0x00000000 A5 : 0x00000132
A6 : 0x00000003 A7 : 0x00060023 A8 : 0x00000000 A9 : 0x3ffb0b30
A10 : 0x3ffb26b8 A11 : 0x00000003 A12 : 0x00060b20 A13 : 0x00060b23
A14 : 0x3ffb6300 A15 : 0x3ffb53fc SAR : 0x00000016 EXCCAUSE: 0x0000001c
EXCVADDR: 0x80059309 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000
Backtrace: 0x400868b1:0x3ffb0b50 0x4008543f:0x3ffb0b70 0x40085199:0x3ffb0b90 0x400826e9:0x3ffb0ba0 0x400e4b6b:0x3ffb6210 0x400d18ef:0x3ffb6230 0x40086462:0x3ffb6250 0x40087991:0x3ffb6270
0x400868b1: xTaskIncrementTick at C:/Users/preet/esp/esp-idf/components/freertos/tasks.c:3156
0x4008543f: xPortSysTickHandler at C:/Users/preet/esp/esp-idf/components/freertos/port/port_systick.c:167
0x40085199: _frxt_timer_int at C:/Users/preet/esp/esp-idf/components/freertos/port/xtensa/portasm.S:329
0x400826e9: _xt_lowint1 at C:/Users/preet/esp/esp-idf/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x400e4b6b: cpu_ll_waiti at C:/Users/preet/esp/esp-idf/components/hal/esp32/include/hal/cpu_ll.h:183
(inlined by) esp_pm_impl_waiti at C:/Users/preet/esp/esp-idf/components/esp_pm/pm_impl.c:837
0x400d18ef: esp_vApplicationIdleHook at C:/Users/preet/esp/esp-idf/components/esp_system/freertos_hooks.c:63
0x40086462: prvIdleTask at C:/Users/preet/esp/esp-idf/components/freertos/tasks.c:3973 (discriminator 1)
0x40087991: vPortTaskWrapper at C:/Users/preet/esp/esp-idf/components/freertos/port/xtensa/port.c:131
ELF file SHA256: 142fe637d0302132
Rebooting...

Why do I get the Debug exception reason: Stack canary watchpoint triggered (main)?

I'm writing a program for esp32-wroom-32 using esp-idf-v3.0.
I'm trying to add logs, which will be saved in fatfs.
After some logs I get:
21:54:21.306 -> Debug exception reason: Stack canary watchpoint triggered (main)
21:54:21.306 -> Register dump:
21:54:21.306 -> PC : 0x40089827 PS : 0x00060b36 A0 : 0x40082179 A1 : 0x3ffd3860
21:54:21.340 -> A2 : 0x3ff40000 A3 : 0x00000033 A4 : 0x00000033 A5 : 0x00000000
21:54:21.340 -> A6 : 0x00000024 A7 : 0xff000000 A8 : 0xe37fc000 A9 : 0x0000007e
21:54:21.340 -> A10 : 0x00000000 A11 : 0xffffffff A12 : 0x00000004 A13 : 0x00000001
21:54:21.340 -> A14 : 0x00000005 A15 : 0x00000000 SAR : 0x00000004 EXCCAUSE: 0x00000001
21:54:21.340 -> EXCVADDR: 0x00000000 LBEG : 0x400014fd LEND : 0x4000150d LCOUNT : 0xfffffff6
Why does it happen to main?
FreeRTOS task stack depth
This is quite likely caused by a stack overflow in your FreeRTOS task.
Increase the stack depth
The first thing I would do is increase the depth of the stack for your FreeRTOS task. E.g., if you created your task with a stack size of configMINIMAL_STACK_SIZE, this might be as low as 768 bytes - which is not adequate for a lot of common requirements.
How much to increase the stack depth by?
It is not easy to answer this, but - in this case - it may be adequate to simply increase it until you no longer have stack overflows. If you are concerned about not needlessly wasting memory, FreeRTOS includes a mechanism, uxTaskGetStackHighWaterMark(), that reports the smallest amount of free stack a task has ever had, i.e. how close it has come to overflowing its stack.
Buffers and canaries
A canary is just a marker at the end of a buffer - which is checked periodically. If it is changed from its default value, it means that the program has attempted to write beyond the end of the buffer - i.e. there has been a buffer overflow.
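As a rough illustration of the idea (not how ESP-IDF implements it internally), here is a host-side C sketch of a buffer followed by canary bytes and a check function. All names, the sizes, and the 0xA5 marker value are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE     16u    /* usable bytes (illustrative size) */
#define CANARY_SIZE  4u     /* guard bytes placed right after the buffer */
#define CANARY_BYTE  0xA5u  /* arbitrary marker value */

/* The usable buffer followed immediately by its canary bytes. */
struct guarded_buf {
    uint8_t bytes[BUF_SIZE + CANARY_SIZE];
};

/* Zero the usable part and write the marker into the guard bytes. */
void guarded_init(struct guarded_buf *g)
{
    memset(g->bytes, 0, BUF_SIZE);
    memset(g->bytes + BUF_SIZE, CANARY_BYTE, CANARY_SIZE);
}

/* Deliberately unchecked write, standing in for buggy code.
   The caller must keep len <= BUF_SIZE + CANARY_SIZE. */
void careless_fill(struct guarded_buf *g, uint8_t value, size_t len)
{
    memset(g->bytes, value, len);
}

/* Returns 1 while every guard byte still holds the marker, 0 otherwise. */
int canary_intact(const struct guarded_buf *g)
{
    for (size_t i = 0; i < CANARY_SIZE; i++) {
        if (g->bytes[BUF_SIZE + i] != CANARY_BYTE) {
            return 0;
        }
    }
    return 1;
}
```

Filling exactly BUF_SIZE bytes leaves canary_intact() returning 1. Filling even one byte more overwrites the first guard byte and the check returns 0, which is the point at which an overflow would be reported.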
Detection of stack overflow using canaries is enabled in ESP IDF by changing two options in configuration (under Component Config -> FreeRTOS section):
'Check for stack overflow' -> 'using canary bytes'
'Set a debug watchpoint as a stack overflow check' -> enabled
If you disable the second option, you will instead get a Guru Meditation error - a LoadProhibited exception - in the case of a stack overflow.
xTaskCreate() and stack depth
Bear in mind that the version of xTaskCreate() in the ESP IDF differs from that in the original FreeRTOS. In the original FreeRTOS, the stack depth is specified in words. In the ESP IDF, it's specified in bytes. A very important distinction!
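To make the difference concrete, here is a tiny host-side C sketch of how the same depth argument translates into bytes under the two conventions. The function names are hypothetical, and the word size is passed in because it depends on the port (4 bytes is assumed for a 32-bit target).

```c
#include <stddef.h>

/* Vanilla FreeRTOS: the depth argument counts stack words (StackType_t). */
size_t vanilla_stack_bytes(size_t depth_words, size_t word_size)
{
    return depth_words * word_size;
}

/* ESP-IDF: the same argument is already a byte count. */
size_t esp_idf_stack_bytes(size_t depth_bytes)
{
    return depth_bytes;
}
```

With a 4-byte word, a depth of 2048 means 8192 bytes in vanilla FreeRTOS but only 2048 bytes in ESP-IDF, so code ported from one to the other can end up with a quarter of the stack it expects.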

Measuring memory access time x86

I am trying to measure cached vs. non-cached memory access time, and the results confuse me.
Here is the code:
#include <stdio.h>
#include <string.h>   /* memset */
#include <x86intrin.h>
#include <stdint.h>

#define SIZE 32*1024

char arr[SIZE];

int main()
{
    char *addr;
    unsigned int dummy;
    uint64_t tsc1, tsc2;
    unsigned i;
    volatile char val;

    memset(arr, 0x0, SIZE);
    for (addr = arr; addr < arr + SIZE; addr += 64) {
        _mm_clflush((void *) addr);
    }
    asm volatile("sfence\n\t"
                 :
                 :
                 : "memory");

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(1) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

    tsc1 = __rdtscp(&dummy);
    for (i = 0; i < SIZE; i++) {
        asm volatile (
            "mov %0, %%al\n\t" // load data
            :
            : "m" (arr[i])
        );

    }
    tsc2 = __rdtscp(&dummy);
    printf("(2) tsc: %llu\n", (unsigned long long)(tsc2 - tsc1));

    return 0;
}
The output:
(1) tsc: 451248
(2) tsc: 449568
I expected the first value to be much larger, because the caches were invalidated by clflush in case (1).
Info about my CPU (Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz) caches:
Cache ID 0:
- Level: 1
- Type: Data Cache
- Sets: 64
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 1:
- Level: 1
- Type: Instruction Cache
- Sets: 128
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 4
- Total Size: 32768 bytes (32 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 2:
- Level: 2
- Type: Unified Cache
- Sets: 512
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 8
- Total Size: 262144 bytes (256 kb)
- Is fully associative: false
- Is Self Initializing: true
Cache ID 3:
- Level: 3
- Type: Unified Cache
- Sets: 8192
- System Coherency Line Size: 64 bytes
- Physical Line partitions: 1
- Ways of associativity: 12
- Total Size: 6291456 bytes (6144 kb)
- Is fully associative: false
- Is Self Initializing: true
Code disassembly between two rdtscp instructions
400614: 0f 01 f9 rdtscp
400617: 89 ce mov %ecx,%esi
400619: 48 8b 4d d8 mov -0x28(%rbp),%rcx
40061d: 89 31 mov %esi,(%rcx)
40061f: 48 c1 e2 20 shl $0x20,%rdx
400623: 48 09 d0 or %rdx,%rax
400626: 48 89 45 c0 mov %rax,-0x40(%rbp)
40062a: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%rbp)
400631: eb 0d jmp 400640 <main+0x8a>
400633: 8b 45 b4 mov -0x4c(%rbp),%eax
400636: 8a 80 80 10 60 00 mov 0x601080(%rax),%al
40063c: 83 45 b4 01 addl $0x1,-0x4c(%rbp)
400640: 81 7d b4 ff 7f 00 00 cmpl $0x7fff,-0x4c(%rbp)
400647: 76 ea jbe 400633 <main+0x7d>
400649: 48 8d 45 b0 lea -0x50(%rbp),%rax
40064d: 48 89 45 e0 mov %rax,-0x20(%rbp)
400651: 0f 01 f9 rdtscp
It looks like I am missing or misunderstanding something. Any suggestions?
mov %0, %%al is so slow (one cache line per 64 clocks, or per 32 clocks on Sandybridge specifically (not Haswell or later)) that you might bottleneck on that whether or not your loads are ultimately coming from DRAM or L1D.
Only every 64-th load will miss in cache, because you're taking full advantage of spatial locality with your tiny byte-load loop. If you actually wanted to test how fast the cache can refill after flushing an L1D-sized block, you should use a SIMD movdqa loop, or just byte loads with a stride of 64. (You only need to touch one byte per cache line).
To avoid the false dependency on the old value of RAX, you should use movzbl %0, %eax. This will let Sandybridge and later (or AMD since K8) use their full load throughput of 2 loads per clock to keep the memory pipeline closer to full. Multiple cache misses can be in flight at once: Intel CPU cores have 10 LFBs (line fill buffers) for lines to/from L1D, or 16 Superqueue entries for lines from L2 to off-core. See also Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. (Many-core Xeon chips have worse single-thread memory bandwidth than desktops/laptops.)
But your bottleneck is far worse than that!
You compiled with optimizations disabled, so your loop uses addl $0x1,-0x4c(%rbp) for the loop counter, which gives you at least a 6-cycle loop-carried dependency chain. (Store/reload store-forwarding latency + 1 cycle for the ALU add.) http://agner.org/optimize/
(Maybe even higher because of resource conflicts for the load port. i7-720 is a Nehalem microarchitecture, so there's only one load port.)
This definitely means your loop doesn't bottleneck on cache misses, and will probably run about the same speed whether you used clflush or not.
Also note that rdtsc counts reference cycles, not core clock cycles. i.e. it will always count at 1.6GHz on your 1.6GHz CPU, regardless of the CPU running slower (powersave) or faster (Turbo). Control for this with a warm-up loop.
You also didn't declare a clobber on eax, so the compiler isn't expecting your code to modify rax. You end up with mov 0x601080(%rax),%al. But gcc reloads rax from memory every iteration, and doesn't use the rax that you modify, so you aren't actually skipping around in memory like you might be if you'd compiled with optimizations.
Hint: use volatile char * if you want to get the compiler to actually load, and not optimize it to fewer wider loads. You don't need inline asm for this.
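A minimal sketch of the stride-64 idea using a volatile pointer instead of inline asm. The function name touch_each_line and the CACHE_LINE constant are made up here; 64 is the line size from the cache info in the question.

```c
#include <stddef.h>

#define CACHE_LINE 64u  /* line size reported for this CPU in the question */

/* Read one byte from each cache line of [base, base + size).
   The volatile qualifier forces the compiler to emit every load.
   Returns the number of loads performed. */
size_t touch_each_line(const volatile char *base, size_t size)
{
    size_t loads = 0;
    for (size_t off = 0; off < size; off += CACHE_LINE) {
        char sink = base[off];  /* volatile read: one load per cache line */
        (void)sink;
        loads++;
    }
    return loads;
}
```

To use it for the measurement, call it once right after the clflush loop and once more afterwards, reading the TSC around each call, and compile with optimizations enabled so the loop counter stays in a register. For the 32 KiB array in the question that is 512 loads per pass, each to a different line.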

ACR122 - Card Emulation

How can I get the NFC contactless reader ACR122U to behave as a tag (card emulation mode)?
The prospectus claims that the device can do card emulation, but the SDK does not seem to provide an example or documentation for this feature.
Does anybody know how to do this?
Is there additional software required?
Please note that my target platform is MS Windows.
Thanks in advance
For "Card Emulation" or in other words, "Configure as target and wait for initiators", please refer to here: http://code.google.com/p/nfcip-java/source/browse/trunk/nfcip-java/doc/ACR122_PN53x.txt
** Command to PN532 **
0xd4 0x8c TgInitAsTarget instruction code
0x00 Acceptable modes
(0x00 = allow all, 0x01 = only allow to be
initialized as passive, 0x02 = allow DEP only)
_6 bytes (_MIFARE_)_:
0x08 0x00 SENS_RES
0x12 0x34 0x56 NFCID1
0x40 SEL_RES
_18 bytes (_Felica_)_:
0x01 0xfe 0xa2 0xa3 0xa4 0xa5 0xa6 0xa7
NFCID2
0xc0 0xc1 0xc2 0xc3 0xc4 0xc5 0xc6 0xc7
?
0xff 0xff System parameters?
0xaa 0x99 0x88 0x77 0x66 0x55 0x44 0x33 0x22 0x11
NFCID3
0x00 ?
0x00 ?
This is the response when an initiator activated this target:
** Response from PN532 **
0xd5 0x8d TgInitAsTarget response code
0x04 Mode
(0x04 = DEP, 106kbps)
Let me know if it works!
You can also try sending the following APDU (in hex) to put the reader into "Card emulation" mode:
FF 00 00 00 27 D4 8C 00 08 00 12 34 56 40 01 FE A2 A3 A4 A5 A6 A7 C0 C1 C2 C3 C4 C5 C6 C7 FF FF AA 99 88 77 66 55 44 33 22 11 00 00
For getting the ACR122 (or rather the PN532 NFC controller chip inside it) into card emulation mode, you would do about the following:
ReadRegister:
> FF000000 08 D406 6305 630D 6338
< D507 xx yy zz 9000
Update register values:
xx = xx | 0x004; // CIU_TxAuto |= InitialRFOn
yy = yy & 0x0EF; // CIU_ManualRCV &= ~ParityDisable
zz = zz & 0x0F7; // CIU_Status2 &= ~MFCrypto1On
WriteRegister:
> FF000000 11 D408 6302 80 6303 80 6305 xx 630D yy 6338 zz
< D509 9000
SetParameters:
> FF000000 03 D412 30
< D513 9000
TgInitAsTarget
> FF000000 27 D48C 05 0400 123456 20 000000000000000000000000000000000000 00000000000000000000 00 00
< D58D xx ... 9000
Where xx should be equal to 0x08.
Communicate using a sequence of TgGetData and TgSetData commands:
> FF000000 02 D486
< D587 xx <C-APDU> 9000
Where xx is the status code (should be 0x00 for success) and C-APDU is the command sent from the reader.
> FF000000 yy D48E <R-APDU>
< D58F xx 9000
Where yy is 2 + the length of the R-APDU (response) and xx is the status code (should be 0x00 for success).
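Two of the steps above involve a little computation: the register bit updates and the length byte of the TgSetData command. Here is a C sketch of just those two pieces. The struct and function names are made up, and exchanging the bytes with the reader is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values read back from registers 0x6305, 0x630D and 0x6338
   (CIU_TxAuto, CIU_ManualRCV and CIU_Status2 in the steps above). */
struct ciu_regs {
    uint8_t tx_auto;
    uint8_t manual_rcv;
    uint8_t status2;
};

/* Apply the three bit updates from the "Update register values" step. */
void prepare_for_emulation(struct ciu_regs *r)
{
    r->tx_auto    |= 0x04;  /* set InitialRFOn */
    r->manual_rcv &= 0xEF;  /* clear ParityDisable */
    r->status2    &= 0xF7;  /* clear MFCrypto1On */
}

/* Build "FF000000 yy D48E <R-APDU>" where yy = 2 + R-APDU length.
   Returns the total length, or 0 if it does not fit. */
size_t build_tg_set_data(const uint8_t *rapdu, size_t rapdu_len,
                         uint8_t *out, size_t out_cap)
{
    if (rapdu_len > 0xFF - 2 || out_cap < rapdu_len + 7) {
        return 0;
    }
    out[0] = 0xFF; out[1] = 0x00; out[2] = 0x00; out[3] = 0x00;
    out[4] = (uint8_t)(rapdu_len + 2);  /* D4 8E plus the R-APDU */
    out[5] = 0xD4; out[6] = 0x8E;       /* TgSetData */
    memcpy(out + 7, rapdu, rapdu_len);
    return rapdu_len + 7;
}
```

For a bare two-byte R-APDU of 90 00, the length byte comes out as 04 and the whole command is FF 00 00 00 04 D4 8E 90 00.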
You can use LibNFC. It has example code for this.
I still never got this working properly in Windows unfortunately. You will probably have to compile libnfc for specific drivers.
Also, the ACR122U seems to be pretty poorly supported by many libraries. Apparently it's not really designed for this use. There are particular issues with card emulation too (such as the timeout). We all really need to stop buying the ACR122U. I just bought what was popular and easy to get hold of, but I regret it now.
To future browsers/searchers coming across this: please check the compatibility section on the libnfc site and buy something that they recommend!
