Flash ECC algorithm on STM32L1xx

Flash ECC algorithm on STM32L1xx - algorithm

How does the flash ECC algorithm (Flash Error Correction Code) implemented on STM32L1xx work?
Background:
I want to do multiple incremental writes to a single word in program flash of a STM32L151 MCU without doing a page erase in between. Without ECC, one could set bits incrementally, e.g. first 0x00, then 0x01, then 0x03 (STM32L1 erases bits to 0 rather than to 1), etc. As the STM32L1 has 8 bit ECC per word, this method doesn't work. However, if we knew the ECC algorithm, we could easily find a short sequence of values, that could be written incrementally without violating the ECC.
We could simply try different sequences of values and see which ones work (one such sequence is 0x0000001, 0x00000101, 0x00030101, 0x03030101), but if we don't know the ECC algorithm, we can't check, whether the sequence violates the ECC, in which case error correction wouldn't work if bits would be corrupted.
[Edit] The functionality should be used to implement a simple file system using STM32L1's internal program memory. Chunks of data are tagged with a header, which contains a state. Multiple chunks can reside on a single page. The state can change over time (first 'new', then 'used', then 'deleted', etc.). The number of states is small, but it would make things significantly easier, if we could overwrite a previous state without having to erase the whole page first.

Thanks for any comments! As there are no answers so far, I'll summarize, what I found out so far (empirically and based on comments to this answer):
According to the STM32L1 datasheet "The whole non-volatile memory embeds the error correction code (ECC) feature.", but the reference manual doesn't state anything about ECC in program memory.
The datasheet is in line with what we can find out empirically when subsequentially writing multiple words to the same program mem location without erasing the page in between. In such cases some sequences of values work while others don't.
The following are my personal conclusions, based on empirical findings, limited research and comments from this thread. It's not based on official documentation. Don't build any serious work on it (I won't either)!
It seems, that the ECC is calculated and persisted per 32-bit word. If so, the ECC must have a length of at least 7 bit.
The ECC of each word is probably written to the same nonvolatile mem as the word itself. Therefore the same limitations apply. I.e. between erases, only additional bits can be set. As stark pointed out, we can only overwrite words in program mem with values that:
Only set additional bits but don't clear any bits
Have an ECC that also only sets additional bits compared to the previous ECC.
If we write a value, that only sets additional bits, but the ECC would need to clear bits (and therefore cannot be written correctly), then:
If the ECC is wrong by one bit, the error is corrected by the ECC algorithm and the written value can be read correctly. However, ECC wouldn't work anymore if another bit failed, because ECC can only correct single-bit errors.
If the ECC is wrong by more than one bit, the ECC algorithm cannot correct the error and the read value will be wrong.
We cannot (easily) find out empirically, which sequences of values can be written correctly and which can't. If a sequence of values can be written and read back correctly, we wouldn't know, whether this is due to the automatic correction of single-bit errors. This aspect is the whole reason for this question asking for the actual algorithm.
The ECC algorithm itself seems to be undocumented. Hamming code seems to be a commonly used algorithm for ECC and in AN4750 they write, that Hamming code is actually used for error correction in SRAM. The algorithm may or may not be used for STM32L1's program memory.
The STM32L1 reference manual doesn't seem to explicitely forbid multiple writes to program memory without erase, but there is no documentation stating the opposit either. In order not to use undocumented functionality, we will refrain from using such functionality in our products and find workarounds.

Interessting question.
First I have to say, that even if you find out the ECC algorithm, you can't rely on it, as it's not documented and it can be changed anytime without notice.
But to find out the algorithm seems to be possible with a reasonable amount of tests.
I would try to build tests which starts with a constant value and then clearing only one bit.
When you read the value and it's the start value, your bit can't change all necessary bits in the ECC.
Like:
for <bitIdx>=0 to 31
earse cell
write start value, like 0xFFFFFFFF & ~(1<<testBit)
clear bit <bitIdx> in the cell
read the cell
next
If you find a start value where the erase tests works for all bits, then the start value has probably an ECC of all bits set.
Edit: This should be true for any ECC, as every ECC needs always at least a difference of two bits to detect and repair, reliable one defect bit.
As the first bit difference is in the value itself, the second change needs to be in the hidden ECC-bits and the hidden bits will be very limited.
If you repeat this test with different start values, you should be able to gather enough data to prove which error correction is used.

Related

Why is non-zeroed memory only a problem with big data usage?

I was doing a graded programming assignment — an implementation of Rope data structure. The grader fed it an initial string and a series of edit operations. I did my development in C++ on a Linux machine. After testing my solution locally with small inputs (a string of ca 10 chars) I posted it to the grader, but got Segmentation Fault on one of the test cases.
I have generated a random input data with the maximum size given in the assignment specs (the string of 300k characters). I also got the Segmentation Fault locally. After a short debugging I found out that the leaves of my tree had random left and right pointers instead of NULL. After replacing the new Vertex calls with new Vertex() (the latter calls the default constructor, unlike the former which leaves the memory as-is) the code worked fine and got accepted by the grader.
This however makes me wonder — why did my code work correctly with a small input, both locally and on the grader’s machine? Is some amount of heap guaranteed to be zeroed when I run a process? Is this an artifact of some previously run program? What exactly is happening here?

Uninitialised objects can have any value. Uninitialised pointers can contain null, they can contain valid pointers by coincidence, or contain invalid pointers. It is completely undefined. Your program will behave accordingly. And it’s quite possible that memory is filled with some amount of zeroes followed by some amount of rubbish.
There may be a compiler option that will fill uninitialised variables with data that is likely to lead to a crash. More likely, there may be compiler options warning you when you use an uninitialised variable.

Difference between direct and indirect CRC

I have seen two different kinds of CRC algorithms. The one kind is called "direct" the other kind is called "non-direct" or "indirect". The code for both is a bit different. Both are able to calculate the same checksum if direct type is supplied with a converted initial value.
I can successfully run both algorithms and I know how to convert the initial value. So this is no problem.
What I couldn't find out: Why do these two algorithms exist? Is there something that one can do what the other can't? Are they redundant from the user's point of view?
UPDATE You can find a testable online implementation (and C implementations of both aglorithms) here. However these terms (or one of them) are mentioned in some more places. Like here ("direct table algorithm"), in a microcontroller reference document, in forums etc.

The "direct" is referring to how to avoid processing n zero bits at the end for an n-bit CRC.
The mathematical definition of the CRC is a division of the message with n zero bits appended to it. You can avoid the extra operations by exclusive-oring the message with the CRC before operating on it instead of after. This requires processing the initial value of the register in the normal version through the CRC, and having that be the new initial value.
Since it is not necessary, you will never see a real-world CRC algorithm doing the extra operations.
See the section "10. A Slightly Mangled Table-Driven Implementation" in the document you link for a more detailed explanation.

Packed and encrypted section in x86 reversing challenge, without tripping entropy heuristics

TASK:
I'm building a set of x86 assembly reverse engineering challenges, of which I have twenty or so already completed. They're just for fun / education.
The current challenge is one of the more advanced ones, and involves some trickery that makes it look like the EP is actually in the normal program, but it's actually packed away in another PE section.
Heres' the basic flow:
Starts out as if it were a normal MSVC++ application.
Injected a sneaky call away to a bunch of anti-debugger tricks.
If they pass, a DWORD in memory is set to 1.
Later in the program flow, it checks for that value being 1, and if it works it decrypts a small call table. If it fails, it sends them off on a wild goose chase of fake anti-debug tricks and eventually just crashes.
The call table points to the real decryption routines that decrypt the actual program code section.
The decryption routines are called, and they decrypt using a basic looped xor (C^k^n where C is ciphertext, k is a 32-bit key and n is the current data offset)
VirtualProtect is used to switch the section's protection flags from RW to RX.
Control flow is redirected to OEP, program runs.
The idea is that since they think they're in normal program flow, it makes them miss the anti-debug call and later checks. Anyway, that all works fine.
PROBLEM:
The current problem is that OllyDbg and a few other tools look at the packed section and see that it has high entropy, and throw up a warning that it's packed. The code section pointer in the PE header is correctly set, so it doesn't get this from having EP outside code - it's purely an entropy analysis thing.
QUESTION:
Is there an encryption method I can use that preserves low entropy, but is still easy to implement in x86 asm? I don't want to use a plain xor, since it's too easy, but I also don't want it to catch it as packed and give the game away.
I thought of something like a shuffler (somehow produce a keystream and use it to swap 4-byte blocks of code around), but I'm not sure that this is going to work, or even be simple.
Anyone got any ideas?

Actually, OllyDbg works like this pseudocode:
useful_bytes = number_of_bytes_in_section - count_bytes_with_values(0x00, 0x90, 0xCC)
warn about compression if useful_bytes > 0x2000 and count_bytes_with_values(0xFF, 0xE8, 0x8B, 0x89, 0x83) / useful_bytes < 0.075
So, the way to avoid that warning is to use enough bytes with the values 0xFF 0xE8 0x8B 0x89 0x83 in the compressed section.

Don't pack/encrypt your entire program code. Just encrypt a small percentage of bytes, randomly selected from your program code. If they're not decrypted, the program will soon crash if it tries to run the code anyway - and because the majority of the program is unchanged, entropy-based checks won't be set off.

What about simply reversing the bytes (from last to first)? Intel assembler instructions aren't fixed length, so this would shuffle them a little. Or you could simply rotate each byte by a fixed amount...

EDIT: Wrong guess, this is not how Olly works. See my other answer. This still applies to tools other than OllyDbg that calculates entropy.
Expanding on ninjaljs comment:
While I haven't checked, the entropy value OllyDbg calculates is likely bytewise, without context. See How to calculate the entropy of a file? for a common algorithm for doing this.
This algorithm gives that the sequence 0 1 2 ... 254 255 have the maximum entropy possible, despite being completely predictable. A sequence of random bytes between 0 and 255 would get slightly lower entropy, since it won't have exactly the same number of each possible value.
Some quick checks on uncompressed executables with pefile tells me that uncompressed x86 code has entropy of about 6.3 to 6.6. Compressed code with entropy 8.0, encoded with base64, has entropy 6.0. Thus, base64 is easily enough to stop this algorithm from finding compressed code.

Is it fastest to access a byte than a bit? Why?

The question is very straight: is it fastest to access a byte than a bit? If I store 8 booleans in a byte will it be slower when I have to compare them than if I used 8 bytes? Why?

Chances are no. The smallest addressable unit of memory in most machines today is a byte. In most cases, you can't address or access by bit.
In fact, accessing a specific bit might be even more expensive because you have to build a mask and use some logic.
EDIT:
Your question mentions "compare", I'm not sure exactly what you mean by that. But in some cases, you perform logic very efficiently on multiple booleans using bitwise operators if your booleans are densely packed into larger integer types.
As for which to use: array of bytes (with one boolean per byte), or a densely packed structure with one boolean per bit is a space-effiicency trade-off. For some applications that need to store a massive amount of bools, dense packing is better since it saves memory.

The underlying hardware that your code runs on is built to access bytes (or longer words) from memory. To read a bit, you have to read the entire byte, and then mask off the bits you don't care about, and possibly also shift to get the bit into the ones position. So the instructions to access a bit are a superset of the instructions to access a byte.

It may be faster to store the data as bits for a different reason - if you need to traverse and access many 8-bit sets of flags in a row. You will perform more ops per boolean flag, but you will traverse less memory by having it packed in fewer bytes. You will also be able to test multiple flags in a single operation, although you may be able to do this with bools to some extent as well, as long as they lie within a single machine word.
The memory latency penalty is far higher than register bit twiddling. In the end, only profiling the code on the hardware on which it will actually run will tell you which way is best.

From a hardware point of view, I would say that in general all the bit masking and other operations in the best case might occur within a single clock (resulting in no different), but that entirely depends on hardware layer that you likely won't ever know the specifics of, and as such you cannot bank on it.
It's worth pointing out that things like the .NET system.collections.bitarray uses a 32bit integer array underneath to store it's bit data. There is likely a performance reason behind this implementation (even if only in a general case that 32bit words perform above average), I would suggest reading up about the inner workings of that might be revealing.
From a coding point of view, it really depends what you're going to do with the bits afterwards. That is to say if you're going to store your data in booleans such as:
bool a0, a1, a2, a3, a4, a5, a6, a7;
And then in your code you compare them one by one (and most of them together):
if ( a0 && a1 && !a2 && a3 && !a4 && (!a5 || a6) || a7) {
...
}
Then you will find that it will be faster (and likely neater in code) to use a bit mask. But really the only time this would matter is if you're going to be running this code millions of times in a high performance or time critical environment.
I guess what I'm getting at here is that you should do whatever your coding standards say (and if you don't have any or they don't consider such details then just do what looks neatest for your application and need).
But I highly suggest trying to look around and read a blog or two explaining the inner workings of the .NET system.collections.bitarray.

This depends on the kind of processor and motherboard data bus, i.e. 32 bit data bus will compare your data faster if you collect them into "word"s rather than "bool"s or "byte"s....
This is only valid when you are writing in assembly language when you can compare each instruction how many cycles it takes .... but since you are using compiler then it is almost the same.
However, collecting booleans into words or integers will be useful in saving memory required for variables.

Computers tend to access things in words. Accessing a bit is slower because it requires more effort:
Imagine I said something to you, then said "oh change my second word to instead".
Now imagine my edit instead was "oh, change the third letter in the second word to 's'".
Which requires more thinking on your part?

Win32 EXCEPTION_INT_OVERFLOW vs EXCEPTION_INT_DIVIDE_BY_ZERO

I have a question about the EXCEPTION_INT_OVERFLOW and EXCEPTION_INT_DIVIDE_BY_ZERO exceptions.
Windows will trap the #DE errors generated by the IDIV instruction and will end up generating and SEH exception with one of those 2 codes.
The question I have is how does it differentiate between the two conditions? The information about idiv in the Intel manual indicates that it will generate #DE in both the "divide by zero" and "underflow cases".
I took a quick look at the section on the #DE error in Volume 3 of the intel manual, and the best I could gather is that the OS must be decoding the DIV instruction, loading the divisor argument, and then comparing it to zero.
That seems a little crazy to me though. Why would the chip designers not use a flag of some sort to differentiate between the 2 causes of the error? I feel like I must be missing something.
Does anyone know for sure how the OS differentiates between the 2 different causes of failure?

Your assumptions appear to be correct. The only information available on #DE is CS and EIP, which gives the instruction. Since the two status codes are different, the OS must be decoding the instruction to determine which.
I'd also suggest that the chip makers don't really need two separate interrupts for this case, since anything divided by zero is infinity, which is too big to fit into your destination register.
As for "knowing for sure" how it differentiates, all of those who do know are probably not allowed to reveal it, either to prevent people exploiting it (not entirely sure how, but jumping into kernel mode is a good place to start looking to exploit) or making assumptions based on an implementation detail that may change without notice.
Edit: Having played with kd I can at least say that on the particular version of Windows XP (32-bit) I had access to (and the processor it was running on) the nt!Ki386CheckDivideByZeroTrap interrupt handler appears to decode the ModRM value of the instruction to determine whether to return STATUS_INTEGER_DIVIDE_BY_ZERO or STATUS_INTEGER_OVERFLOW.
(Obviously this is original research, is not guaranteed by anyone anywhere, and also happens to match the deductions that can be made based on Intel's manuals.)

Zooba's answer summarizes the Windows parses the instruction to find out what to raise.
But you cannot rely on that the routine correctly chooses the code.
I observed the following on 64 bit Windows 7 with 64 bit DIV instructions:
If the operand (divisor) is a memory operand it always raises EXCEPTION_INT_DIVIDE_BY_ZERO, regardless of the argument value.
If the operand is a register and the lower dword is zero it raises EXCEPTION_INT_DIVIDE_BY_ZERO regardless if the upper half isn't zero.
Took me a day to find this out... Hope this helps.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio