Any byte sequences that can not be present in valid x86 code?

Any byte sequences that can not be present in valid x86 code? - gcc

Any byte sequences that can not be present in valid x86 code?
I'm looking for a byte sequence (or sequences), to inject into an x86 program compiled using GCC, that can not show up in the binary as a by product of compilation.
The reason is that I want these byte sequences to act as "labels", so that I can recognize them later during inspection.
Is it possible to construct patterns of bytes, so that, searching through the binary, these patterns will not show up except with very small probability (I prefer probability zero). In other words, I want to minimize the number of false positives!

There are sequences that today are not a valid encoding of any instruction.
Rather than digging in the opcode table present in the Intel Manual 2 you can exploit two facts of the x86 architecture:
The maximum instruction length is 15 bytes.
You can repeat prefixes.
These should also be more stable across generations than reserved opcodes.
The sequence 666666666666666666666666666666 (15 operand-size override prefixes, but any prefix will do) will generate an #UD exception because it is invalid.
For what it's worth, there is a specific instruction that fulfills the role of invalid instruction: ud2.
It's presence in a binary module is possible but its more idiomatic than an invalid encoding and it is standard, for example Linux uses it to mark a bug for if ud2 is the execution flow, the code behind it cannot be valid.
That said, if I got you right, that's not going to be useful to you.
You want to skip the process of decoding the instructions and scan the code section of the binary instead.
There is no guarantee that the code section will contain only code, for example ARM compilers generate literal pools - that's definitively uncommon on x86 though.
However the compilers usually align functions to a specific boundary (usually 16 bytes), this can be done in several ways - like stretching the previous function or with a mere padding.
This padding can be a sequence of bytes of any value - hence arbitrary bytes can be present in the code section.
Long story short, there is no universal byte sequence that appear with probability zero in the code section.
Everything that it's not in the execution flow can have any value.
We will deal with probability later, for now lets assume the 66..66h appears rarely enough in an executable.
You can't just use it directly, as 66..66h can be part of two instructions and thus be a valid sequence:
mov rax, 6666666666666666h
db 66h, 66h, 66h , 66h
db 66h, 66h, 66h
nop
is valid.
This is due to the immediate operands of instructions - the biggest immediate can be 8 bytes in length (as today), so the sequence must be lengthen to 15 + 8 = 23 bytes.
If you really want to be safe again future features, you can use a sequence of 14 + 15 = 29 bytes (for the 15-byte instruction length limit).
It's possible to find 23/29 bytes of value 66h in the code section or in the whole binary.
But how probable is that?
If the bytes in a binary were uniformly random then the probability would be astronomically small: 256-23 = 2-184.
Well, the point is that the bytes in a binary are not uniformly random.
You can open a file with an embedded icon to confirm that.
You can make the probability arbitrarily small by stretching the sequence - it's up to you to find a compromise between the length and an acceptable number of false positives.
It's unclear what you want to do but here some advice:
Most, if not all, building tools support generating a map file.
It is a file with all the symbols/names and their addresses.
If you could use actual labels (with a prefix and a random suffix) you'd collect them easily after the build.
Most output formats can be enriched with meta-information.
You can add an ELF/PE section with a table of offsets to the locations you want to mark.

Related

Flash ECC algorithm on STM32L1xx

How does the flash ECC algorithm (Flash Error Correction Code) implemented on STM32L1xx work?
Background:
I want to do multiple incremental writes to a single word in program flash of a STM32L151 MCU without doing a page erase in between. Without ECC, one could set bits incrementally, e.g. first 0x00, then 0x01, then 0x03 (STM32L1 erases bits to 0 rather than to 1), etc. As the STM32L1 has 8 bit ECC per word, this method doesn't work. However, if we knew the ECC algorithm, we could easily find a short sequence of values, that could be written incrementally without violating the ECC.
We could simply try different sequences of values and see which ones work (one such sequence is 0x0000001, 0x00000101, 0x00030101, 0x03030101), but if we don't know the ECC algorithm, we can't check, whether the sequence violates the ECC, in which case error correction wouldn't work if bits would be corrupted.
[Edit] The functionality should be used to implement a simple file system using STM32L1's internal program memory. Chunks of data are tagged with a header, which contains a state. Multiple chunks can reside on a single page. The state can change over time (first 'new', then 'used', then 'deleted', etc.). The number of states is small, but it would make things significantly easier, if we could overwrite a previous state without having to erase the whole page first.

Thanks for any comments! As there are no answers so far, I'll summarize, what I found out so far (empirically and based on comments to this answer):
According to the STM32L1 datasheet "The whole non-volatile memory embeds the error correction code (ECC) feature.", but the reference manual doesn't state anything about ECC in program memory.
The datasheet is in line with what we can find out empirically when subsequentially writing multiple words to the same program mem location without erasing the page in between. In such cases some sequences of values work while others don't.
The following are my personal conclusions, based on empirical findings, limited research and comments from this thread. It's not based on official documentation. Don't build any serious work on it (I won't either)!
It seems, that the ECC is calculated and persisted per 32-bit word. If so, the ECC must have a length of at least 7 bit.
The ECC of each word is probably written to the same nonvolatile mem as the word itself. Therefore the same limitations apply. I.e. between erases, only additional bits can be set. As stark pointed out, we can only overwrite words in program mem with values that:
Only set additional bits but don't clear any bits
Have an ECC that also only sets additional bits compared to the previous ECC.
If we write a value, that only sets additional bits, but the ECC would need to clear bits (and therefore cannot be written correctly), then:
If the ECC is wrong by one bit, the error is corrected by the ECC algorithm and the written value can be read correctly. However, ECC wouldn't work anymore if another bit failed, because ECC can only correct single-bit errors.
If the ECC is wrong by more than one bit, the ECC algorithm cannot correct the error and the read value will be wrong.
We cannot (easily) find out empirically, which sequences of values can be written correctly and which can't. If a sequence of values can be written and read back correctly, we wouldn't know, whether this is due to the automatic correction of single-bit errors. This aspect is the whole reason for this question asking for the actual algorithm.
The ECC algorithm itself seems to be undocumented. Hamming code seems to be a commonly used algorithm for ECC and in AN4750 they write, that Hamming code is actually used for error correction in SRAM. The algorithm may or may not be used for STM32L1's program memory.
The STM32L1 reference manual doesn't seem to explicitely forbid multiple writes to program memory without erase, but there is no documentation stating the opposit either. In order not to use undocumented functionality, we will refrain from using such functionality in our products and find workarounds.

Interessting question.
First I have to say, that even if you find out the ECC algorithm, you can't rely on it, as it's not documented and it can be changed anytime without notice.
But to find out the algorithm seems to be possible with a reasonable amount of tests.
I would try to build tests which starts with a constant value and then clearing only one bit.
When you read the value and it's the start value, your bit can't change all necessary bits in the ECC.
Like:
for <bitIdx>=0 to 31
earse cell
write start value, like 0xFFFFFFFF & ~(1<<testBit)
clear bit <bitIdx> in the cell
read the cell
next
If you find a start value where the erase tests works for all bits, then the start value has probably an ECC of all bits set.
Edit: This should be true for any ECC, as every ECC needs always at least a difference of two bits to detect and repair, reliable one defect bit.
As the first bit difference is in the value itself, the second change needs to be in the hidden ECC-bits and the hidden bits will be very limited.
If you repeat this test with different start values, you should be able to gather enough data to prove which error correction is used.

x86 decoding instruction opcode byte

I'm creating an x86 decoder and I'm struggling on understanding and finding an efficient way to calculate the mnemonic of an instruction.
I know that the opcode 6 MSBs are the opcode bits, but I can't find anywhere that use those 6 bits in a mnemonic table. The only mnemonic table I find is for the whole opcode byte itself and not just the 6 MSBs.
I wanted to ask what are some efficient ways I can go on decoding the mnemonics encoded in the opcode byte, and if there're any table references using the 6 MSBs and not the whole opcode byte.

But isn't there an efficient way to store a table for the mnemonics without duplicates?
This has become an algorithms and data structures question.
As you point out, many of the opcode table entries (at least for the table without a 0f escape byte: http://sparksandflames.com/files/x86InstructionChart.html) do repeat in groups of 4 or 2, i.e. with the same 6 or 7-bit prefix selecting the same mnemonic.
Obviously a 256-entry table of structs is simple, but duplicates things. It's very fast and easy to use, since it's probably still small enough not to cache-miss very often. (Especially since the common entries will stay hot in cache; x86 code uses the same opcodes a lot.)
You can trade simplicity / performance for space.
You could have a 64-entry table of structs where one member is a pointer to a secondary table to be indexed with the low 2 bits. If the pointer is NULL, it means the instruction follows the pattern of add / and / xor / etc. where the low 2 bits tell you 8 bit vs. whatever the operand-size is and direction (r/m,reg or reg,r/m).
Your struct would also need entries for turning into other instructions when certain prefixes are present (e.g. rep nop is pause). Also, AVX VEX prefixes use what used to be an invalid encoding of another instruction. x86 is pretty crazy to decode if you want to do a complete job for all the current extensions.
Actually, it might be simplest (and also efficient) to just use a table of function pointers. Or a struct with a const char* mnemonic and a int (*decode)(const char*mnemonic, const char *insn_bytes, unsigned prefix_bitmap) function, so lots of opcodes can point to the same decode-function but still get different mnemonics. Sometimes the function will ignore the passed mnemonic, but other times that's all it needs. You'd have a common function for decoding addressing modes that many of the decode functions would call.
This is fairly similar to how you might implement an x86 emulator that interprets, instead of doing dynamic recompilation. A common decode loop and then dispatching through function pointers.
An even more complicated data structure you might use is a radix trie aka prefix tree. See also https://en.wikipedia.org/wiki/Trie#Bitwise_tries.
This is getting into silly season, because the density is so high that a lookup table makes much more sense. (There are very few undefined opcode).

Are both of these algorithms valid implementations of LZSS?

I am reverse engineering things and I often stumble upon various decompression algorithms. Most of times, it's LZSS just like Wikipedia describes it:
Initialize dictionary of size 2^n
While output is less than known output size:
Read flag
If the flag is set, output literal byte (and append it at the end of dictionary)
If the flag is not set:
Read length and look behind position
Transcribe length bytes from the dictionary at look behind position to the output and at the end of dictionary.
The thing is that the implementations follow two schools of how to encode the flag. The first one treats the input as sequence of bits:
(...)
Read flag as one bit
If it's set, read literal byte as 8 unaligned bits
If it's not set, read length and position as n and m unaligned bits
This involves lots of bit shift operations.
The other one saves a little CPU time by using bitwise operations only for flag storage, whereas literal bytes, length and position are derived from aligned input bytes. To achieve this, it breaks the linearity by fetching a few flags in advance. So the algorithm is modified like this:
(...)
Read 8 flags at once by reading one byte. For each of these 8 flags:
If it's set, read literal as aligned byte
If it's not set, read length and position as aligned bytes (deriving the specific values from the fetched bytes involves some bit operations, but it's nowhere as expensive as the first version.)
My question is: are these both valid LZSS implementations, or did I identify these algorithms wrong? Are there any known names for them?

They are effectively variants on LZSS, since all use one bit to decide on literal vs. match. More generally they are variants on LZ77.
Deflate is also a variant on LZ77, which does not use a whole bit for literal vs. match. Instead deflate has a single code for the combination of literals and lengths, so the code implicitly determines whether the next thing is a literal or a match. A length code is followed by a separate distance code.
lz4 (a specific algorithm, not a family) handles byte alignment in a different way, coding the number of literals, which is necessarily followed by a match. The first byte with the number of literals also has part of the distance. The literals are byte aligned, as is the offset that follows the literals and the rest of the distance.

Efficient Algorithm for Parsing OpCodes

Let's say I'm writing a virtual machine. I read in the program data into an array of bytes. Now I need to loop through those bytes (instructions are two bytes) and instantiate a little class representing each instruction and it's arguments.
What would be a fast parsing approach? Here are the two way's I've thought of:
Logically branching by inspecting each bit from the left to the right until I narrowed it down to a particular op code. This would be like a binary search.
Inspecting some programs to come up with a list of opcodes ordered by frequency of use, and then checking the for the full opcode in that order.
Note: I will be using bit shifting and masking in C to check, not regexes or string comps or anything high-level like that.

You don't need to parse anything. If this is in C, you make a table of function pointers which has 256 entries in it, one for each possible byte value, then jump to the appropriate function based on the first byte value. If the second byte is significant then a switch statement can be used within the function to handle the second byte. This is how the original Visual Basic interpreter (versions 1-6) worked.

optimizing byte-pair encoding

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of it.
The compression ratio - considering that there is no further processing, e.g. no Huffman or arithmetic encoding - is surprisingly good.
The runtime of my trivial implementation was less than stellar, however.
How can this be optimized? Is it possible to do it in a single pass?

This is a summary of my progress so far:
Googling found this little report that links to the original code and cites the source:
Philip Gage, titled 'A New Algorithm
for Data Compression', that appeared
in 'The C Users Journal' - February
1994 edition.
The links to the code on Dr Dobbs site are broken, but that webpage mirrors them.
That code uses a hash table to track the the used digraphs and their counts each pass over the buffer, so as to avoid recomputing fresh each pass.
My test data is enwik8 from the Hutter Prize.
|----------------|-----------------|
| Implementation | Time (min.secs) |
|----------------|-----------------|
| bpev2 | 1.24 | //The current version in the large text benchmark
| bpe_c | 1.07 | //The original version by Gage, using a hashtable
| bpev3 | 0.25 | //Uses a list, custom sort, less memcpy
|----------------|-----------------|
bpev3 creates a list of all digraphs; the blocks are 10KB in size, and there are typically 200 or so digraphs above the threshold (of 4, which is the smallest we can gain a byte by compressing); this list is sorted and the first subsitution is made.
As the substitutions are made, the statistics are updated; typically each pass there is only around 10 or 20 digraphs changed; these are 'painted' and sorted, and then merged with the digraph list; this is substantially faster than just always sorting the whole digraph list each pass, since the list is nearly sorted.
The original code moved between a 'tmp' and 'buf' byte buffers; bpev3 just swaps buffer pointers, which is worth about 10 seconds runtime alone.
Given the buffer swapping fix to bpev2 would bring the exhaustive search in line with the hashtable version; I think the hashtable is arguable value, and that a list is a better structure for this problem.
Its sill multi-pass though. And so its not a generally competitive algorithm.
If you look at the Large Text Compression Benchmark, the original bpe has been added. Because of it's larger blocksizes, it performs better than my bpe on on enwik9. Also, the performance gap between the hash-tables and my lists is much closer - I put that down to the march=PentiumPro that the LTCB uses.
There are of course occasions where it is suitable and used; Symbian use it for compressing pages in ROM images. I speculate that the 16-bit nature of Thumb binaries makes this a straightforward and rewarding approach; compression is done on a PC, and decompression is done on the device.

I've done work with optimizing a LZF compression implementation, and some of the same principles I used to improve performance are usable here.
To speed up performance on byte-pair encoding:
Limit the block size to less than 65kB (probably 8-16 kB will be optimal). This guarantees not all bytes will be used, and allows you to hold intermediate processing info in RAM.
Use a hashtable or simple lookup table by short integer (more RAM, but faster) to hold counts for a byte pairs. There are 65,656 2-byte pairs, and BlockSize instances possible (max blocksize 64k). This gives you a table of 128k possible outputs.
Allocate and reuse data structures capable of holding a full compression block, replacement table, byte-pair counts, and output bytes in memory. This sounds wasteful of RAM, but when you consider that your block size is small, it's worth it. Your data should be able to sit entirely in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by an integer (16-bit count, plus byte pair).

Code in JustBasic can be found here complete with input text file.
Just BASIC Files Archive – forum post
EBPE by TomC 02/2014 – Ehanced Byte Pair Encoding
EBPE features two post processes to Byte Pair Encoding
1. Is compressing the dictionary (believed to be a novelty)
A dictionary entry is composed of 3 bytes:
AA – the two char to be replaced by (byte pair)
1 – this single token (tokens are unused symbols)
So "AA1" tells us when decoding that every time we see a "1" in the
data file, replace it with "AA".
While long runs of sequential tokens are possible, let’s look at this
8 token example:
AA1BB3CC4DD5EE6FF7GG8HH9
It is 24 bytes long (8 * 3)
The token 2 is not in the file indicating that it was not an open token to
use, or another way to say it: the 2 was in the original data.
We can see the last 7 tokens 3,4,5,6,7,8,9 are sequential so any time we
see a sequential run of 4 tokens or more, let’s modify our dictionary to be:
AA1BB3<255>CCDDEEFFGGHH<255>
Where the <255> tells us that the tokens for the byte pairs are implied and
are incremented by 1 more than the last token we saw (3). We increment
by one until we see the next <255> indicating an end of run.
The original dictionary was 24 bytes,
The enhanced dictionary is 20 bytes.
I saved 175 bytes using this enhancement on a text file where tokens
128 to 254 would be in sequence as well as others in general, to include
the run created by lowercase pre-processing.
2. Is compressing the data file
Re-using rarely used characters as tokens is nothing new.
After using all of the symbols for compression (except for <255>),
we scan the file and find a single "j" in the file. Let this char do double
duty by:
"<255>j" means this is a literal "j"
"j" is now used as a token for re-compression,
If the j occurred 1 time in the data file, we would need to add 1 <255>
and a 3 byte dictionary entry, so we need to save more than 4 bytes in BPE
for this to be worth it.
If the j occurred 6 times we would need 6 <255> and a 3 byte dictionary
entry so we need to save more than 9 bytes in BPE for this to be worth it.
Depending on if further compression is possible and how many byte pairs remain
in the file, this post process has saved in excess of 100 bytes on test runs.
Note: When decompressing make sure not to decompress every "j".
One needs to look at the prior character to make sure it is not a <255> in order
to decompress. Finally, after all decompression, go ahead and remove the <255>'s
to recreate your original file.
3. What’s next in EBPE?
Unknown at this time

I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, if the new byte-pair (after-replacement) will be good for replacement too or not.
Here are my thoughts at first sight. Maybe you already do or have already thought all this.
I would try the following.
Two adjustable parameters:
Number of byte-pair occurrences in chunk of data before to consider replacing it. (So that the dictionary doesn't grow faster than the chunk shrinks.)
Number of replacements by pass before it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2 % left to gain.)
I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.
I would play with the two parameters a little and see how it influences compression ratio and speed. Probably that they should change dynamically, according to the length of the chunk to compress (and maybe one or two other things).
Another thing to consider is the data structure used to store the count of each byte-pair during the pass. There very likely is a way to write a custom one which would be faster than generic data structures.
Keep us posted if you try something and get interesting results!

Yes, keep us posted.
guarantee?
BobMcGee gives good advice.
However, I suspect that "Limit the block size to less than 65kB ... . This guarantees not all bytes will be used" is not always true.
I can generate a (highly artificial) binary file less than 1kB long that has a byte pair that repeats 10 times, but cannot be compressed at all with BPE because it uses all 256 bytes -- there are no free bytes that BPE can use to represent the frequent byte pair.
If we limit ourselves to 7 bit ASCII text, we have over 127 free bytes available, so all files that repeat a byte pair enough times can be compressed at least a little by BPE.
However, even then I can (artificially) generate a file that uses only the isgraph() ASCII characters and is less than 30kB long that eventually hits the "no free bytes" limit of BPE, even though there is still a byte pair remaining with over 4 repeats.
single pass
It seems like this algorithm can be slightly tweaked in order to do it in one pass.
Assuming 7 bit ASCII plaintext:
Scan over input text, remembering all pairs of bytes that we have seen in some sort of internal data structure, somehow counting the number of unique byte pairs we have seen so far, and copying each byte to the output (with high bit zero).
Whenever we encounter a repeat, emit a special byte that represents a byte pair (with high bit 1, so we don't confuse literal bytes with byte pairs).
Include in the internal list of byte "pairs" that special byte, so that the compressor can later emit some other special byte that represents this special byte plus a literal byte -- so the net effect of that other special byte is to represent a triplet.
As phkahler pointed out, that sounds practically the same as LZW.
EDIT:
Apparently the "no free bytes" limitation I mentioned above is not, after all, an inherent limitation of all byte pair compressors, since there exists at least one byte pair compressor without that limitation.
Have you seen
"SCZ - Simple Compression Utilities and Library"?
SCZ appears to be a kind of byte pair encoder.
SCZ apparently gives better compression than other byte pair compressors I've seen, because
SCZ doesn't have the "no free bytes" limitation I mentioned above.
If any byte pair BP repeats enough times in the plaintext (or, after a few rounds of iteration, the partially-compressed text),
SCZ can do byte-pair compression, even when the text already includes all 256 bytes.
(SCZ uses a special escape byte E in the compressed text, which indicates that the following byte is intended to represent itself literally, rather than expanded as a byte pair.
This allows some byte M in the compressed text to do double-duty:
The two bytes EM in the compressed text represent M in the plain text.
The byte M (without a preceeding escape byte) in the compressed text represents some byte pair BP in the plain text.
If some byte pair BP occurs many more times than M in the plaintext, then the space saved by representing each BP byte pair as the single byte M in the compressed data is more than the space "lost" by representing each M as the two bytes EM.)

You can also optimize the dictionary so that:
AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 token.
Rewrite that as:
AA1<255>BBCCDDEEFFGGHH<255> where the <255> tells the program that each of the following byte pairs (up to the next <255>) are sequential and incremented by one. Works great for text
files and any where there are at least 4 sequential tokens.
save 175 bytes on recent test.

Here is a new BPE(http://encode.ru/threads/1874-Alba).
Example for compile,
gcc -O1 alba.c -o alba.exe
It's faster than default.

There is an O(n) version of byte-pair encoding which I describe here. I am getting a compression speed of ~200kB/second in Java.

the easiest efficient structure is a 2 dimensional array like byte_pair(255,255). Drop the counts in there and modify as the file compresses.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio