Any byte sequences that can not be present in valid x86 code?
I'm looking for a byte sequence (or sequences) to inject into an x86 program compiled using GCC that cannot show up in the binary as a by-product of compilation.
The reason is that I want these byte sequences to act as "labels", so that I can recognize them later during inspection.
Is it possible to construct patterns of bytes so that, when searching through the binary, these patterns will not show up except with very small probability (I would prefer probability zero)? In other words, I want to minimize the number of false positives.
There are sequences that today are not a valid encoding of any instruction.
Rather than digging in the opcode table present in the Intel Manual 2 you can exploit two facts of the x86 architecture:
The maximum instruction length is 15 bytes.
You can repeat prefixes.
These should also be more stable across generations than reserved opcodes.
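The sequence 666666666666666666666666666666 (15 operand-size override prefixes, but any prefix will do) will raise an exception when executed because it is invalid: no instruction can be completed within the 15-byte limit. The Intel manual lists exceeding that limit under #GP rather than #UD.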
For what it's worth, there is a specific instruction that fulfills the role of invalid instruction: ud2.
Its presence in a binary module is possible, but it's more idiomatic than an invalid encoding and it is standard. For example, Linux uses it to mark a bug: if ud2 is in the execution flow, the code following it cannot be valid.
That said, if I got you right, that's not going to be useful to you.
You want to skip the process of decoding the instructions and scan the code section of the binary instead.
There is no guarantee that the code section will contain only code. For example, ARM compilers generate literal pools, although that's definitely uncommon on x86.
However, compilers usually align functions to a specific boundary (usually 16 bytes). This can be done in several ways, like stretching the previous function or inserting plain padding.
This padding can be a sequence of bytes of any value, hence arbitrary bytes can be present in the code section.
Long story short, there is no universal byte sequence that appears with probability zero in the code section.
Everything that is not in the execution flow can have any value.
We will deal with probability later; for now let's assume that 66..66h appears rarely enough in an executable.
You can't just use it directly, as 66..66h can be part of two instructions and thus be a valid sequence:
mov rax, 6666666666666666h
db 66h, 66h, 66h , 66h
db 66h, 66h, 66h
nop
is valid.
This is due to the immediate operands of instructions. The biggest immediate can be 8 bytes long (as of today), so the sequence must be lengthened to 15 + 8 = 23 bytes.
If you really want to be safe against future features, you can use a sequence of 14 + 15 = 29 bytes: up to 14 trailing bytes of a preceding maximum-length instruction, plus 15 bytes that no single instruction can contain.
It's possible to find 23/29 bytes of value 66h in the code section or in the whole binary.
But how probable is that?
If the bytes in a binary were uniformly random then the probability would be astronomically small: 256^-23 = 2^-184.
Well, the point is that the bytes in a binary are not uniformly random.
You can open a file with an embedded icon to confirm that.
You can make the probability arbitrarily small by stretching the sequence - it's up to you to find a compromise between the length and an acceptable number of false positives.
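As a rough illustration of the scanning step, here is a minimal Go sketch that looks for a run of identical bytes in a buffer. The function name findMarkers and the sample bytes are made up for the example; in practice the buffer would be the contents of the code section (or the whole file), and the run length is whatever compromise you settle on.

```go
package main

import (
	"bytes"
	"fmt"
)

// findMarkers returns the offset of every occurrence of a run of n bytes
// equal to b in data. Overlapping hits are reported, so a run longer than
// n yields several consecutive offsets.
func findMarkers(data []byte, b byte, n int) []int {
	if n <= 0 {
		return nil
	}
	marker := bytes.Repeat([]byte{b}, n)
	var hits []int
	for start := 0; ; {
		i := bytes.Index(data[start:], marker)
		if i < 0 {
			return hits
		}
		hits = append(hits, start+i)
		start += i + 1 // continue one byte past this hit
	}
}

func main() {
	// Hypothetical image: a few code bytes, a 23-byte marker, more code.
	image := []byte{0x55, 0x48, 0x89, 0xE5}
	image = append(image, bytes.Repeat([]byte{0x66}, 23)...)
	image = append(image, 0x90, 0xC3)

	fmt.Println(findMarkers(image, 0x66, 23)) // offsets of 23-byte runs
	fmt.Println(findMarkers(image, 0x66, 29)) // nothing that long here
}
```

A hit only says that the bytes are there, not that you put them there, which is why the length of the run matters.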
It's unclear what you want to do, but here is some advice:
Most, if not all, build tools support generating a map file.
It is a file with all the symbols/names and their addresses.
If you could use actual labels (with a prefix and a random suffix) you'd collect them easily after the build.
Most output formats can be enriched with meta-information.
You can add an ELF/PE section with a table of offsets to the locations you want to mark.
I was going through the Go tutorial on golang.org and I came across an example that I only partially understand...
MaxInt uint64 = 1<<64 - 1
Now I understand this to be shifting the bit 64 places to the left which would make it a 1 followed by 64 0's.
My question is: why is this the max integer that can be achieved in a 64-bit number? Wouldn't the max integer be 111111111... (up to the 64th 1) instead of 100000... (up to the 64th 0)?
What happens here, step by step:
Take 1.
Shift it to the left 64 bits. This is tricky. The result actually needs 65 bits for representation - namely 1 followed by 64 zeroes. Since we are calculating a 64 bit value here why does this even compile instead of overflowing to 0 or 1 or producing a compile error?
It works because the arithmetic used to calculate constants in Go is a bit magic (https://blog.golang.org/constants) in that it has nothing to do whatsoever with the type of the named constant being calculated. You can say foo uint8 = (1<<415) / (1<<414) and foo is now 2. The parentheses matter: << and / have the same precedence in Go, so without them the expression parses as ((1<<415)/1)<<414, which overflows.
Subtract 1. This brings us back into 64-bit numbers, as it's actually 11....1 (64 times), which is indeed the maximum value of uint64. Without this step, the compiler would complain about us trying to cram a 65-bit value into uint64.
Name the constant MaxInt and give it type uint64. Success!
The magic arithmetic used to calculate constants still has limitations (obviously). Shifts greater than 500 or so produce funny named stupid shift errors.
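The steps above can be checked with a small program. This is only a sketch: the constant names foo and MaxInt come from the text, and the division is parenthesized explicitly because << and / share a precedence level in Go.

```go
package main

import "fmt"

// Untyped constant expressions are evaluated exactly; only the final
// value has to fit the declared type.
const MaxInt uint64 = 1<<64 - 1

// The intermediate values need more than 400 bits, the result is just 2.
const foo uint8 = (1 << 415) / (1 << 414)

// Uncommenting the next line is a compile error, because the value
// 1<<64 itself does not fit in a uint64:
// const tooBig uint64 = 1 << 64

func main() {
	fmt.Println(MaxInt)
	fmt.Println(MaxInt == ^uint64(0)) // compare against "all 64 bits set"
	fmt.Println(foo)
}
```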
This is from the book Assembly Language Step By Step, Jeff Duntemann:
Here’s the quick tour: A bit is a single binary digit, 0 or 1. A byte
is 8 bits side by side. A word is 2 bytes side by side. A double word
is 2 words side by side. A quad word is 2 double words side by side.
And this is from the book Principles of Computer Organization and Assembly Language: Using the Java Virtual Machine, Patrick Juola:
For convenience, 8 bits are usually grouped into a single block,
conventionally called a byte. The next-largest named block of bits is
a word. The definition and size of a word are not absolute, but vary
from computer to computer. A word is the size of the most convenient
block of data for the computer to deal with.
So is a word 2 bytes (16 bits), or is it the most convenient block of data for the computer to deal with? (I am also not sure what this means..)
I'm not familiar with either of these books, but the second is closer to current reality. The first may be discussing a specific processor.
Processors have been made with quite a variety of word sizes, not always a multiple of 8.
The 8086 and 8087 processors used 16 bit words, and it's likely this is the machine the first author was writing about.
More recent processors commonly use 32 or 64 bit words.
In the 50's and 60's there were machines with word sizes that seem quite strange to us now, such as 4, 9 and 36. Since about the 70's word size has commonly been a power of 2 and a multiple of 8.
On x86/x64 processors, a byte is 8 bits, and there are 256 possible binary states in 8 bits, 0 thru 255. This is how the OS translates your keyboard key strokes into letters on the screen. When you press the 'A' key, the keyboard sends a binary signal equal to the number 97 to the computer, and the computer prints a lowercase 'a' on the screen. You can confirm this in any Windows text editing software by holding an ALT key, typing 97 on the NUMPAD, then releasing the ALT key. If you replace '97' with any number from 0 to 255, you will see the character associated with that number on the system's character code page printed on the screen.
If a character is 8 bits, or 1 byte, then a WORD must be at least 2 characters, so 16 bits or 2 bytes. Traditionally, you might think of a word as a varying number of characters, but in a computer, everything that is calculable is based on static rules. Besides, a computer doesn't know what letters and symbols are, it only knows how to count numbers. So, in computer language, if a WORD is equal to 2 characters, then a double-word, or DWORD, is 2 WORDs, which is the same as 4 characters or bytes, which is equal to 32 bits. Furthermore, a quad-word, or QWORD, is 2 DWORDs, same as 4 WORDs, 8 characters, or 64 bits.
Note that these terms are limited in function to the Windows API for developers, but may appear in other circumstances (e.g. the Linux dd command uses numerical suffixes to compound byte and block sizes, where c is 1 byte and w is 2 bytes).
The second quote is correct, the size of a word varies from computer to computer. The ARM NEON architecture is an example of an architecture with 32-bit words, where 64-bit quantities are referred to as "doublewords" and 128-bit quantities are referred to as "quadwords":
A NEON operand can be a vector or a scalar. A NEON vector can be a 64-bit doubleword vector or a 128-bit quadword vector.
Normally speaking, 16-bit words are only found on 16-bit systems, like the Amiga 500.
This is from the book Hackers: Heroes of the Computer Revolution by Steven Levy.
.. the memory had been reduced to 4096 "words" of eighteen bits each.
(A "bit" is a binary digit, either a 1 or 0. A series of binary
numbers is called a "word").
As the other answers suggest, a "word" does not seem to have a fixed length.
In addition to the other answers, a further example of the variability of word size (from one system to the next) is in the paper Smashing The Stack For Fun And Profit by Aleph One:
We must remember that memory can only be addressed in multiples of the
word size. A word in our case is 4 bytes, or 32 bits. So our 5 byte buffer
is really going to take 8 bytes (2 words) of memory, and our 10 byte buffer
is going to take 12 bytes (3 words) of memory.
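The rounding in that quote is easy to reproduce. A minimal sketch, assuming the 4-byte word from the paper (the helper name wordsNeeded is made up here):

```go
package main

import "fmt"

// wordsNeeded returns how many whole words of wordSize bytes are needed
// to hold n bytes (rounding up).
func wordsNeeded(n, wordSize int) int {
	return (n + wordSize - 1) / wordSize
}

func main() {
	const wordSize = 4 // the word size used in the quoted paper
	for _, n := range []int{5, 10} {
		w := wordsNeeded(n, wordSize)
		fmt.Printf("%d-byte buffer: %d words, %d bytes\n", n, w, w*wordSize)
	}
}
```

With a 2-byte or 8-byte word the same buffers would round to different sizes, which is the point about word size varying from system to system.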
"most convenient block of data" probably refers to the width (in bits) of the WORD, in correspondance to the system bus width, or whatever underlying "bandwidth" is available. On a 16 bit system, with WORD being defined as 16 bits wide, moving data around in chunks the size of a WORD will be the most efficient way. (On hardware or "system" level.)
With Java being more or less platform independent, it just defines a "WORD" as the next size from a "BYTE", meaning "full bandwidth". I guess any platform that's able to run Java will use 32 bits for a WORD.
Another instance of a book citing the variable length of the word is Operating System Concepts by Silberschatz, Galvin, Gagne, where the authors in Chapter 1, page 6 state:
A less common term is "word",
which is a given computer architecture's native storage unit. A word is
generally made up of one or more bytes. For example, a computer may have
instructions to move 64-bit (8-byte) words.
In this small code example:
__m128i twos = _mm_set_epi32(2,3,1,2);
__m128i foo = _mm_set_epi32(128,128,128,128);
__m128i shifted = _mm_srl_epi32(foo,twos);
"shifted" is full of zeroes, while I expect it to be full of four 32-bit integers with the values 32, 16, 64, and 32, respectively. Am I using the intrinsic wrong?
Yes, you are using it incorrectly. The second argument to _mm_srl_epi32() specifies the number of bits to shift the first argument by, but it isn't a vectored argument as you might expect, allowing you to shift each 32-bit integer by a different number of bits. Instead, only the low 64 bits of the 128-bit argument are used, as a single count; the same shift amount is applied to all 4 integers in the first argument. In your case, the lower 64 bits are 0x0000000100000002 (the two lowest elements, 1 and 2, since _mm_set_epi32 takes its arguments from highest element to lowest), which evaluates to a very large positive number. This results in all of the elements of foo getting flushed to zero as all of the bits are shifted out.
A good place to find all of the little details on every instruction out there is Intel's AVX Programmer's Reference. While the title may be somewhat of a misnomer, the document contains descriptions of all SSE/SSE2/.../AVX/AVX2 instructions and descriptions of their intrinsics available in Intel's C++ compiler (which are typically also available in gcc and others). Searching for _mm_srl_epi32 in the document yields a clear explanation on exactly what the instruction does.
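To make the behaviour concrete, here is a small model of the shift written in plain Go rather than with the real intrinsics. The helpers setEpi32, srlEpi32 and srlvEpi32 are stand-ins written for this sketch: they only mimic the documented semantics on a [4]uint32, with index 0 as the lowest lane. The per-lane variant models what the question expected, which is the behaviour AVX2 offers as _mm_srlv_epi32.

```go
package main

import "fmt"

// setEpi32 mirrors the argument order of _mm_set_epi32: the first
// argument lands in the highest lane, the last one in lane 0.
func setEpi32(e3, e2, e1, e0 uint32) [4]uint32 {
	return [4]uint32{e0, e1, e2, e3}
}

// srlEpi32 models _mm_srl_epi32: the low 64 bits of count are read as
// one integer and every lane is shifted right by that same amount.
func srlEpi32(a, count [4]uint32) [4]uint32 {
	n := uint64(count[1])<<32 | uint64(count[0])
	var out [4]uint32
	if n > 31 {
		return out // every bit is shifted out
	}
	for i := range a {
		out[i] = a[i] >> n
	}
	return out
}

// srlvEpi32 models a per-lane shift, where each lane has its own count.
func srlvEpi32(a, count [4]uint32) [4]uint32 {
	var out [4]uint32
	for i := range a {
		out[i] = a[i] >> count[i] // counts of 32 or more give 0 in Go
	}
	return out
}

func main() {
	twos := setEpi32(2, 3, 1, 2)
	foo := setEpi32(128, 128, 128, 128)

	fmt.Println(srlEpi32(foo, twos))                 // count is 0x100000002
	fmt.Println(srlEpi32(foo, setEpi32(0, 0, 0, 2))) // count is 2
	fmt.Println(srlvEpi32(foo, twos))                // per-lane counts
}
```

If a single shift amount is all you need, keeping the count in the lowest element and zeroing the rest is enough.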
The above have answers in blue
Which bits will be used for what? E.g. I know the 2 LSBs are the byte offset. Then which bits will be used for word select, block select and tag?
What I did was
2 LSB: Byte offset
1 Bit: Word select
4 Bit: Block select
Rest: Tag
But it appears to be wrong? For (a) I have
(b) is incomplete, I think, but it's already wrong
This part of your reasoning is incorrect:
2 LSB: Byte offset
1 Bit: Word select
4 Bit: Block select
Rest: Tag
Remember that the problem says the word size is 16 bits or two bytes, so only one address bit is used for byte offset within a word.
Again, this has absolutely nothing to do with MIPS which has a 32 or 64 bit word.
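For illustration, the corrected split can be written out as code. This is a sketch under assumptions: the one-bit byte offset follows from the 2-byte word, the word-select and block-select widths are simply the ones quoted from the question, and the 16-bit address width is made up here because the original problem statement is not shown.

```go
package main

import "fmt"

// Field widths in bits. byteBits follows from the 2-byte word; wordBits
// and blockBits are the widths quoted from the question (assumed to be
// right); the 16-bit address is an assumption made only for this sketch.
const (
	byteBits  = 1
	wordBits  = 1
	blockBits = 4
)

// fields holds the pieces of one address, most significant part first.
type fields struct {
	tag, block, word, byteOff uint16
}

// split peels the fields off the address starting from the low bits.
func split(addr uint16) fields {
	var f fields
	f.byteOff = addr & (1<<byteBits - 1)
	addr >>= byteBits
	f.word = addr & (1<<wordBits - 1)
	addr >>= wordBits
	f.block = addr & (1<<blockBits - 1)
	addr >>= blockBits
	f.tag = addr // whatever is left
	return f
}

func main() {
	f := split(0xABCD) // arbitrary example address
	fmt.Printf("tag=%#x block=%#x word=%d byte=%d\n",
		f.tag, f.block, f.word, f.byteOff)
}
```

Changing the three constants is enough to try a different word or block size.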