just a quick question while reading:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Mips/format.html
Under I-type Instruction, it says:
"In this case, $rt is the destination register, and $rs is the only source register. It is unusual that $rd is not used, and that $rd does not appear in bit positions B25-21 for both R-type and I-type instructions. Presumably, the designers of the MIPS ISA had their reasons for not making the destination register at a particular location for R-type and I-type. "
Can someone explain what the MIPS ISA designers' reasons are?
Related
In the homework for day one of Xeno Kovah's Introduction to x86 Assembly hosted on OpenSecurityTraining, he assigns,
Instructions we now know (24)
NOP
PUSH/POP
CALL/RET
MOV/LEA
ADD/SUB
JMP/Jcc
CMP/TEST
AND/OR/XOR/NOT
SHR/SHL
IMUL/DIV
REP STOS, REP MOV
LEAVE
Write a program to find an instruction we havenʼt covered, and report the
instruction tomorrow.
He further qualifies the assignment:
Instructions to be covered later which donʼt count: SAL/SAR
Variations on jumps or the MUL/IDIV variants of IMUL/DIV also don't count
Additional off-limits instructions: anything floating point (since we're not covering those in this class.)
He says in the video (when asked) that you cannot use inline assembly.
Rather than objdumping random executables and auditing them, then creating the source, is it possible to find the list of x86 assembly instructions that GCC currently outputs?
The foundation for this question seems to be that there is a very small subset of instructions actually used that one needs to know to reverse engineer (which is the focus of the course). Xeno seems to be trying to find a fun, instructive way to make that point:
I think that knowing about 20-30 (not counting variations) is good enough that you will have to check the manual very infrequently
While I welcome everyone to join me in this awesome class at OpenSecurityTraining, the question is about my proposed method of figuring it out from GCC (if possible), not for people to actually do Xeno's assignment. ;)
The foundation for this question seems to be that there is a very small subset of instructions actually used that one needs to know to reverse engineer
Yes, that's generally true. There are some instructions gcc will never emit, like enter (because it's much slower than push rbp / mov rbp, rsp / sub rsp, some_constant on modern CPUs).
Other old / obscure stuff like xlat and loop will also be unused because they aren't faster, and gcc's -Os doesn't go all-out optimizing for size without caring about performance. (clang -Oz is more aggressive, but IDK if anyone's bothered to teach it about the loop instruction.)
And of course gcc will never emit privileged instructions like wrmsr. There are intrinsics (__builtin_... functions) for some unprivileged instructions like rdtsc or cpuid which aren't "normal".
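For instance, here's a minimal sketch (function name hypothetical) using the __rdtsc() intrinsic from x86intrin.h, which gcc compiles to the rdtsc instruction:
#include <x86intrin.h>

unsigned long long timestamp(void) {
    return __rdtsc();   // reads the time-stamp counter; gcc never picks this on its own
}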
is it possible to find the list of x86 assembly instructions that GCC currently outputs?
This would be the gcc machine-definition files. GCC, as a portable compiler, has its own text-based language for machine-definition files, which describe the instruction set to the compiler (what each instruction does, what addressing modes it can use, and some kind of "cost" the optimizer can minimize).
See the gcc-internals documentation for them.
The other approach to this question would be to look at an x86 instruction reference manual (e.g. this HTML extract, and see other links in the x86 tag wiki) and look for ones you haven't seen yet. Then write a function where gcc would find it useful.
e.g. if you haven't seen movsx (sign extension) yet, then write
long long foo(int x) { return x; }
and gcc -O3 will emit (from the Godbolt compiler explorer)
movsx rax, edi
ret
Or to get cdqe (aka cltq in AT&T syntax) for sign-extension within rax, force gcc to do math before sign extending, so it can produce the result in eax first (with a copy-and-add lea).
long long bar(unsigned x) { return (int)(x+1); }
lea eax, [rdi+1]
cdqe
ret
# clang chooses inc edi / movsxd rax, edi
See also Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”, and How to remove "noise" from GCC/clang assembly output?.
Getting gcc to emit rotate instructions is interesting: see Best practices for circular shift (rotate) operations in C++. You write the rotate as shifts/ORs in a pattern that gcc can recognize as a rotate.
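As a minimal sketch of that pattern (function name hypothetical; the masking keeps both shift counts in range so there's no undefined behaviour):
unsigned rotl32(unsigned x, unsigned n) {
    n &= 31;                              // keep the rotate count in range
    return (x << n) | (x >> (-n & 31));   // gcc/clang recognize this and emit a single rol
}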
Because C doesn't provide standard functions for lots of things modern CPUs can do (rotate, popcnt, count leading / trailing zeros), the only portable option is to write an equivalent function and have the compiler recognize that pattern. gcc and clang can optimize a whole loop into a single popcnt instruction when compiling with -mpopcnt (enabled by -march=haswell, for example), if you're lucky. If not, you get a stupid slow loop. The reliable non-portable way is to use __builtin_popcount(), which compiles to a popcnt instruction if the target supports it, otherwise a table lookup. _mm_popcnt_u64 is popcnt or nothing: it doesn't compile if the target doesn't support the instruction.
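A minimal sketch of the contrast (function names hypothetical):
// Portable: with -mpopcnt, gcc/clang may collapse this loop into one popcnt instruction.
int popcount_loop(unsigned x) {
    int n = 0;
    for (; x != 0; x &= x - 1)   // clears the lowest set bit each iteration
        n++;
    return n;
}

// Non-portable but reliable: popcnt if the target has it, a fallback otherwise.
int popcount_builtin(unsigned x) {
    return __builtin_popcount(x);
}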
Of course, the catch-22 with this approach is that it only works if you already know the x86 instruction set and when any given instruction is the right choice for an optimizing compiler!
(And what gcc chooses to do, e.g. inline string compares to rep cmpsb in some cases for short strings, although I'm not sure this is optimal. Only rep movs / rep stos have "fast strings" support on modern CPUs. But I don't think gcc will ever use lods, or any of the "string" instructions without a rep prefix.)
Rather than objdumping random executables and auditing them, then creating the source, is it possible to find the list of x86 assembly instructions that GCC currently outputs?
You can look at the machine description files that gcc uses. In its source tree, look under gcc/config/i386 at the .md files. The core one for x86 is i386.md; there are others for the various extensions to x86 (possibly also containing tuning heuristics used when optimizing for different processors).
Be warned: it's definitely not an easy read.
I think that knowing about 20-30 (not counting variations) is good enough that you will have to check the manual very infrequently
It's quite true; in my experience doing reverse engineering, 99% of code is always the same stuff, instruction-wise; what is more useful than knowing the entire x86 instruction set is to get familiar with the assembly idioms, especially those frequently emitted by compilers.
That being said, off the top of my head, some very common instructions missing from that list (emitted quite often, and without enabling extended instruction sets) are:
movzx/movsx
inc/dec (rare with gcc, common with VC++)
neg
cdq (before idiv)
jcxz/jecxz (rare with gcc, somewhat common with VC++)
setCC
cmpxchg (in synchronization code)
cmovCC
adc (when doing 64 bit arithmetic in 32 bit code; see the sketch below this list)
int3 (often emitted on function boundaries and in general as a filler)
some other string instructions (scas/cmps), especially as canned sequences on older compilers
And then there's the whole world of SSE & co...
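For instance, here's a minimal sketch (function name hypothetical) that gets gcc -m32 -O2 to emit an add/adc pair, because a 64-bit sum in 32-bit code needs the carry from the low halves:
unsigned long long add64(unsigned long long a, unsigned long long b) {
    return a + b;   // low halves with add, high halves with adc
}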
Extensive searching has sent me in a loop over the course of 3 days, so I'm depending on you guys to help me catch a break.
Why exactly does one 8-bit sequence of highs and lows perform this action, while another 8-bit sequence performs that action?
My intuition tells me that the CPU's circuitry hard-wires one binary sequence to do one thing, and another to do another thing. Wouldn't that mean different processors, with potentially different chip circuitry, might not define one particular binary sequence as the same action?
Is this why we have assembly? I need someone to confirm and/or correct my hypothesis!
Opcodes are not always 8 bits, but yes, it is hardcoded/wired in the logic to isolate the opcode and then send you down a course of action based on that. Think about how you would do it in an instruction set simulator (a minimal sketch follows below); why would logic be any different? Logic is simpler than software languages; there is no magic there. ONE, ZERO, AND, OR, NOT: that's as complicated as it gets.
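Here is that sketch in C; the 16-bit encoding (4-bit opcode, two 4-bit register fields, 4-bit immediate) is invented purely for illustration:
#include <stdint.h>

// Decode one instruction of a made-up ISA, then dispatch on the opcode.
void step(uint16_t instr, uint16_t regs[16]) {
    uint16_t opcode = instr >> 12;        // isolate the opcode field
    uint16_t rd     = (instr >> 8) & 0xF; // destination register field
    uint16_t rs     = (instr >> 4) & 0xF; // source register field
    uint16_t imm    = instr & 0xF;        // small immediate
    switch (opcode) {                     // hardware does this dispatch with gates
        case 0x1: regs[rd] = regs[rs] + imm;       break; // ADDI
        case 0x2: regs[rd] = regs[rs] & regs[imm]; break; // AND
        // ... one case per opcode ...
    }
}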
Along the same lines: if I were given an instruction set document and you were given the same document, and we were each told to create a processor or write an instruction set simulator, would we produce the exact same code, even if the variable names were different? No. Ideally we would have programs that are functionally the same: they both parse the instruction and execute it. Logic is no different. Give the spec to two engineers and you might get two different processors that are functionally the same; one might perform better, etc. Look at the long-running processor families, x86 in particular: they re-invent it every couple of years, staying instruction-set compatible for the legacy instructions while sometimes adding new instructions. Same for ARM and others.
And there are different instruction sets: ARM is different from x86, which is different from MIPS. The opcodes and/or the bits you examine in the instruction vary. For none of these can you simply look at 8 bits: in each you have some bits, and if those are not enough to uniquely identify the instruction/operation, you examine some more bits; where those bits are and what the rules are is very specific to each architecture. Otherwise, what would be the point of having different names for them if they were all the same?
And this information is out there; you just didn't look in the right places. There are countless open online courses on the topic, books that Google should find some pages from, as well as open-source processor cores you can look at and countless instruction set simulators with source code.
How do I know whether MASM encodes my JMP instruction using a relative or absolute offset?
I know that x86 provides JMP opcodes for relative and absolute offsets.
I want to be certain that my jumps are relative, however I cannot find any proper MASM documentation telling me whether JMP #label actually translates into a relative jump.
Please, if possible, give a link to documentation in the answer.
For the opposite issue:
See How to code a far absolute JMP/CALL instruction in MASM? if you're trying to get MASM to emit a direct absolute far jmp
The only machine encodings for direct near jmps use relative displacements, not absolute. See also the x86 tag wiki for more links to docs / manuals, including the official MASM manual. I don't think there's any point wading through it for this issue, though.
There is no near jmp encoding that takes an absolute address as an immediate, so there's no risk of an assembler ever using a non-PIC branch unexpectedly.
An absolute near jump would require the address in a register or memory operand, so MASM can't just assemble jmp target into
section .rodata
pointer_to_target dq target ; or dd for a 32bit pointer
section .text
jmp [pointer_to_target]
because that would be ridiculous. IDK if you'd ever find documentation stating that this specific piece of insanity is specifically impossible.
The only way I could imagine an assembler doing anything like this for you is if there was some kind of "Huge code model" where all jump targets were referenced with 64 bit pointers instead of 32 bit relative displacements.
But AFAIK, if you want to do that, you have to do it yourself in all existing assemblers.
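e.g. a minimal sketch of doing it yourself in MASM syntax (label name hypothetical):
mov eax, OFFSET target   ; load the absolute address into a register
jmp eax                  ; indirect near jmp: absolute, not relative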
Documentation gets bloated enough without mentioning every weird thing that a program doesn't do, so I wouldn't expect to find anything specific about this. This is probably one of those things that's assumed to be obvious and goes without saying. (Or that the syntax matches what Intel uses in their instruction reference manual.)
As Jester says, you won't get a far jump without asking for it, so there's no risk of any assembler using the JMP ptr16:32 encoding (EA imm16(segment) imm32(offset)) for a normal jmp.
While looking at the atmel 8-bit AVR instruction set ( http://www.atmel.com/Images/doc0856.pdf ) I found the instruction format quite complex. A lot of instructions have different bit fields, where bits of operands/opcode are at different places in the instruction: why is that so? Isn't it more difficult for the decode unit to actually decode the opcode and operands with this format?
I think it largely comes from considerations of compatibility.
At some stage an instruction gets more powerful, and the additional option is encoded in whatever bits are free, so that the old instruction word still invokes the old behaviour.
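AVR's LDI is a concrete example of such a scattered field: it is encoded as 1110 KKKK dddd KKKK, with the 8-bit immediate split around the register bits. A minimal decode sketch in C (function name hypothetical):
#include <stdint.h>

// LDI Rd, K for d = 16..31: bits [11:8] and [3:0] hold K, bits [7:4] hold d-16.
void decode_ldi(uint16_t instr, uint8_t *d, uint8_t *k) {
    *d = 16 + ((instr >> 4) & 0x0F);               // register field
    *k = ((instr >> 4) & 0xF0) | (instr & 0x0F);   // stitch the split immediate together
}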
I'm a beginner in assembly language and have noticed that the x86 code emitted by compilers usually keeps the frame pointer around even in release/optimized mode when it could use the EBP register for something else.
I understand why the frame pointer might make code easier to debug, and might be necessary if alloca() is called within a function. However, x86 has very few registers and using two of them to hold the location of the stack frame when one would suffice just doesn't make sense to me. Why is omitting the frame pointer considered a bad idea even in optimized/release builds?
The frame pointer is a reference pointer that lets a debugger know where a local variable or an argument is with a single constant offset. Although ESP's value changes over the course of execution, EBP remains the same, making it possible to reach the same variable at the same offset (for example, the first parameter will always be at EBP+8, while ESP offsets can change significantly since you'll be pushing/popping things).
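A minimal sketch of a typical 32-bit prologue makes those offsets concrete:
push ebp            ; save the caller's frame pointer
mov  ebp, esp       ; anchor this frame: EBP stays put from here on
sub  esp, 8         ; reserve locals (ESP keeps moving below this point)
mov  eax, [ebp+8]   ; first argument: always at EBP+8
mov  [ebp-4], eax   ; first local: always at EBP-4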
Why don't compilers throw away the frame pointer? Because with a frame pointer, the debugger can figure out where local variables and arguments are using the symbol table, since they are guaranteed to be at a constant offset from EBP. Otherwise there isn't an easy way to figure out where a local variable is at any given point in the code.
As Greg mentioned, it also helps stack unwinding for a debugger, since EBP provides a reverse linked list of stack frames, letting the debugger figure out the size of the stack frame (local variables + arguments) of the function.
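That reverse linked list is concrete enough to walk in code. A hedged sketch for 32-bit x86 (struct and function names are hypothetical, and it assumes every frame in the chain was built with push ebp / mov ebp, esp):
#include <stdio.h>

// Each frame starts with the saved caller EBP; the return address sits just above it.
struct frame {
    struct frame *prev;   // saved EBP = pointer to the caller's frame
    void *ret_addr;       // return address pushed by the call
};

void backtrace_from(struct frame *fp) {
    while (fp) {                                  // a NULL saved EBP ends the chain
        printf("return address: %p\n", fp->ret_addr);
        fp = fp->prev;                            // step to the caller's frame
    }
}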
Most compilers provide an option to omit frame pointers, although it makes debugging really hard. That option should never be used globally, even in release code: you never know when you'll need to debug a user's crash.
Just adding my two cents to already good answers.
It's part of a good language architecture to have a chain of stack frames. The BP points to the current frame, where subroutine-local variables are stored. (Locals are at negative offsets, and arguments are at positive offsets.)
The idea that it is preventing a perfectly good register from being used in optimization raises the question: when and where is optimization actually worthwhile?
Optimization is only worthwhile in tight loops that 1) do not call functions, 2) where the program counter spends a significant fraction of its time, and 3) in code the compiler actually will ever see (i.e. non-library functions). This is usually a very small fraction of the overall code, especially in large systems.
Other code can be twisted and squeezed to get rid of cycles, and it simply won't matter, because the program counter is practically never there.
I know you didn't ask this, but in my experience, 99% of performance problems have nothing at all to do with compiler optimization. They have everything to do with over-design.
It depends on the compiler, certainly. I've seen optimized code emitted by x86 compilers that freely uses the EBP register as a general purpose register. (I don't recall which compiler I noticed that with, though.)
Compilers may also choose to maintain the EBP register to assist with stack unwinding during exception handling, but again this depends on the precise compiler implementation.
However, x86 has very few registers
This is true only in the sense that opcodes can only address 8 registers. The processor itself will actually have many more registers than that and use register renaming, pipelining, speculative execution, and other processor buzzwords to get around that limit. Wikipedia has a good introductory paragraph as to what an x86 processor can do to overcome the register limit: http://en.wikipedia.org/wiki/X86#Current_implementations.
Using stack frames has gotten incredibly cheap in any hardware even remotely modern. If you have cheap stack frames then saving a couple of registers isn't as important. I'm sure fast stack frames vs. more registers was an engineering trade-off, and fast stack frames won.
How much are you saving going pure register? Is it worth it?