Debuggers place an int3 instruction at the breakpoint so that control goes to the debug exception handler. For int3, debuggers insert the byte 0xCC. Why don't they insert 0xCD 0x03, which also means int 3? What would happen if they inserted 0xCD 0x03 instead of 0xCC?
Intel's documentation for that instruction answers your question: the Description section in the vol. 2 manual entry for int / int3 says:
The INT3 instruction uses a one-byte opcode (CC) and is intended for calling the debug exception handler with a breakpoint exception (#BP). (This one-byte form is useful because it can replace the first byte of any instruction at which a breakpoint is desired, including other one-byte instructions, without overwriting other instructions.)
There are some 1-byte x86 instructions, like push reg, or, in 16/32-bit mode, inc/dec reg. If a 1-byte instruction is the last instruction before a branch target, overwriting it with a longer instruction would also corrupt the first byte of an instruction that can be reached and executed without ever hitting the breakpoint.
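As a concrete sketch of the hazard (hypothetical NASM-style 32-bit code; the bytes shown are the standard encodings):

inc eax             ; encodes as the single byte 40
target:
dec ebx             ; encodes as the single byte 4B
; Patching 'inc eax' with CC leaves 'target' intact (bytes: CC 4B).
; Patching it with CD 03 also overwrites the 4B at 'target', so a
; jump that lands there would execute a corrupted instruction.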
There are other differences in vm86 mode between the 2-byte CD 03 encoding of int 3 and the 1-byte int3 encoding, again documented right in the manual. (Presumably to make it easier to write debuggers that debug a vm86 guest from outside the vm86 environment.)
An interrupt generated by the INTO, INT3, or INT1 instruction differs from one generated by INT n in the following ways:
The normal IOPL checks do not occur in virtual-8086 mode. The interrupt is taken (without fault) with any IOPL value.
The interrupt redirection enabled by the virtual-8086 mode extensions (VME) does not occur. The interrupt is always handled by a protected-mode handler.
(These features do not pertain to CD03, the “normal” 2-byte opcode for INT 3. Intel and Microsoft assemblers will not generate the CD03 opcode from any mnemonic, but this opcode can be created by direct numeric code definition or by self-modifying code.)
BTW, in NASM/YASM you do need to use the int3 mnemonic to get the 1-byte encoding; int 3 assembles to CD 03.
GAS assembles int $3 the same as int3, to 0xCC.
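You can see the difference by assembling both forms and checking the listing (bytes on the left):

; NASM/YASM:
CC          int3        ; 1-byte breakpoint encoding
CD 03       int 3       ; "normal" 2-byte encoding
; GAS: both 'int3' and 'int $3' assemble to the single byte CC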
But of course debuggers are working with binary machine code, not asm source. The assembly-mnemonic stuff only applies to manually including a breakpoint in your asm source. (Yes, that's a thing you can do. Most debuggers will let you resume by skipping over that instruction.)
As far as I know, SW breakpoints work as follows:
The instruction the BP is set on is replaced by an int/trap instruction; the trap is then handled in a trap handler. On continue, the trap is replaced by the original instruction, the original instruction is executed in single-step mode, the PC then points to the next instruction, and the original instruction is replaced by the int/trap instruction again.
HW breakpoints work as follows, according to my understanding:
The address of the instruction the BP is set on is written into a HW-BP register. When the instruction is hit, i.e. the PC matches the address in the HW-BP register, the CPU raises an exception, which is likewise handled by a trap handler. But if the program then returns to the original instruction, the HW BP is still active and one is caught in an infinite loop.
How is that problem treated?
Is the HW BP disabled before continuing, with the original instruction also executed in single-step mode? Or is the original instruction executed before the trap handler is entered, so that the trap handler returns to the instruction after the original one? Or is there another mechanism?
In case of the Intel 64 and IA-32 ("x64/x86") architectures, this is the task of the Resume Flag (RF), bit 16 in EFLAGS. (Other processor architectures that support hardware breakpoints probably have a similar mechanism.)
See section 18.3.1.1 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B:
Because the debug exception for an instruction breakpoint is generated before the instruction is executed, if the instruction breakpoint is not removed by the exception handler, the processor will detect the instruction breakpoint again when the instruction is restarted and generate another debug exception. To prevent looping on an instruction breakpoint, the Intel 64 and IA-32 architectures provide the RF flag (resume flag) in the EFLAGS register (see Section 2.3, “System Flags and Fields in the EFLAGS Register,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). When the RF flag is set, the processor ignores instruction breakpoints.
[...]
The RF Flag is cleared at the start of the instruction after the check for code breakpoint, CS limit violation and FP exceptions.
[...]
If the RF flag in the EFLAGS image is set when the processor returns from the exception handler, it is copied into the RF flag in the EFLAGS register by IRETD/IRETQ or a task switch that causes the return. The processor then ignores instruction breakpoints for the duration of the next instruction. (Note that the POPF, POPFD, and IRET instructions do not transfer the RF image into the EFLAGS register.) Setting the RF flag does not prevent other types of debug-exception conditions (such as, I/O or data breakpoints) from being detected, nor does it prevent non-debug exceptions from being generated.
So, the debugger will set RF before returning from the exception handler so that instruction breakpoints are "muted" for one instruction, after which the flag is automatically cleared by the processor.
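As a sketch of the mechanism: assuming a 32-bit kernel-side handler with a plain interrupt stack frame (saved EIP, CS, EFLAGS; #DB pushes no error code), the handler could set RF in the saved EFLAGS image before returning. (Hypothetical fragment; a user-mode debugger would instead set the flag in the thread's saved context via the OS's thread-context API.)

; hypothetical 32-bit #DB handler fragment (NASM syntax)
; frame layout: [esp] = EIP, [esp+4] = CS, [esp+8] = EFLAGS
or dword [esp + 8], 1 << 16   ; set RF (bit 16) in the saved EFLAGS image
iretd                         ; IRETD copies RF back into EFLAGS, so the
                              ; restarted instruction ignores instruction
                              ; breakpoints exactly once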
Note that this is not a concern in the case of data breakpoints because these will fire after the instruction that triggered the read/write operation.
Recommendation: I find the slides of "Intermediate x86 Part 4" by Xeno Kovah to be helpful in understanding these things. He talks about various topics there but starts with debugging; this information in particular can be found on slides 12-13 (slide images by Xeno Kovah, CC BY-SA 3.0).
I am modelling a custom MOV instruction for the x86 architecture in the gem5 simulator. To test its implementation on the simulator, I need to compile my C code with inline assembly to create a binary. But since it is a custom instruction that has not been implemented in the GCC compiler, the compiler will throw an error. I know one way is to extend GCC to accept my custom x86 instruction, but I don't want to do that yet, as it is more time-consuming (I will do it afterwards).
As a temporary hack (just to check whether my implementation is worth it or not), I want to take an existing MOV instruction and change its underlying micro-ops in the simulator, so as to trick GCC into accepting my "custom" instruction and compiling it.
There are many types of MOV instruction in the x86 architecture reference. So my question is: which MOV instruction is the least used, so that I can edit its underlying micro-ops? Assume my workload only involves integers, i.e. it most probably won't be using the xmm and mmx registers, and that my instruction mirrors the implementation of an existing MOV instruction.
Your best bet is a regular mov with a prefix that GCC will never emit on its own, i.e. create a new mov encoding that includes a mandatory prefix in front of any other mov, like how lzcnt is rep bsr.
Or if you're modifying GCC and as, you can add a new mnemonic that just uses otherwise-invalid (in 64-bit mode) single byte opcodes for memory-source, memory-dest, and immediate-source versions of mov. AMD64 freed up several opcodes, including the BCD instructions like AAM, and push/pop most segment registers. (x86-64 can still mov to/from Sregs, but there's just 1 opcode per direction, not 2 per Sreg for push ds/pop ds etc.)
Assuming my workload just includes integers i.e. most probably wont be using the xmm and mmx registers
Bad assumption for XMM: GCC aggressively uses 16-byte movaps / movups instead of copying structs 4 or 8 bytes at a time. It's not at all rare to find vector mov instructions in scalar integer code, as part of the inline expansion of small known-length memcpy or struct/array init. Also, those mov instructions have at least 2-byte opcodes (SSE1 movaps is 0F 28), so a prefix in front of a plain mov is the same size as your idea would have been.
However, you're right about MMX regs. I don't think modern GCC will ever emit movq mm0, mm1 or use MMX at all, unless you use MMX intrinsics. Definitely not when targeting 64-bit code.
Also, mov to/from control regs (0F 20/22 /r) or debug registers (0F 21/23 /r) both use the mov mnemonic, but GCC will definitely never emit either on its own. They're only available with a GP register as the operand that isn't the debug or control register. So that's technically the answer to your title question, but probably not what you actually want.
GCC doesn't parse its inline asm template string, it just includes it in its asm text output to feed to the assembler after substituting for %number operands. So GCC itself is not an obstacle to emitting arbitrary asm text using inline asm.
And you can use .byte to emit arbitrary machine code.
Perhaps a good option would be to use a 0E byte as a prefix for your special mov encoding that you're going to make gem5 decode specially. 0E is push cs in 32-bit mode, invalid in 64-bit mode; GCC will never emit either.
Or just an F2 repne prefix; GCC will never emit repne in front of a mov opcode (where it doesn't apply), only in front of string instructions like movs. (F3 rep / repe means xrelease when used on a memory-destination instruction, so don't use that. https://www.felixcloutier.com/x86/xacquire:xrelease says that F2 repne is the xacquire prefix when used with locked instructions, which doesn't include mov to memory, so it will be silently ignored there.)
As usual, prefixes that don't apply have no documented behaviour, but in practice CPUs that don't understand a rep / repne ignore it. Some future CPU might understand it to mean something special, and that's exactly what you're doing with gem5.
Picking .byte 0x0e; instead of repne; might be a better choice if you want to guard against accidentally leaving these prefixes in a build you run on a real CPU. (It will #UD -> SIGILL in 64-bit mode, or usually crash from messing up the stack in 32-bit mode.) But if you do want to be able to run the exact same binary on a real CPU, with the same code alignment and everything, then an ignored REP prefix is ideal.
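For example, a hypothetical GAS fragment of the 0E-byte flavour (nothing here is defined by a real assembler or CPU; it only means something to your modified gem5 decoder):

.byte 0x0e              # push %cs in 32-bit mode, #UD in 64-bit mode
movl %esi, (%rdi)       # the ordinary mov your decoder treats specially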
Using a prefix in front of a standard mov instruction has the advantage of letting the assembler encode the operands for you:
template<class T>
void fancymov(T& dst, T src) {
    // fixme: imm -> mem needs a size suffix, defeating template
    // unless you use Intel-syntax where the operand includes "dword ptr"
    asm("repne; movl %1, %0"
#if 1
        : "=m"(dst)
        : "ri" (src)
#else
        : "=g,r"(dst)
        : "ri,rmi" (src)
#endif
        : // no clobbers
    );
}

void test(int *dst, long src) {
    fancymov(*dst, (int)src);
    fancymov(dst[1], 123);
}
(Multi-alternative constraints let the compiler pick either reg/mem destination or reg/mem source. In practice it prefers the register destination even when that will cost it another instruction to do its own store, so that sucks.)
On the Godbolt compiler explorer, for the version that only allows a memory-destination:
test(int*, long):
    repne; movl %esi, (%rdi)     # F2 89 37
    repne; movl $123, 4(%rdi)    # F2 C7 47 04 7B 00 00 00
    ret
If you wanted this to be usable for loads, I think you'd have to make 2 separate versions of the function and use the load version or store version manually, where appropriate, because GCC seems to want to use reg,reg whenever it can.
Or with the version allowing register outputs (or another version that returns the result as a T, see the Godbolt link):
test2(int*, long):
    repne; mov %esi, %esi
    repne; mov $123, %eax
    movl %esi, (%rdi)
    movl %eax, 4(%rdi)
    ret
The test platform is 32-bit Windows.
I use IDA Pro to disassemble a PE file, do some very tedious transform work, and re-assemble it into a new PE file.
But there is a difference between the re-assembled PE file and the original one when I use OllyDbg to debug the new PE file (although there is no difference in this part of the assembly file I transformed).
Here is part of the original one:
See that the
PUSH 8
PUSH 0
is correct.
Here is part of my new PE file:
See now the
PUSH 8
PUSH 0
is changed to
66:6A 08
66:6A 00
and this leads to the failure of the new PE's execution.
Basically, from what I have seen, it leads to an unbalanced stack.
So does anyone know what is wrong with this part? I don't see any difference in the assembly code I transformed...
Could anyone give me some help? Thank you!
66h is the operand-size override prefix. In 32-bit code, it switches the operand size to 16-bit from the default 32-bit. So what happens here is that the PUSH instruction pushes a 16-bit value on the stack instead of the 32-bit one, and the ESP is decremented by 2 instead of 4. That's why you get unbalanced stack after the call.
You should check your assembler's documentation to see how you can force 32-bit operand size for the PUSH imm instructions. Different assemblers use different conventions for that. For example, in NASM you'd probably use something like push dword 8.
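For instance, in NASM the two forms assemble like this in 32-bit code (bytes on the left):

6A 08       push dword 8    ; 32-bit push, ESP -= 4
66 6A 08    push word 8     ; 16-bit push, ESP -= 2 (the broken form)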
It is a "prefix" opcode byte: See http://wiki.osdev.org/X86-64_Instruction_Encoding#Legacy_Prefixes
0x66 means "operand size override". Your code is apparently operating in 32-bit mode; PUSH without the prefix will push a 32-bit value. With the prefix, the PUSH pushes a 16-bit value and decrements ESP by 2 instead of 4. (I write a lot of assembly code, and have never had need to do that.)
After much trial and error, I still have some trouble understanding why the assembly syntax used in my textbook caused so many issues when using Windows 8.
        .MODEL SMALL
        .586
        .STACK 100h
        .DATA
Message DB 'Hello, my name blank', 13, 10, '$'
        .CODE
Hello   PROC
        mov ax, #data
        mov ds, ax
        mov dx, OFFSET Message
        mov ah, 9h
        int 21h
        mov al, 0
        mov ah, 4ch
        int 21h
Hello   ENDP
        END Hello
At first I tried assembling the code with masm32, using the command prompt and the correct linker. Then I tried Visual Studio 2013 Ultimate; even using masm32 within Visual Studio, I got similar issues each time. The assembler had issues with the #data line and with there being no leading underscore for Hello. Fixing the latter only resulted in an issue with unmatched blocks.
I did find a workaround by using an MS-DOS virtual environment, and the code worked fine after removing the .586 directive.
I suspect the main issue was trying to run this code in an x64 OS environment, but I'm still learning the language, so I'd like to hear other opinions on why I couldn't get it to run initially.
The book we're using is Jones, Assembly Language for the IBM PC Family, 3rd edition.
You are using a 32-bit linker. You need to use the 16-bit linker, called link16, in masm32/bin to link the code.
e.g.
ml /c /Fl filename.asm
-then-
link16 filename.obj
The difference between the 16-bit and the 32-bit address modes is the default size of the operands/registers and addresses inside our code segment, and how the assembler uses the operand-size and address-size prefixes.
In the 16-bit address mode the default size is 16 bits, so if we want to use 32-bit registers/operands and/or 32-bit addresses there, our assembler has to place an operand-size and/or address-size prefix on every such 32-bit instruction. If we use only 16-bit instructions in the 16-bit address mode, we do not need those prefixes.
In the 32-bit address mode the default size is 32 bits, so if we use 32-bit registers/operands and/or 32-bit addresses there, our assembler does not have to place an operand-size and/or address-size prefix on those 32-bit instructions. (This is good for minimizing the code size if we mostly use 32-bit instructions.) But if we use 16-bit instructions in the 32-bit address mode, our assembler has to place the operand-size and/or address-size prefixes.
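As an illustration, here is the same register-to-register mov assembled for each mode (bytes on the left):

; mov eax, ebx (32-bit operands)
;   in 32-bit code: 89 D8        ; no prefix needed
;   in 16-bit code: 66 89 D8     ; operand-size prefix 66h required
; mov ax, bx (16-bit operands)
;   in 16-bit code: 89 D8        ; no prefix needed
;   in 32-bit code: 66 89 D8     ; operand-size prefix 66h required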
Additionally, there are two assembler directives (use16 and use32) to specify which address mode the code is written for, if we want to have different parts of code for both address modes.
Besides the two address modes, there is also a large difference between real mode and protected mode.
In real mode combined with the 16-bit address mode (the default at startup) we get a default segment size of 64 KB, and every address is calculated from the segment part in a segment register together with the offset part. In protected mode we have to use global and/or local descriptor tables to specify the size of each segment that we want to use.
Lastly, the architecture of the underlying operating system dictates the target for which we have to assemble our code, and which software interrupts are available for us to use.
Dirk
I would like to test a buffer overflow by writing "Hello World" to the console (using Windows XP 32-bit). The shellcode needs to be null-free in order to be passed by scanf into the program I want to overflow. I've found plenty of assembly tutorials for Linux, but none for Windows. Could someone please step me through this using NASM? Thanks!
Assembly opcodes are the same, so the regular tricks to produce null-free shellcodes still apply, but the way to make system calls is different.
In Linux you make system calls with the "int 0x80" instruction, while on Windows you must use DLL libraries and do normal usermode calls to their exported functions.
For that reason, on Windows your shellcode must either:
Hardcode the Win32 API function addresses (most likely will only work on your machine)
Use a Win32 API resolver shellcode (works on every Windows version)
If you're just learning, for now it's probably easier to just hardcode the addresses you see in the debugger. To make the calls position independent you can load the addresses in registers. For example, a call to a function with 4 arguments:
PUSH 4 ; argument #4 to the function
PUSH 3 ; argument #3 to the function
PUSH 2 ; argument #2 to the function
PUSH 1 ; argument #1 to the function
MOV EAX, 0xDEADBEEF ; put the address of the function to call
CALL EAX
Note that the argument are pushed in reverse order. After the CALL instruction EAX contains the return value, and the stack will be just like it was before (i.e. the function pops its own arguments). The ECX and EDX registers may contain garbage, so don't rely on them keeping their values after the call.
A direct CALL instruction won't work, because those are position dependent.
To avoid zeros in the address itself, try any of the null-free tricks for x86 shellcode; there are many out there, but my favorite (albeit lengthy) is encoding the values using XOR instructions:
MOV EAX, 0xDEADBEEF ^ 0xFFFFFFFF ; your value xor'ed against an arbitrary mask
XOR EAX, 0xFFFFFFFF ; the arbitrary mask
You can also try NEG EAX or NOT EAX (sign inversion and bit flipping) to see if they work; they're much cheaper (two bytes each).
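For example, sticking with 0xDEADBEEF as the stand-in address (a hypothetical sketch; the bytes in the comments are the encodings, none of them zero):

MOV EAX, 0x21524111   ; B8 11 41 52 21 (two's complement of 0xDEADBEEF)
NEG EAX               ; F7 D8 -> EAX = 0xDEADBEEF

MOV EAX, 0x21524110   ; B8 10 41 52 21 (bitwise NOT of 0xDEADBEEF)
NOT EAX               ; F7 D0 -> EAX = 0xDEADBEEF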
You can get help on the different API functions you can call here: http://msdn.microsoft.com
The most important ones you'll need are probably the following:
WinExec(): http://msdn.microsoft.com/en-us/library/ms687393(VS.85).aspx
LoadLibrary(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress(): http://msdn.microsoft.com/en-us/library/ms683212%28v=VS.85%29.aspx
The first launches a command; the next two are for loading DLL files and getting the addresses of their functions.
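Tying it together, here is a hypothetical null-free fragment that calls WinExec("calc", SW_SHOWNORMAL); the 0xDEADBEEF placeholder stands for the WinExec address you'd read out of your debugger (which, as above, may itself need the XOR trick):

XOR EAX, EAX          ; EAX = 0 without a zero byte in the encoding
PUSH EAX              ; NUL terminator for the string
PUSH 0x636C6163       ; "calc", little-endian
MOV EBX, ESP          ; EBX -> "calc"
PUSH 1                ; uCmdShow = SW_SHOWNORMAL
PUSH EBX              ; lpCmdLine
MOV EAX, 0xDEADBEEF   ; placeholder: address of WinExec
CALL EAX              ; stdcall: WinExec pops its own arguments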
Here's a complete tutorial on writing Windows shellcodes: http://www.codeproject.com/Articles/325776/The-Art-of-Win32-Shellcoding
Assembly language is defined by your processor, and assembly syntax is defined by the assembler (hence AT&T and Intel syntax). The main environmental difference is that MS-DOS runs in real mode (you call the actual interrupts to do stuff, and you can use all the memory accessible to your computer, instead of just your program's), while Linux and Windows run in protected mode (you only have access to the memory in your program's little cubby, and you make calls to the kernel, e.g. int 0x80 on Linux, instead of making calls to the hardware and BIOS). Anyway, hello-world-type stuff would be more or less the same between Linux and Windows, as long as they are compatible processors.
To get the shellcode from the program you've made, load it into your target system's debugger (gdb for Linux, debug for DOS/Windows) and unassemble it (in debug, the u command; type h for help); the opcodes are listed between the memory addresses and the instructions.
Copy them all over to your text editor into one string, and maybe make a program that translates them into escaped byte values. Not sure how to do this in gdb, though...
Anyway, to make it into a buffer-overflow exploit, enter aaaaa... and keep adding a's until the program crashes with a buffer overflow error, and find exactly how many a's it takes to crash it. The error message should tell you what memory address that was; if it says '9797[rest of original return address]' then you've got it (97 is the ASCII code for 'a'). Now you have to use your debugger to find out where this was: disassemble the program, look for where scanf was called, set a breakpoint there, run, and examine the stack. Look for all those 97's and see where they end. Then remove the breakpoint and enter exactly the number of a's you found it took (if the error message was 'buffer overflow at 97[rest of original return address]', remove that last a), put in the address you found while examining the stack, and append your shellcode. If all goes well, you should see your shellcode execute.
Happy hacking...