Nop Sled, can you explain it to me? - stack-overflow

I have been reading this book: Hacking, the art of exploitation
On page 140, the book explains the Nop Slide:
We’ll create a large array (or sled) of these NOP instructions and place it
before the shellcode; then, if the EIP register points to any address found in
the NOP sled, it will increment while executing each NOP instruction, one at
a time, until it finally reaches the shellcode. This means that as long as the
return address is overwritten with any address found in the NOP sled, the EIP
register will slide down the sled to the shellcode, which will execute properly.
But with this technique, we would overwrite the return address with 0x90,wouldn't we?. EIP will go to 0x90, causing a segfault.
So, can you explain this technique to me clearly? Thanks :)

No, you'll not rewrite return address with NOP sleds. Once you get the right offset, you have to rewrite return address with address, what points somewhere into your NOP instructions. And because NOP sled is placed before your shellcode, it will just slide down and execute your shellcode. So that 60 bytes long NOP sled is doing nothing.
It's because (you can find everything about it, in that book):
NOP is an assembly instruction that is short for 'no operation'. It is
a single-byte instruction that does absolutely nothing.

to add to #Yeez answer
a nop sled (as stated, no operation) is called a sled because; it kinda looks like one
[nop sleds][shellcode]
so if you "land" in a nop (aka x90, or the cliche also works xchg eax, eax (both of these are NOP's))
it will just "slide down" to the shellcode
if we happen to get here, on the 2nd nop:
[nop] [nop]; the last one
then we will just 'tick/sled/slide/crawl' along it until we are at our shellcode (by which time we execute that shellcode)
[nop] [nop] [nop] [shellcode]
^ go ^ go ^ go ^ execute
Testing that in RASM2
(radare2's tool, like msf-nasm, if you are used to that)
rasm2 -a x86 -b 32 "xchg eax, eax"

Related

x86 assembler pushf causes program to exit

I think my real problem is I don't completely understand the stack frame mechanism so I am looking to understand why the following code causes the program execution to resume at the end of the application.
This code is called from a C function which is several call levels deep and the pushf causes program execution to revert back several levels through the stack and completely exit the program.
Since my work around works as expected I would like to know why using the pushf instruction appears to be (I assume) corrupting the stack.
In the routines I usually setup and clean up the stack with :
sub rsp, 28h
...
add rsp, 28h
However I noticed that this is only necessary when the assembly code calls a C function.
So I tried removing this from both routines but it made no difference. SaveFlagsCmb is an assembly function but could easily be a macro.
The code represents an emulated 6809 CPU Rora (Rotate Right Register A).
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
sub rsp, 28h
; Restore Flags
mov cx, word ptr [x86flags]
push cx
popf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
pushf ; this causes jump to end of program ????
pop dx ; this line never reached
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
add rsp, 28h
ret
Rora_I_A ENDP
However if I use this code it works as expected:
PUBLIC Rora_I_A ; Op 46 - Rotate Right through Carry A reg
Rora_I_A PROC
; sub rsp, 28h ; works with or without this!!!
; Restore Flags
mov ah, byte ptr [x86flags+LSB]
sahf
; Rotate Right the byte and save the FLAGS
rcr byte ptr [q_s+AREG], 1
; rcr only affects Carry. Save the Carry first in dx then
; add 0 to result to trigger Zero and Sign/Neg flags
lahf
mov dl, ah
and dx, CF ; Save only Carry Flag
add [q_s+AREG], 0 ; trigger NZ flags
mov rcx, NF+ZF+CF ; Flag Mask NZ
Call SaveFlagsCmb ; NZ from the add and CF saved in dx
; add rsp, 28h ; works with or without this!!!
ret
Rora_I_A ENDP
Your reported behaviour doesn't really make sense. Mostly this answer is just providing some background not a real answer, and a suggestion not to use pushf/popf in the first place for performance reasons.
Make sure your debugging tools work properly and aren't being fooled by something into falsely showing a "jump" to somewhere. (And jump where exactly?)
There's little reason to mess around with 16-bit operand size, but that's probably not your problem.
In Visual Studio / MASM, apparently (according to OP's comment) pushf assembles as pushfw, 66 9C which pushes 2 bytes. Presumably popf also assembles as popfw, only popping 2 bytes into FLAGS instead of the normal 8 bytes into RFLAGS. Other assemblers are different.1
So your code should work. Unless you're accidentally setting some other bit in FLAGS that breaks execution? There are bits in EFLAGS/RFLAGS other than condition codes, including the single-step TF Trap Flag: debug exception after every instruction.
We know you're in 64-bit mode, not 32-bit compat mode, otherwise rsp wouldn't be a valid register. And running 64-bit machine code in 32-bit mode wouldn't explain your observations either.
I'm not sure how that would explain pushf being a jump to anywhere. pushf itself can't fault or jump, and if popf set TF then the instruction after popf would have caused a debug exception.
Are you sure you're assembling 64-bit machine code and running it in 64-bit mode? The only thing that would be different if a CPU decoded your code in 32-bit mode should be the REX prefix on sub rsp, 28h, and the RIP-relative addressing mode on [x86flags] decoding as absolute (which would presumably fault). So I don't think that could explain what you're seeing.
Are you sure you're single-stepping by instructions (not source lines or C statements) with a debugger to test this?
Use a debugger to look at the machine code as you single-step. This seem really weird.
Anyway, it seems like a very low-performance idea to use pushf / popf at all, and also to be using 16-bit operand-size creating false dependencies.
e.g. you can set x86 CF with movzx ecx, word ptr [x86flags] / bt ecx, CF.
You can capture the output CF with setc cl
Also, if you're going to do multiple things to the byte from the guest memory, load it into an x86 register. A memory-destination RCR and a memory-destination ADD are unnecessarily slow vs. load / rcr / ... / test reg,reg / store.
LAHF/SAHF may be useful, but you can also do without them too for many cases. popf is quite slow (https://agner.org/optimize/) and it forces a round trip through memory. However, there is one condition-code outside the low 8 in x86 FLAGS: OF (signed overflow). asm-source compatibility with 8080 is still hurting x86 in 2019 :(
You can restore OF from a 0/1 integer with add al, 127: if AL was originally 1, it will overflow to 0x80, otherwise it won't. You can then restore the rest of the condition codes with SAHF. You can extract OF with seto al. Or you can just use pushf/popf.
; sub rsp, 28h ; works with or without this!!!
Yes of course. You have a leaf function that doesn't use any stack space.
You only need to reserve another 40 bytes (align the stack + 32 bytes of shadow space) if you were going to make another function call from this function.
Footnote 1: pushf/popf in other assemblers:
In NASM, pushf/popf default to the same width as other push/pop instructions: 8 bytes in 64-bit mode. You get the normal encoding without an operand-size prefix. (https://www.felixcloutier.com/x86/pushf:pushfd:pushfq)
Like for integer registers, both 16 and 64-bit operand-size for pushf/popf are encodeable in 64-bit mode, but 32-bit operand size isn't.
In NASM, your code would be broken because push cx / popf would push 2 bytes and pop 8, popping 6 bytes of your return address into RFLAGS.
But apparently MASM isn't like that. Probably a good idea to use explicit operand-size specifiers anyway, like pushfw and popfw if you use it at all, to make sure you get the 66 9C encoding, not just 9C pushfq.
Or better, use pushfq and pop rcx like a normal person: only write to 8 or 16-bit partial registers when you need to, and keep the stack qword-aligned. (16-byte alignment before call, 8-byte alignment always.)
I believe this is a bug in Visual Studio. I'm using 2022, so it's an issue that's been around for a while.
I don't know exactly what is triggering it, however stepping over one specific pushf in my code had the same symptoms, albeit with the code actually working.
Putting a breakpoint on the line after the pushf did break, and allowed further debugging of my app. Adding a push ax, pop ax before the pushf also seemed to fix the issue. So it must be a Visual Studio issue.
At this point I think MASM and debugging in Visual Studio has pretty much been abandoned. Any suggestions for alternatives for developing dlls on Windows would be appreciated!

Translating Go assembler to NASM

I came across the following Go code:
type Element [12]uint64
//go:noescape
func CSwap(x, y *Element, choice uint8)
//go:noescape
func Add(z, x, y *Element)
where the CSwap and Add functions are basically coming from an assembly, and look like the following:
TEXT ·CSwap(SB), NOSPLIT, $0-17
MOVQ x+0(FP), REG_P1
MOVQ y+8(FP), REG_P2
MOVB choice+16(FP), AL // AL = 0 or 1
MOVBLZX AL, AX // AX = 0 or 1
NEGQ AX // RAX = 0x00..00 or 0xff..ff
MOVQ (0*8)(REG_P1), BX
MOVQ (0*8)(REG_P2), CX
// Rest removed for brevity
TEXT ·Add(SB), NOSPLIT, $0-24
MOVQ z+0(FP), REG_P3
MOVQ x+8(FP), REG_P1
MOVQ y+16(FP), REG_P2
MOVQ (REG_P1), R8
MOVQ (8)(REG_P1), R9
MOVQ (16)(REG_P1), R10
MOVQ (24)(REG_P1), R11
// Rest removed for brevity
What I try to do is that translate the assembly to a syntax that is more familiar to me (I think mine is more like NASM), while the above syntax is Go assembler. Regarding the Add method I didn't have much problem, and translated it correctly (according to test results). It looks like this in my case:
.text
.global add_asm
add_asm:
push r12
push r13
push r14
push r15
mov r8, [reg_p1]
mov r9, [reg_p1+8]
mov r10, [reg_p1+16]
mov r11, [reg_p1+24]
// Rest removed for brevity
But, I have a problem when translating the CSwap function, I have something like this:
.text
.global cswap_asm
cswap_asm:
push r12
push r13
push r14
mov al, 16
mov rax, al
neg rax
mov rbx, [reg_p1+(0*8)]
mov rcx, [reg_p2+(0*8)]
But this doesn't seem to be quite correct, as I get error when compiling it. Any ideas how to translate the above CSwap assembly part to something like NASM?
EDIT (SOLUTION):
Okay, after the two answers below, and some testing and digging, I found out that the code uses the following three registers for parameter passing:
#define reg_p1 rdi
#define reg_p2 rsi
#define reg_p3 rdx
Accordingly, rdx has the value of the choice parameter. So, all that I had to do was use this:
movzx rax, dl // Get the lower 8 bits of rdx (reg_p3)
neg rax
Using byte [rdx] or byte [reg_3] was giving an error, but using dl seems to work fine for me.
Basic docs about Go's asm: https://golang.org/doc/asm. It's not totally equivalent to NASM or AT&T syntax: FP is a pseudo-register name for whichever register it decides to use as the frame pointer. (Typically RSP or RBP). Go asm also seems to omit function prologue (and probably epilogue) instructions. As #RossRidge comments, it's a bit more like a internal representation like LLVM IR than truly asm.
Go also has its own object-file format, so I'm not sure you can make Go-compatible object files with NASM.
If you want to call this function from something other than Go, you'll also need to port the code to a different calling convention. Go appears to be using a stack-args calling convention even for x86-64, unlike the normal x86-64 System V ABI or the x86-64 Windows calling convention. (Or maybe those mov function args into REG_P1 and so on instructions disappear when Go builds this source for a register-arg calling convention?)
(This is why you could you had to use movzx eax, dl instead of loading from the stack at all.)
BTW, rewriting this code in C instead of NASM would probably make even more sense if you want to use it with C. Small functions are best inlined and optimized away by the compiler.
It would be a good idea to check your translation, or get a starting point, by assembling with the Go assembler and using a disassembler.
objdump -drwC -Mintel or Agner Fog's objconv disassembler would be good, but they don't understand Go's object-file format. If Go has a tool to extract the actual machine code or get it in an ELF object file, do that.
If not, you could use ndisasm -b 64 (which treats input files as flat binaries, disassembling all the bytes as if they were instructions). You can specify an offset/length if you can find out where the function starts. x86 instructions are variable length, and disassembly will likely be "out of sync" at the start of the function. You might want to add a bunch of single-byte NOP instructions (kind of a NOP sled) for the disassembler, so if it decodes some 0x90 bytes as part of an immediate or disp32 for a long instruction that was really not part of the function, it will be in sync. (But the function prologue will still be messed up).
You might add some "signpost" instructions to your Go asm functions to make it easy to find the right place in the mess of crazy asm from disassembling metadata as instructions. e.g. put a pmuludq xmm0, xmm0 in there somewhere, or some other instruction with a unique mnemonic that you can search for which the Go code doesn't include. Or an instruction with an immediate that will stand out, like addq $0x1234567, SP. (An instruction that will crash so you don't forget to take it out again is good here.)
Or you could use gdb's built-in disassembler: add an instruction that will segfault (like a load from a bogus absolute address (movl 0, AX null-pointer deref), or a register holding a non-pointer value e.g. movl (AX), AX). Then you'll have an instruction-pointer value for the instructions in memory, and can disassemble from some point behind that. (Probably the function start will be 16-byte aligned.)
Specific instructions.
MOVBLZX AL, AX reads AL, so that's definitely an 8-bit operand. The size for AX is given by the L part of the mnemonic, meaning long for 32 bit, like in GAS AT&T syntax. (The gas mnemonic for that form of movzx is movzbl %al, %eax). See What does cltq do in assembly? for a table of cdq / cdqe and the AT&T equivalent, and the AT&T / Intel mnemonic for the equivalent MOVSX instruction.
The NASM instruction you want is movzx eax, al. Using rax as the destination would be a waste of a REX prefix. Using ax as the destination would be a mistake: it wouldn't zero-extend into the full register, and would leave whatever high garbage. Go asm syntax for x86 is very confusing when you're not used to it, because AX can mean AX, EAX, or RAX depending on the operand size.
Obviously mov rax, al isn't a possibility: Like most instructions, mov requires both its operands to be the same size. movzx is one of the rare exceptions.
MOVB choice+16(FP), AL is a byte load into AL, not an immediate move. choice+16 is a an offset from FP. This syntax is basically the same as AT&T addressing modes, with FP as a register and choice as an assemble-time constant.
FP is a pseudo-register name. It's pretty clear that it should simply be loading the low byte of the 3rd arg-passing slot, because choice is the name of a function arg. (In Go asm, choice is just syntactic sugar, or a constant defined as zero.)
Before a call instruction, rsp points at the first stack arg, so that + 16 is the 3rd arg. It appears that FP is that base address (and might actually be rsp+8 or something). After a call (which pushes an 8 byte return address), the 3rd stack arg is at rsp + 24. After more pushes, the offset will be even larger, so adjust as necessary to reach the right location.
If you're porting this function to be called with a standard calling convention, the 3 integer args will be passed in registers, with no stack args. Which 3 registers depends on whether you're building for Windows vs. non-Windows. (See Agner Fog's calling conventions doc: http://agner.org/optimize/)
BTW, a byte load into AL and then movzx eax, al is just dumb. Much more efficient on all modern CPUs to do it in one step with
movzx eax, byte [rsp + 24] ; or rbp+32 if you made a stack frame.
I hope the source in the question is from un-optimized Go compiler output? Or the assembler itself makes such optimizations?
I think you can translate these as just
mov rbx, [reg_p1]
mov rcx, [reg_p2]
Unless I'm missing some subtlety, the offsets which are zero can just be ignored. The *8 isn't a size hint since that's already in the instruction.
The rest of your code looks wrong though. The MOVB choice+16(FP), AL in the original is supposed to be fetching the choice argument into AL, but you're setting AL to a constant 16, and the code for loading the other arguments seems to be completely missing, as is the code for all of the arguments in the other function.

How can I print numbers in my assembly program

I have a problem with my assembly program. My assembly compiler is NASM. The source and the outputs are in this picture:
The problem is that I can't print numbers from calculations with the extern C function printf(). How can I do it?
The output should be "Ergebnis: 8" but it isn't correct.
In NASM documentation it is pointed that NASM Requires Square Brackets For Memory References. When you write label name without bracket NASM gives its memory address (or offset as it is called sometimes). So, mov eax, val_1 it means that eax register gets val_1's offset. When you add eax, val_2, val_2 offset is added to val_1 offset and you get the result you see.
Write instead:
mov eax, [val_1]
add eax, [val_2]
And you shoul get 8 in eax.
P.S. It seems that you have just switched to NASM from MASM or TASM.
There are a lot of guides for switchers like you. See for example nice tutorials here and here.

How to use int3( VEH or SEH )

Know I want to replace function prologue with jmp to jump to my allocate zone(VirtualAllocateEx). But function prologue just have 3 bytes, and jmp have 5 bytes.
like this:
55 `push ebp`
8B EC `mov ebp, esp`
833D C4354200 02 `cmp dword ptr ds:[4235C4],2`
E9 AD00000000 `jmp` 00140000 // replace above three instructions
If I want to use jmp to cover function prologue, the third instruction after function prologue must be covered.
So know I want to use int3 to replace function prologue to to jump to my allocate zone or any address, how can I do it?
I try to use VEH or SEH to do so, but I can't figure out how to make it.
You need to write the original code (the one you quoted) on another memory location (just allocate something).
Write it while saving some space for the additional OpCodes (your custom new code).
It doesn't have to fit exactly as you're allowed to fill the unused bytes with NOP (0x90 if I'm not mistaken).
Now, jump to this code from the original code.
I've been doing this stuff when I was making game trainers years ago.. Works very well.
On thing to note: Your reWritten code should, at the end, jump back to the original place to continue the code flow.
Let me know if it's unclear.

cedecl calling convention -- compiled asm instructions cause crash

Treat this more as pseudocode than anything. If there's some macro or other element that you feel should be included, let me know.
I'm rather new to assembly. I programmed on a pic processor back in college, but nothing since.
The problem here (segmentation fault) is the first instruction after "Compile function entrance, setup stack frame." or "push %ebp". Here's what I found out about those two instructions:
http://unixwiz.net/techtips/win32-callconv-asm.html
Save and update the %ebp :
Now that we're in the new function, we need a new local stack frame pointed to by %ebp, so this is done by saving the current %ebp (which belongs to the previous function's frame) and making it point to the top of the stack.
push ebp
mov ebp, esp // ebp « esp
Once %ebp has been changed, it can now refer directly to the function's arguments as 8(%ebp), 12(%ebp). Note that 0(%ebp) is the old base pointer and 4(%ebp) is the old instruction pointer.
Here's the code. This is from a JIT compiler for a project I'm working on. I'm doing this more for the learning experience than anything.
IL_CORE_COMPILE(avs_x86_compiler_compile)
{
X86GlobalData *gd = X86_GLOBALDATA(ctx);
ILInstruction *insn;
avs_debug(print("X86: Compiling started..."));
/* Initialize X86 Assembler opcode context */
x86_context_init(&gd->ctx, 4096, 1024*1024);
/* Compile function entrance, setup stack frame*/
x86_emit1(&gd->ctx, pushl, ebp);
x86_emit2(&gd->ctx, movl, esp, ebp);
/* Setup floating point rounding mode to integer truncation */
x86_emit2(&gd->ctx, subl, imm(8), esp);
x86_emit1(&gd->ctx, fstcw, disp(0, esp));
x86_emit2(&gd->ctx, movl, disp(0, esp), eax);
x86_emit2(&gd->ctx, orl, imm(0xc00), eax);
x86_emit2(&gd->ctx, movl, eax, disp(4, esp));
x86_emit1(&gd->ctx, fldcw, disp(4, esp));
for (insn=avs_il_tree_base(tree); insn != NULL; insn = insn->next) {
avs_debug(print("X86: Compiling instruction: %p", insn));
compile_opcode(gd, obj, insn);
}
/* Restore floating point rounding mode */
x86_emit1(&gd->ctx, fldcw, disp(0, esp));
x86_emit2(&gd->ctx, addl, imm(8), esp);
/* Cleanup stack frame */
x86_emit0(&gd->ctx, emms);
x86_emit0(&gd->ctx, leave);
x86_emit0(&gd->ctx, ret);
/* Link machine */
obj->run = (AvsRunnableExecuteCall) gd->ctx.buf;
return 0;
}
And when obj->run is called, it's called with obj as its only argument:
obj->run(obj);
If it helps, here are the instructions for the entire function call. It's basically an assignment operation: foo=3*0.2;. foo is pointing to a float in C.
0x8067990: push %ebp
0x8067991: mov %esp,%ebp
0x8067993: sub $0x8,%esp
0x8067999: fnstcw (%esp)
0x806799c: mov (%esp),%eax
0x806799f: or $0xc00,%eax
0x80679a4: mov %eax,0x4(%esp)
0x80679a8: fldcw 0x4(%esp)
0x80679ac: flds 0x806793c
0x80679b2: fsts 0x805f014
0x80679b8: fstps 0x8067954
0x80679be: fldcw (%esp)
0x80679c1: add $0x8,%esp
0x80679c7: emms
0x80679c9: leave
0x80679ca: ret
Edit: Like I said above, in the first instruction in this function, %ebp is void. This is also the instruction that causes the segmentation fault. Is that because it's void, or am I looking for something else?
Edit: Scratch that. I keep typing edp instead of ebp. Here are the values of ebp and esp.
(gdb) print $esp
$1 = (void *) 0xbffff14c
(gdb) print $ebp
$3 = (void *) 0xbffff168
Edit: Those values above are wrong. I should have used the 'x' command, like below:
(gdb) x/x $ebp
0xbffff168: 0xbffff188
(gdb) x/x $esp
0xbffff14c: 0x0804e481
Here's a reply from someone on a mailing list regarding this. Anyone care to illuminate what he means a bit? How do I check to see how the stack is set up?
An immediate problem I see is that the
stack pointer is not properly aligned.
This is 32-bit code, and the Intel
manual says that the stack should be
aligned at 32-bit addresses. That is,
the least significant digit in esp
should be 0, 4, 8, or c.
I also note that the values in ebp and
esp are very far apart. Typically,
they contain similar values --
addresses somewhere in the stack.
I would look at how the stack was set
up in this program.
He replied with corrections to the above comments. He was unable to see any problems after further input.
Another edit: Someone replied that the code page may not be marked executable. How can I insure it is marked as such?
The problem had nothing to do with the code. Adding -z execstack to the linker fixed the problem.
If push %ebp is causing a segfault, then your stack pointer isn't pointing at valid stack. How does control reach that point? What platform are you on, and is there anything odd about the runtime environment? At the entry to the function, %esp should point to the return address in the caller on the stack. Does it?
Aside from that, the whole function is pretty weird. You go out of your way to set the rounding bits in the fp control word, and then don't perform any operations that are affected by rounding. All the function does is copy some data, but uses floating-point registers to do it when you could use the integer registers just as well. And then there's the spurious emms, which you need after using MMX instructions, not after doing x87 computations.
Edit See Scott's (the original questioner) answer for the actual reason for the crash.

Resources