I am currently writing about stack overflows and created the following program below (for reference, I am using a 32-bit linux machine):
#include <stdio.h>
#include <string.h>
int main () {
char foo[16];
strcpy(foo, "AAAAAAAAAAAAAAAABBBBC");
}
This program was compiled with
gcc stack_buffer_2.c -g -o stack_buffer_2 -fno-stack-protector
What this program does is:
fill all "foo" array with 'A' chars
fill saved EBP with 'B' chars
write 'C' char (plus null char) to saved ESP
I verified the behavior of this program in GDB andhere is what I got:
(gdb) run
Program received signal SIGSEGV, Segmentation fault.
0xb7e30043 in ?? () from /lib/i386-linux-gnu/libc.so.6
The program tried to jump to 0xb7e30043 (overflown value, with 0x43 and null terminator) instead of the original value 0xb7e31637, which is a return point from __libc_start_main, in this context.
Here is where things get tricky for me. Doing a "info proc mappings", I see that 0xb7e30043 seems to be like "valid" memory (inside /lib/i386-linux-gnu/libc.so.6)
I am aware that segfaults can be triggered by memory I don't "own", or to unmapped segments. But in this case, since this is inside the memory space from my application, and libc is mapped to it, why I am getting a segfault?
I suspect this is due to some kind of misalignment, since after the overflow EIP is pointing to a random instruction inside libc. If it is, how can I adjust it? What are the constraints behind it, and who enforces it? The kernel, the MMU?
The problem was that the instruction at 0xb7e30043 was invalid.
Here is the current instruction:
0xb7e30043: add %cl,(%eax)
I looked into the eax value , and it is zero. From there, things are easier. The process tried to access memory at 0x0 and failed with a segfault.
Related
I have a few questions about Stack Guard and SSP protections. First question is about Stack Guard and its three types of canaries, if I am correctly - terminator, random and random XOR.
I'd like to know, how to disabled Stack Guard on x86 Linux system? Somewhere I read, it's possible with this command, while compiling with gcc '-disable-stackguard-randomization', it's same like with this command for enable '-enable-stackguard-randomization', both doesn't work. If needed, my gcc version is 4.8.2.
Next question about Stack guard, when I will able to enable/disable it, how can I set, which type of canaries I want to use? What I read, terminator canaries are used by default, for random I have to compiled with '-enable-stackguard-randomization', but how about random XOR? (Or with null 0x00000000)
Now about SSP(ProPolice), I know, for random canary I have to compiled with 'fstack-protector-all', but how about terminator, is it same as in Stack Guard, by default?
Last one, if anyone of you, can tell me, where I can find random canary in memory. For example, I have this scenario - compiled C program, like 'gcc -g example.c -o example -fstack-protector-all', so with random canaries. Let's say, I'm able to get address of canary, after every execution. So expect, I have: Canary = 0x1ae3f900. From a different papers, I get some info, that canary is located in .bss segment. So I get address of .bss segment using readelf: 'readelf -a ./example | grep bss'. It's 080456c9. In gdb I set some breakpoints, to get address of canary, but when I check .bss address x/20x 0x080456c9, all I see are only 0x00000000 addresses, but
canary is nowhere. Plus, I checked __stack_chk_fail's if it isn't there, but with same result, I can't see it there. I get address of stack_chk_fail from PLT/GOT.
Thank in advance for your answer and time.
Stack Smashing Protection (SSP) is an improvement over StackGuard. SSP was first implemented in gcc 4.1.
I'd like to know, how to disabled Stack Guard on x86 Linux system?
Use -fno-stack-protector to disable the userland SSP.
The --disable-stackguard-randomization and --enable-stackguard-randomization are build options for glibc source code.
when I will able to enable/disable it, how can I set, which type of
canaries I want to use?
This is not configurable in gcc as far as I know. Since glibc 2.10, the stack canary is generated in a function called _dl_setup_stack_chk_guard. Here is some part of its code:
if (dl_random == NULL)
{
ret.bytes[sizeof (ret) - 1] = 255;
ret.bytes[sizeof (ret) - 2] = '\n';
}
else
{
memcpy (ret.bytes, dl_random, sizeof (ret));
ret.num &= ~(uintptr_t) 0xff;
}
dl_random holds the address of the auxiliary vector entry for AT_RANDOM, which is a 16-byte random value initialized by the kernel while creating the process. If you are running on a kernel or an emulator that doesn't initialize AT_RANDOM, the check dl_random == NULL would be true and the canary used is the terminator value with the first and second most significant bytes initialized to 255 and \n, respectively. All other bytes are zero. Usually AT_RANDOM is initialized by the kernel and so the least 7 significant bytes of AT_RANDOM are copied. The last byte of canary is set to zero.
So if you want to use a particular method to generate the canary, you can change this code and build you own glibc.
As an alternative method, #PeterCordes have suggested in the comments to write your canary value to memory location %%fs:0x28 (see the code below) at the top of the main function and restore the runtime-generated canary just before returning from main.
Now about SSP(ProPolice), I know, for random canary I have to compiled
with 'fstack-protector-all', but how about terminator, is it same as
in Stack Guard, by default?
All variants of the -fstack-protector option use SSP. These don't affect how the canary is generated.
Last one, if anyone of you, can tell me, where I can find random
canary in memory.
The canary is generated dynamically early at process startup; you can't use readelf to get the canary. According to this article, you can use the following code to get the canary when compiling for i386:
int read_canary()
{
int val = 0;
__asm__("movl %%gs:0x14, %0;"
: "=r"(val)
:
:);
return val;
}
and for x86_64:
long read_canary()
{
long val = 0;
__asm__("movq %%fs:0x28, %0;"
: "=r"(val)
:
:);
return val;
}
I am using the esp value of kernel stack to calculate the process descriptor pointer value.
According to ULK book, I just need to mask 13 least significant bits of esp to obtain the base address of the thread_info structure.
My test is:
write a kernel module because I need to get value of kernel stack
In the kernel init function, get the value of kernel stack
use following formula to get the process descriptor pointer of the process running on the CPU: *((unsigned int*) esp & 0xffffe000)
use the current macro, print out its value.
I think the value of step3 should be same as the value of step 4.
But my experiment results shows: sometimes they are same, and sometimes they are different. Could any explain why? Or am I missing anything?
This is because at the base of the kernel stack you will find a struct thread_info instance (platform dependent) and not a struct task_struct. The current() macro provides a pointer to the current task_struct.
Try the following:
struct thread_info *info = (struct thread_info*)(esp & 0xfffe000);
struct task_struct *my_current = info->task;
Now you can compare my_current with current().
Finally, I solved this problem. Everything is correct expect for the size of kernel stack. My kernel use 4KB stack instead of 8KB stack. So I just need to mask low 12 bits of the ESP.
Thanks for all the suggestions and answer!
I'm using MinGW64 build based on GCC 4.6.1 for Windows 64bit target. I'm playing around with the new Intel's AVX instructions. My command line arguments are -march=corei7-avx -mtune=corei7-avx -mavx.
But I started running into segmentation fault errors when allocating local variables on the stack. GCC uses the aligned moves VMOVAPS and VMOVAPD to move __m256 and __m256d around, and these instructions require 32-byte alignment. However, the stack for Windows 64bit has only 16 byte alignment.
How can I change the GCC's stack alignment to 32 bytes?
I have tried using -mstackrealign but to no avail, since that aligns only to 16 bytes. I couldn't make __attribute__((force_align_arg_pointer)) work either, it aligns to 16 bytes anyway. I haven't been able to find any other compiler options that would address this. Any help is greatly appreciated.
EDIT:
I tried using -mpreferred-stack-boundary=5, but GCC says that 5 is not supported for this target. I'm out of ideas.
I have been exploring the issue, filed a GCC bug report, and found out that this is a MinGW64 related problem. See GCC Bug#49001. Apparently, GCC doesn't support 32-byte stack alignment on Windows. This effectively prevents the use of 256-bit AVX instructions.
I investigated a couple ways how to deal with this issue. The simplest and bluntest solution is to replace of aligned memory accesses VMOVAPS/PD/DQA by unaligned alternatives VMOVUPS etc. So I learned Python last night (very nice tool, by the way) and pulled off the following script that does the job with an input assembler file produced by GCC:
import re
import fileinput
import sys
# fix aligned stack access
# replace aligned vmov* by unaligned vmov* with 32-byte aligned operands
# see Intel's AVX programming guide, page 39
vmova = re.compile(r"\s*?vmov(\w+).*?((\(%r.*?%ymm)|(%ymm.*?\(%r))")
aligndict = {"aps" : "ups", "apd" : "upd", "dqa" : "dqu"};
for line in fileinput.FileInput(sys.argv[1:],inplace=1):
m = vmova.match(line)
if m and m.group(1) in aligndict:
s = m.group(1)
print line.replace("vmov"+s, "vmov"+aligndict[s]),
else:
print line,
This approach is pretty safe and foolproof. Though I observed a performance penalty on rare occasions. When the stack is unaligned, the memory access crosses the cache line boundary. Fortunately, the code performs as fast as aligned accesses most of the time. My recommendation: inline functions in critical loops!
I also attempted to fix the stack allocation in every function prolog using another Python script, trying to align it always at the 32-byte boundary. This seems to work for some code, but not for other. I have to rely on the good will of GCC that it will allocate aligned local variables (with respect to the stack pointer), which it usually does. This is not always the case, especially when there is a serious register spilling due to the necessity to save all ymm register before a function call. (All ymm registers are callee-save). I can post the script if there's an interest.
The best solution would be to fix GCC MinGW64 build. Unfortunately, I have no knowledge of its internal workings, just started using it last week.
You can get the effect you want by
Declaring your variables not as variables, but as fields in a struct
Declaring an array that is larger than the structure by an appropriate amount of padding
Doing pointer/address arithmetic to find a 32 byte aligned address in side the array
Casting that address to a pointer to your struct
Finally using the data members of your struct
You can use the same technique when malloc() does not align stuff on the heap appropriately.
E.g.
void foo() {
struct I_wish_these_were_32B_aligned {
vec32B foo;
char bar[32];
}; // not - no variable definition, just the struct declaration.
unsigned char a[sizeof(I_wish_these_were_32B_aligned) + 32)];
unsigned char* a_aligned_to_32B = align_to_32B(a);
I_wish_these_were_32B_aligned* s = (I_wish_these_were_32B_aligned)a_aligned_to_32B;
s->foo = ...
}
where
unsigned char* align_to_32B(unsiged char* a) {
uint64_t u = (unit64_t)a;
mask_aligned32B = (1 << 5) - 1;
if (u & mask_aligned32B == 0) return (unsigned char*)u;
return (unsigned char*)((u|mask_aligned_32B) + 1);
}
I just ran in the same issue of having segmentation faults when using AVX inside my functions. And it was also due to the stack misalignment. Given the fact that this is a compiler issue (and the options that could help are not available in Windows), I worked around the stack usage by:
Using static variables (see this issue). Given the fact that they are not stored in the stack, you can force their alignment by using __attribute__((align(32))) in your declaration. For example: static __m256i r __attribute__((aligned(32))).
Inlining the functions/methods receiving/returning AVX data. You can force GCC to inline your function/method by adding inline and __attribute__((always_inline)) to your function prototype/declaration. Inlining your functions increase the size of your program, but they also prevent the function from using the stack (and hence, avoids the stack-alignment issue). Example: inline __m256i myAvxFunction(void) __attribute__((always_inline));.
Be aware that the usage of static variables is no thread-safe, as mentioned in the reference. If you are writing a multi-threaded application you may have to add some protection for your critical paths.
I would like to test a buffer-overflow by writing "Hello World" to console (using Windows XP 32-Bit). The shellcode needs to be null-free in order to be passed by "scanf" into the program I want to overflow. I've found plenty of assembly-tutorials for Linux, however none for Windows. Could someone please step me through this using NASM? Thxxx!
Assembly opcodes are the same, so the regular tricks to produce null-free shellcodes still apply, but the way to make system calls is different.
In Linux you make system calls with the "int 0x80" instruction, while on Windows you must use DLL libraries and do normal usermode calls to their exported functions.
For that reason, on Windows your shellcode must either:
Hardcode the Win32 API function addresses (most likely will only work on your machine)
Use a Win32 API resolver shellcode (works on every Windows version)
If you're just learning, for now it's probably easier to just hardcode the addresses you see in the debugger. To make the calls position independent you can load the addresses in registers. For example, a call to a function with 4 arguments:
PUSH 4 ; argument #4 to the function
PUSH 3 ; argument #3 to the function
PUSH 2 ; argument #2 to the function
PUSH 1 ; argument #1 to the function
MOV EAX, 0xDEADBEEF ; put the address of the function to call
CALL EAX
Note that the argument are pushed in reverse order. After the CALL instruction EAX contains the return value, and the stack will be just like it was before (i.e. the function pops its own arguments). The ECX and EDX registers may contain garbage, so don't rely on them keeping their values after the call.
A direct CALL instruction won't work, because those are position dependent.
To avoid zeros in the address itself try any of the null-free tricks for x86 shellcode, there are many out there but my favorite (albeit lengthy) is encoding the values using XOR instructions:
MOV EAX, 0xDEADBEEF ^ 0xFFFFFFFF ; your value xor'ed against an arbitrary mask
XOR EAX, 0xFFFFFFFF ; the arbitrary mask
You can also try NEG EAX or NOT EAX (sign inversion and bit flipping) to see if they work, it's much cheaper (two bytes each).
You can get help on the different API functions you can call here: http://msdn.microsoft.com
The most important ones you'll need are probably the following:
WinExec(): http://msdn.microsoft.com/en-us/library/ms687393(VS.85).aspx
LoadLibrary(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175(v=vs.85).aspx
GetProcAddress(): http://msdn.microsoft.com/en-us/library/ms683212%28v=VS.85%29.aspx
The first launches a command, the next two are for loading DLL files and getting the addresses of its functions.
Here's a complete tutorial on writing Windows shellcodes: http://www.codeproject.com/Articles/325776/The-Art-of-Win32-Shellcoding
Assembly language is defined by your processor, and assembly syntax is defined by the assembler (hence, at&t, and intel syntax) The main difference (at least i think it used to be...) is that windows is real-mode (call the actual interrupts to do stuff, and you can use all the memory accessible to your computer, instead of just your program) and linux is protected mode (You only have access to memory in your program's little cubby of memory, and you have to call int 0x80 and make calls to the kernel, instead of making calls to the hardware and bios) Anyway, hello world type stuff would more-or-less be the same between linux and windows, as long as they are compatible processors.
To get the shellcode from your program you've made, just load it into your target system's
debugger (gdb for linux, and debug for windows) and in debug, type d (or was it u? Anyway, it should say if you type h (help)) and between instructions and memory will be the opcodes.
Just copy them all over to your text editor into one string, and maybe make a program that translates them all into their ascii values. Not sure how to do this in gdb tho...
Anyway, to make it into a bof exploit, enter aaaaa... and keep adding a's until it crashes
from a buffer overflow error. But find exactly how many a's it takes to crash it. Then, it should tell you what memory adress that was. Usually it should tell you in the error message. If it says '9797[rest of original return adress]' then you got it. Now u gotta use ur debugger to find out where this was. disassemble the program with your debugger and look for where scanf was called. Set a breakpoint there, run and examine the stack. Look for all those 97's (which i forgot to mention is the ascii number for 'a'.) and see where they end. Then remove breakpoint and type the amount of a's you found out it took (exactly the amount. If the error message was "buffer overflow at '97[rest of original return adress]" then remove that last a, put the adress you found examining the stack, and insert your shellcode. If all goes well, you should see your shellcode execute.
Happy hacking...
I never thought I'd be posting an assembly question. :-)
In GCC, there is an extended version of the asm function. This function can take four parameters: assembly-code, output-list, input-list and overwrite-list.
My question is, are the registers in the overwrite-list zeroed out? What happens to the values that were previously in there (from other code executing).
Update: In considering my answers thus far (thank you!), I want to add that though a register is listed in the clobber-list, it (in my instance) is being used in a pop (popl) command. There is no other reference.
No, they are not zeroed out. The purpose of the overwrite list (more commonly called the clobber list) is to inform GCC that, as a result of the asm instructions the register(s) listed in the clobber list will be modified, and so the compiler should preserve any which are currently live.
For example, on x86 the cpuid instruction returns information in four parts using four fixed registers: %eax, %ebx, %ecx and %edx, based on the input value of %eax. If we were only interested in the result in %eax and %ebx, then we might (naively) write:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2) );
This would get the first and second parts of the result in C variables input_res1 and res2; however if GCC was using %ecx and %edx to hold other data; they would be overwritten by the cpuid instruction without gcc knowing. To prevent this; we use the clobber list:
int input_res1 = 0; // also used for first part of result
int res2;
__asm__("cpuid" : "+a"(input_res1), "=b"(res2)
: : "%ecx", "%edx" );
As we have told GCC that %ecx and %edx will be overwritten by this asm call, it can handle the situation correctly - either by not using %ecx or %edx, or by saving their values to the stack before the asm function and restoring after.
Update:
With regards to your second question (why you are seeing a register listed in the clobber list for a popl instruction) - assuming your asm looks something like:
__asm__("popl %eax" : : : "%eax" );
Then the code here is popping an item off the stack, however it doesn't care about the actual value - it's probably just keeping the stack balanced, or the value isn't needed in this code path. By writing this way, as opposed to:
int trash // don't ever use this.
__asm__("popl %0" : "=r"(trash));
You don't have to explicitly create a temporary variable to hold the unwanted value. Admittedly in this case there isn't a huge difference between the two, but the version with the clobber makes it clear that you don't care about the value from the stack.
If by "zeroed out" you mean "the values in the registers are replaced with 0's to prevent me from knowing what some other function was doing" then no, the registers are not zeroed out before use. But it shouldn't matter because you're telling GCC you plan to store information there, not that you want to read information that's currently there.
You give this information to GCC so that (reading the documentation) "you need not guess which registers or memory locations will contain the data you want to use" when you're finished with the assembly code (eg., you don't have to remember if the data will be in the stack register, or some other register).
GCC needs a lot of help for assembly code because "The compiler ... does not parse the assembler instruction template and does not know what it means or even whether it is valid assembler input. The extended asm feature is most often used for machine instructions the compiler itself does not know exist."
Update
GCC is designed as a multi-pass compiler. Many of the passes are in fact entirely different programs. A set of programs forming "the compiler" translate your source from C, C++, Ada, Java, etc. into assembly code. Then a separate program (gas, for GNU Assembler) takes that assembly code and turns it into a binary (and then ld and collect2 do more things to the binary). Assembly blocks exist to pass text directly to gas, and the clobber-list (and input list) exist so that the compiler can do whatever set up is needed to pass information between the C, C++, Ada, Java, etc. side of things and the gas side of things, and to guarantee that any important information currently in registers can be protected from the assembly block by copying it to memory before the assembly block runs (and copying back from memory afterward).
The alternative would be to save and restore every register for every assembly code block. On a RISC machine with a large number of registers that could get expensive (the Itanium has 128 general registers, another 128 floating point registers and 64 1-bit registers, for instance).
It's been a while since I've written any assembly code. And I have much more experience using GCC's named registers feature than doing things with specific registers. So, looking at an example:
#include <stdio.h>
long foo(long l)
{
long result;
asm (
"movl %[l], %[reg];"
"incl %[reg];"
: [reg] "=r" (result)
: [l] "r" (l)
);
return result;
}
int main(int argc, char** argv)
{
printf("%ld\n", foo(5L));
}
I have asked for an output register, which I will call reg inside the assembly code, and that GCC will automatically copy to the result variable on completion. There is no need to give this variable different names in C code vs assembly code; I only did it to show that it is possible. Whichever physical register GCC decides to use -- whether it's %%eax, %%ebx, %%ecx, etc. -- GCC will take care of copying any important data from that register into memory when I enter the assembly block so that I have full use of that register until the end of the assembly block.
I have also asked for an input register, which I will call l both in C and in assembly. GCC promises that whatever physical register it decides to give me will have the value currently in the C variable l when I enter the assembly block. GCC will also do any needed recordkeeping to protect any data that happens to be in that register before I enter the assembly block.
What if I add a line to the assembly code? Say:
"addl %[reg], %%ecx;"
Since the compiler part of GCC doesn't check the assembly code it won't have protected the data in %%ecx. If I'm lucky, %%ecx may happen to be one of the registers GCC decided to use for %[reg] or %[l]. If I'm not lucky, I will have "mysteriously" changed a value in some other part of my program.
I suspect the overwrite list is just to give GCC a hint not to store anything of value in these registers across the ASM call; since GCC doesn't analyze what ASM you're giving it, and certain instructions have side-effects that touch other registers not explicitly named in the code, this is the way to tell GCC about it.