I would like to know, is there any Windows platform disassembler (software) can generate the assembly source code which is also compilable by an assembler?
Since disassembler can generate the assembly code based on an EXE file, is it possible the assembly code be used directly as a source code, then the source code be compiled by an assembler like NASM?
IDA can generate the source code. But in most cases you can't edit it. Assume the following code:
loc_401020:
ret
; ...
dd 0FFFFFFFFh, 0, 1, 401020h, 0
; ^^^^^^^ can you find it in big real program?
to insert any new bytes to you must either be shure that any sub_XXXX or loc_XXXX will remain at the same offset, either you must replace all its references to labels.
If you don't move any code, you don't need to recompile it - just patch and maybe extend the code section.
I think IDA is quite good at this.
Anyway, the main problem would be that the generated assembly code would be quite unreadable and very hard to mantain (no variable names, function names and signatures), so altough technically it would be ASM code, it's still better to use IDA as clever editor to deduce this information.
I'd patch the executable directly in a debugger and then save the modified executable.
Decompiling and then recompiling is a fragile process since any change of position of the reassembled code can break the program. And I think there are multiple binary representations of certain asm instructions complicating the matter ever further.
You are losing some information when disassembling an executable so you are unlikely to get a fully working executable when assembling the disassembly. If you are clever, you can extract single functions from an executable, but not the whole program.
The objconv disassembler can produce assembly code in masm, nasm, yasm and gas syntax.
Related
Im trying to learn assembly, first i was using NASM for the compiling, but then i understood that i could use .s files in gcc. This interested me greatly, since my goal for this is to be able to write a compiler for a custom language, so this was very intriguing, as it would allow me to link and compile with c code. So filled with excitement, I started compiling c to assembly (.s files) with gcc, and examen it. As I was doing this, it seamed to be structured in a different way then NASM assembly, with only main label, f.eks, and not _start, and other weird structure, and im not talking about Intel- vs AT&T syntax. So then my question follows:
Is it a different structure, in normal assembly and the .s files in gcc, or is it just me not having a good enough knowlage of assembly? If it is a different structure, does it have a name?
I have been trying to google my way to this for hours, but when i search for gcc assembly, and other things I can think of, I only get c inline assembly...
Please help, im going crazy from not figuring this out.
gcc emits definitions for all the functions present in the translation unit. (unless they're static inline or static and unused or it chooses to inline them everywhere...).
The CRT start files (linked by default by gcc, not re-built from source every time you compile) provides the definition for _start and the other functions you'll see if you disassemble the binary. They're only linked in at the link stage, not as part of compiling a .c to a .s, so you don't see them in gcc -S output.
Related: How to remove "noise" from GCC/clang assembly output? for tips on making compiler asm output human-readable.
Sorry if the title is not very clear. I am using MinGW with GCC 6.3.0 to build a x86 32-bit DLL on Windows (so far). I'll spare you the details why I need hacky offsets amongst its sections accessible from code, so please do not ask if it's useful or not (because I don't want to bother explaining that).
So, if I can get the following testcase to work, I'm good. Here's my problem:
In a C++ file, I want to access a linker symbol as an absolute numeric value, not relocated, directly. Remember that I am building a 32-bit DLL which requires a .reloc section for relocations, but in this case I do NOT want relocation, in fact a relocation would screw it up completely.
Here's an example: retrieve the offset of say __imp__MessageBoxW#16 relative to __IAT_start__, in case you don't know what they are, __imp__MessageBoxW#16 is the relocated pointer to the actual function at runtime, and __IAT_start__ is a linker symbol in the default script file. Here's where it is defined:
.idata BLOCK(__section_alignment__) :
{
/* This cannot currently be handled with grouped sections.
See pe.em:sort_sections. */
KEEP (SORT(*)(.idata$2))
KEEP (SORT(*)(.idata$3))
/* These zeroes mark the end of the import list. */
LONG (0); LONG (0); LONG (0); LONG (0); LONG (0);
KEEP (SORT(*)(.idata$4))
__IAT_start__ = .;
KEEP (SORT(*)(.idata$5))
__IAT_end__ = .;
KEEP (SORT(*)(.idata$6))
KEEP (SORT(*)(.idata$7))
}
So far, no problem. Because GAS doesn't allow me to "subtract" two externally defined symbols (both symbols are defined in the linker), I have to define the symbol in the linker script, so at the end of the linker script I have this:
test_symbol = ABSOLUTE("__imp__MessageBoxW#16" - __IAT_start__);
Then in C++ I use this little inline asm to retrieve this relative difference which is supposed to be a fixed value once linked:
asm("movl $test_symbol, %0":"=r"(var));
Now var should contain that fixed number right? Wrong!
Because test_symbol is an "undefined" symbol as far as the assembler is concerned, it makes it relocated. Or I don't know why, but I tried so many things to force it to be an "absolute constant value symbol" instead of a "relocated symbol" to no avail. Even editing the linker script with many things like LD_FEATURE("SANE_EXPR") and others, doesn't work at all.
Its value is correct only if the DLL does not get relocated.
You see, either GNU LD or the assembler adds an entry in the .reloc section for that movl instruction, which is WRONG!
Is there a way to force it to treat an external/undefined symbol as a fixed CONSTANT and apply no relocation to it whatsoever? Basically, omit it from the .reloc section.
I am going crazy with this, please tell me there's something easy I overlooked, I searched for hours!
In other words, is there a way to use a Linker Symbol from within inline asm/C++ without having it relocated whatsoever? No entry to the .reloc section or anything, basically same as a constant like $1234. So if a DLL gets loaded into another base address, that constant would be the same everytime.
UPDATE: I forgot about this question but decided to bring an update, since it seems it's likely not possible as nobody even commented. For anyone else in the same boat as me, I presume this is a limitation of the COFF object format itself. In other words, external symbols are implicitly relocated, and it doesn't seem there's a way against this.
I didn't "fix" it the way I wanted, I did it in a very hacky way though. If anyone is interested, here's my ugly "hack":
First I put a special "custom" instruction in the inline assembly where I reference this external symbol from C++. This "custom" instruction holds a placeholder instruction that grabs the symbol (normal x86 asm instruction with a dummy constant, e.g. 1234) and a way to identify it. Then let GCC generate the assembly files (.S files), then I parse the assembly with a simple script and when I find that "custom" instruction I insert a label for the linker (make it .global) and at the same time add a directive to a custom "on-the-fly" generated linker script that gets included from my main linker script at the end.
This places data in a temporary section in the resulting DLL with absolute offsets to the custom instruction that I need, but without relocation.
Next, I parse the binary DLL itself, in particular that temporary section I added with all this hack. I take the offsets from there, convert them to file offsets, and modify the DLL's .text section directly where those offsets point (remember those placeholder instructions? it is replacing their immediate constants 1234 with the respective value from the linker's non-relocated constant). Then I strip the temporary section from the DLL, and it's done. Of course, all of this is done automatically by a helper program and script
It's an insane hack, but it works and it's fully automatic now that I got it going. If my assumption is correct that COFF doesn't support non-relocated external symbols, then it's really the only way to use linker constants from C++ without them being relocated, which would be a disaster.
I've been looking recently into creating a new native language. I understand the (very) basics of the PE format and I've grabbed an assembler with a fairly kind interface off the webs, which I've successfully used to implement some simple functions. But I've run into a problem using functions from a library. The only way that I've called library functions from a dynamically compiled function previously is to pass in the function pointer manually- something I can't do if I create PE files and execute them in their own process. Now, I'm not planning on using the CRT, but I will need access to the Win API to implement my own standard libraries. How do I generate a reference to a WinAPI function so that the PE loader will patch it up?
You need to write an import table. It's basically a list of function names that you wish to use in your application. It's pointed to by the PE header. The loader loads the DLL files into the process memory space for you, finds the requested function in their export table and leaves the address for it in the import table. You then usually dereference that and jmp there.
Check out Izelion's assembly tutorial for the full details and for asm examples.
How about starting by emitting C instead of assembly? Then writing directly to ASM is just an optimization.
I'm not being facetious: most compilers turn out some kind of intermediate code before the final native code pass.
I realize you're trying to get away from all the null-delmited rigmarole, but you'll need that for the WinAPI functions anyway.
Re-reading your question: you do realize that you can get the WinAPI function addresses by calling LoadLibrary(), then calling GetProcAddress(), and then setting up the call...right?
If you want to see how to bootstrap this from pure assembly: the old SDKs had ASM sample code, probably the new ones still do. If they don't, the DDK will.
I am trying to save space in my executable and I noticed that several functions are being added into my object files, even though I never call them (the code is from a library).
Is there a way to tell gcc to remove these functions automatically or do I need to remove them manually?
If you are compiling into object files (not executables), then a compiler will never remove any non-static functions, since it's always possible you will link the object file against another object file that will call that function. So your first step should be declaring as many functions as possible static.
Secondly, the only way for a compiler to remove any unused functions would be to statically link your executable. In that case, there is at least the possibility that a program might come along and figure out what functions are used and which ones are not used.
The catch is, I don't believe that gcc actually does this type of cross-module optimization. Your best bet is the -Os flag to optimize for code size, but even then, if you have an object file abc.o which has some unused non-static functions and you link statically against some executable def.exe, I don't believe that gcc will go and strip out the code for the unused functions.
If you truly desperately need this to be done, I think you might have to actually #include the files together so that after the preprocessor pass, it results in a single .c file being compiled. With gcc compiling a single monstrous jumbo source file, you stand the best chance of unused functions being eliminated.
Have you looked into calling gcc with -Os (optimize for size.) I'm not sure if it strips unreached code, but it would be simple enough to test. You could also, after getting your executable back, 'strip' it. I'm sure there's a gcc command-line arg to do the same thing - is it --dead_strip?
In addition to -Os to optimize for size, this link may be of help.
Since I asked this question, GCC 4.5 was released which includes an option to combine all files so it looks like it is just 1 gigantic source file. Using that option, it is possible to easily strip out the unused functions.
More details here
IIRC the linker by default does what you want ins some specific cases. The short of it is that library files contain a bunch of object files and only referenced files are linked in. If you can figure out how to get GCC to emit each function into it's own object file and then build this into a library you should get what you are looking.
I only know of one compiler that can actually do this: here (look at the -lib flag)
I have a static library *.lib created using MSVC on windows. The size of library is say 70KB. Then I have an application which links this library. But now the size of the final executable (*.exe) is 29KB, less than the library. What i want to know is :
Since the library is statically linked, I was thinking it should add directly to the executable size and the final exe size should be more than that? Does windows exe format also do some compression of the binary data?
How is it for linux systems, that is how do sizes of library on linux (*.a/*.la file) relate with size of linux executable (*.out) ?
-AD
A static library on both Windows and Unix is a collection of .obj/.o files. The linker looks at each of these object files and determines if it is needed for the program to link. If it isn't needed, then the object file won't get included in the final executable. This can lead to executables that are smaller then the library.
EDIT: As MSalters points out, on Windows the VC++ compiler now supports generating object files that enable function-level linking, e.g., see here. In fact, edit-and-continue requires this, since the edit-and-continue needs to be able to replace the smallest possible part of the executable.
There is additional bookkeeping information in the .lib file that is not needed for the final executable. This information helps the linker find the code to actually link. Also, debug information may be stored in the .lib file but not in the .exe file (I don't recall where debug info is stored for objs in a lib file, it might be somewhere else).
The static library probably contains several functions which are never used. When the linker links the library with the main executable, it sees that certain functions are never used (and that their addresses are never taken and stored in function pointers), it just throws away the code. It can also do this recursively: if function A() is never called, and A() calls B(), but B() is never otherwise called, it can remove the code for both A() and B(). On Linux, the same thing happens.
A static library has to contain every symbol defined in its source code, because it might get linked into an executable which needs just that specific symbol. But once it is linked into an executable, we know exactly which symbols end up being used, and which ones don't. So the linker can trivially remove unused code, trimming the file size by a lot. Similarly, any duplicate symbols (anything that's defined in both the static library and the executable it's linked into gets merged into a single instance.
Disclaimer: It's been a long time since I dealt with static linking, so take my answer with a grain of salt.
You wrote: I was thinking it should add directly to the executable size and final exe size should be more than that?
Naive linkers work exactly this way - back when I was doing hobby development for CP/M systems (a LONG time ago), this was a real problem.
Modern linkers are smarter, however - they only link in the functions referenced by the original code, or as required.
Additionally to the current answers, the linker is allowed to remove function definitions if they have identical object code - this is intended to help reduce the bloating effects of templated code.
#All: Thanks for the pointers.
#Greg Hewgill - Your answer was a good pointer. Thanks.
The answer i found out was as follows:
1.)During Library building what happens is if the option "Keep Program debug databse" in MSVC (or something alike ) is ON, then library will have this debug info bloating its size.
but when i statically include that library and create a executable, the linker strips all that debug info from the library before geenrating the exe and hence the exe size is less than that of the library.
2.) When i disabled the option "Keep Program debug databse", i got an library whose size was smaller than the final executable, which was what i thought is nromal in most situations.
-AD