I'm trying to write a program in assembly and make the resulting executable as small as possible. Some of what I'm doing requires Windows API calls to functions such as WriteProcessMemory. I've had some success with calling these functions, but after assembling and linking, my program comes out in the range of 14-15 KB (from a source file of less than 1 KB). I was hoping for much, much less than that.
I'm very new to doing low-level things like this, so I don't really know what would need to be done to make the program smaller. I understand that the exe format itself takes up quite a bit of space. Can anything be done to minimize that?
I should mention that I'm using NASM and GCC but I can easily change if that would help.
See Tiny PE for a bunch of tips and tricks you can use to reduce the final size of your executable. Be warned that some of the later techniques in that article are extremely fragile.
The default section alignment for most PE files is 4K, to match the page size of the virtual memory system. If you have a .data, .text and .rsrc (resource) section, that's 12K already. Most of it will be zeros and a waste of space.
There are a few things you can do to minimize this waste. First, reduce the section alignment to 512 bytes (see the sketch below for the switches this takes with nasm/gcc). Second, merge the sections so that you only have a single .text section. This can be a problem, though, on modern machines with the NX bit turned on: the merged section then has to be both writable and executable, which is exactly the situation that security feature exists to discourage (data being written and then executed, as viruses do).
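A sketch of the relevant switches, assuming a MinGW-style GNU toolchain (the exact option names can vary between binutils versions, and with -nostdlib you still have to provide your own entry point):

nasm -f win32 tiny.asm -o tiny.obj
gcc -nostdlib -Wl,--file-alignment,512 -Wl,--section-alignment,512 -o tiny.exe tiny.obj -lkernel32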
There are also a slew of PE compression tools out there (UPX, for example) that will compact your PE and decompress it when it is executed.
I suggest using the DumpBin utility (or GNU's objdump) to determine what takes the most space. It may be resource files, huge global variables or something like that.
FWIW, the smallest programs I can assemble using ML or ML64 are on the order of 3 KB. (That's just saying hello world and exiting.)
Give me a small C program (not C++), and I'll show you how to make a 1 KB .exe from it. The smallest executable size I recommend is 1 KB, because it will fail to run on some versions of Windows if it's not at least this size.
You merely have to play with linker switches to make it happen!
A good linker to do this is polink.
And if you do everything in Assembly, it's even easier. Just go to the MASM32 forum and you'll see plenty of programs like this.
OK, I have a problem: I do not know the correct terms to find what I am looking for on Google, so I hope someone here can help me out.
When developing real-time programs on embedded devices you might have to iterate a few hundred or a few thousand times until you get the desired result. When using e.g. ARM devices, you wear out the internal flash quite quickly this way. So typically you arrange for your program to reside in the RAM of the device, and all is fine. This is done using GCC's ability to split the code into different sections.
Unfortunately, the RAM of most devices is much smaller than the flash. So at some point your program gets too big to fit in RAM with all its variables, etc. (The device is chosen so that the finished code will fit in flash later.)
Classical shared objects do not work, as there is nothing like a dynamic linker in my environment. There is no OS or anything of the sort.
My idea was the following: for the controller it is no problem to execute code from both RAM and flash. When the functions are compiled with the right attributes, it is also no big problem for the compiler to put part of the program in RAM and part in flash.
Once I have some functionality running successfully, I create a library from it and put that in flash. The main development then happens in the 'volatile' part, in RAM, so the flash gets preserved.
The problem here is: I need to make sure that the library always gets linked to the exact same location as long as I do not reflash. So a single function must end up at the same address in flash for every compile cycle. When something in the flash is missing, it must be placed in RAM, or a linking error must be thrown.
I thought about putting together a real library and linking against that. Here I am a bit lost: I need to tell GCC/LD to link against a prelinked file (and to create such a prelinked file in the first place).
It should be possible to put all the library objects together and link them for the flash. Then the addresses could be extracted, and the main program (for use in RAM) could link against them. But how to do these steps?
On the internet there is the term prelink, as well as a matching program for Linux. It is intended to speed up loading times. I do not know whether that program might help me as a side effect; I doubt it, but I do not understand the internals of how it works.
Do you have a good idea how to reach the goal?
You are solving a non-problem. Embedded flash usually has a minimum write endurance of 10,000 cycles. So even if you flash it 20 times a day, it will last a year and a half. An ST Nucleo board is $13, so that's less than 3 cents a day :-). The typical endurance is even higher, at about 100,000 cycles. It will be a long time before you wear them out.
Now if you are using them for dynamic storage, that might be a concern, depending on the usage patterns.
But to answer your question: you can build your code into a library .a file easily enough. However, the toolchain does not guarantee that it links the object code in any particular order, and the result also depends on the optimization level. Furthermore, only the object files that are actually referenced get pulled in from a library, so if your function calls change, it may pull in more or fewer library functions.
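As a sketch of the attribute half of this (GCC's section attribute is real, but the section name and the linker-script line are assumptions you would adapt): functions destined for flash get their own section, and your linker script pins that output section to a fixed flash address so it cannot move between compile cycles.

/* flashlib.c - functions that should stay at fixed addresses in flash.
 * The section name ".flashlib" is made up; a linker script must map it to
 * a fixed address, e.g.:
 *   .flashlib 0x08010000 : { KEEP(*(.flashlib)) } > FLASH
 */
__attribute__((section(".flashlib"), noinline, used))
int lib_add(int a, int b)
{
    return a + b;
}

The RAM-resident code under development calls lib_add normally; as long as the .flashlib output section keeps its address, the call targets stay put and the flash image does not need to be rewritten.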
I have test.exe, without source code, but with debug information, and optimized for generic Intel.
Do you know of any tool that lets you optimize an existing executable?
e.g.: I want to optimize the exe for a Core 2 Duo, for smaller caches, remove the debug information, etc.
I think disassembling and recompiling with gcc would do it, but has anyone done something like this? Would I gain any performance?
[Edit]
- disassembling and recompiling most likely won't work.
AFAIK, there's nothing you can do to make it run faster.
It'd be worth trying a decompile / recompile. You might get something that still works, and maybe something will be vectorizable. But I think probably not.
To optimise code, a compiler needs to know which behaviour is a requirement for the program to do what's desired, and which behaviour is just an artefact of the specific instructions chosen by the original compiler. The information to know what's important and what isn't is basically lost in the noise of the x86 machine instructions.
Maybe there'd be some peephole optimisations like replacing a sequence of SSE2 instructions with a single SSSE3 one, or something, but probably nothing significant.
You can make the binary smaller with strip, but the debugging info doesn't even get loaded into memory when you run the exe. It's in its own section, so the debugging info isn't mixed in with code/data/needed symbols, and thus doesn't dilute cache density. It only matters when copying the executable around.
I've written a program in C for the ATmega128, and I'd like to know how the changes I make to the code influence the program memory.
To be more specific, let's say the code is similar to this:
d=fun1(a,b);
c=fun2(c,d);
The change I make is to call the same functions more times, e.g.:
d=fun1(a,b);
c=fun2(c,d);
h=fun1(k,l);
n=fun2(p,m);
etc...
I build the solution in Atmel Studio 6.1 and I see the changes in the program memory usage.
Is there any way to foresee, without building the solution, how the changes in the code will affect the program memory?
Thanks!!
Generally speaking, this is next to impossible with C/C++ (meaning the effort does not pay off).
In your simple case (the number of calls increases), you can determine the number of instructions for each call sequence and multiply by the number of calls. For example, if one call including argument setup costs around 10 AVR instructions (roughly 20 bytes), ten extra calls add roughly 200 bytes. This will only be correct if the compiler does not inline the calls and does not apply optimizations at a higher level.
These calculations might also be wrong if you upgrade to a newer GCC version.
So normally you only get exact numbers when you compare two builds (same compiler version, same optimization settings). avr-size and avr-nm give you all the information you need, for example to compare functions by size. You can automate this task (by converting the output into .csv files) and use a spreadsheet or diff to look for changes.
This method normally only pays off if you have to squeeze a program into a smaller device (from 4K of flash into 2K, for example - you already have 128K of flash, which is quite a lot).
The process is frustrating because applying the same design pattern in C with small differences can lead to different sizes; from C/C++ you cannot really predict what is going to happen.
At my previous employer we used a third-party component which was basically just a DLL and a header file. That particular module handled printing in Win32. However, the company that made the component went bankrupt, so I couldn't report a bug I'd found.
So I decided to fix the bug myself and launched the debugger. I was surprised to find anti-debugging code almost everywhere, the usual IsDebuggerPresent, but the thing that caught my attention was this:
; some twiddling with xor
; and data, result in eax
jmp eax
mov eax, 0x310fac09
; rest of code here
At first glance I just stepped over the routine, which was called twice, and then things just went bananas. After a while I realized that the bit-twiddling result was always the same, i.e. the jmp eax always jumped right into the mov eax, 0x310fac09 instruction.
I dissected the bytes and there it was, 0f31, the rdtsc instruction which was used to measure the time spent between some calls in the DLL.
So my question to SO is: What is your favourite anti-debugging trick?
My favorite trick is to write a simple instruction emulator for an obscure microprocessor.
The copy protection and some of the core functionality are then compiled for that microprocessor (GCC is a great help here) and linked into the program as a binary blob.
The idea behind this is that the copy protection does not exist as ordinary x86 code and as such cannot be disassembled. You cannot remove the entire emulator either, because that would remove core functionality from the program.
The only chance to hack the program is to reverse engineer what the microprocessor emulator does.
I used MIPS32 for emulation because it was so easy to emulate (it took just 500 lines of simple C code). To make things even more obscure, I didn't use the raw MIPS32 opcodes; instead, each opcode was XORed with its own address.
The binary of the copy protection looked like garbage data.
Highly recommended! It took more than 6 months before a crack came out (it was for a game project).
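The fetch stage of such an emulator is tiny. A rough sketch of the per-address XOR idea (simplified; not the original project's code):

#include <stdint.h>

/* The obfuscated image stores every 32-bit MIPS word XORed with its own
 * (emulated) address, so the blob never contains raw opcodes at rest. */
static uint32_t fetch(const uint32_t *image, uint32_t pc)
{
    return image[pc / 4] ^ pc;   /* undo the per-address XOR at run time */
}

The main loop is then just fetch, decode, execute over the blob - the roughly 500 lines mentioned above.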
I've been a member of many RCE communities and have had my fair share of hacking & cracking. From my time I've realized that such flimsy tricks are usually volatile and rather futile. Most of the generic anti-debugging tricks are OS specific and not 'portable' at all.
In the aforementioned example, you're presumably using inline assembly and a naked (__declspec(naked)) function, both of which are not supported by MSVC when compiling for the x64 architecture. There are of course still ways to implement the trick, but anybody who has been reversing for long enough will be able to spot and defeat it in a matter of minutes.
So generally I'd suggest against using anti-debugging tricks outside of utilizing the IsDebuggerPresent API for detection. Instead, I'd suggest you code a stub and/or a virtual machine. I coded my own virtual machine and have been improving on it for many years now and I can honestly say that it has been by far the best decision I've made in regards to protecting my code so far.
Spin off a child process that attaches to the parent as a debugger and modifies key variables. Bonus points for keeping the child process resident and using the debugger memory operations as a kind of IPC for certain key operations.
On my system, you can't attach two debuggers to the same process.
The nice thing about this one is that unless they try to tamper with things, nothing breaks.
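A bare-bones sketch of the child's side using the documented Win32 debug API (error handling omitted; the parent PID and the address of the key variable are placeholders you would pass in somehow):

#include <windows.h>

/* Child: attach to the parent as its debugger, so nothing else can attach,
 * then patch a key variable and keep servicing debug events forever. */
void debug_parent(DWORD parent_pid, LPVOID key_var_addr)
{
    if (!DebugActiveProcess(parent_pid))
        return;                          /* someone is already debugging it */

    HANDLE parent = OpenProcess(PROCESS_VM_WRITE | PROCESS_VM_OPERATION,
                                FALSE, parent_pid);
    int unlocked = 1;
    WriteProcessMemory(parent, key_var_addr, &unlocked,
                       sizeof unlocked, NULL);

    DEBUG_EVENT ev;
    while (WaitForDebugEvent(&ev, INFINITE))           /* stay resident */
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
}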
Reference uninitialized memory! (And other black magic/voodoo...)
This is a very cool read:
http://spareclockcycles.org/2012/02/14/stack-necromancy-defeating-debuggers-by-raising-the-dead/
The most modern obfuscation method seems to be the virtual machine.
You basically take some part of your object code and convert it to your own bytecode format, then add a small virtual machine to run this code. The only way to properly debug the protected code is to write an emulator or disassembler for your VM's instruction format. Of course you need to think about performance too: too much bytecode will make your program run slower than native code.
Most old tricks are useless now:
IsDebuggerPresent: very lame and easy to patch
Other debugger/breakpoint detections
Ring0 stuff : users don't like to install drivers, you might actually break something on their system etc.
Other trivial stuff that everybody knows, or that makes your software unstable. Remember that even if it is a crack that makes your program unstable, as long as it still runs, the instability will be blamed on you.
If you really want to code the VM solution yourself (there are good programs for sale), don't use just one instruction format. Make it polymorphic, so that different parts of the code use different formats. That way all your code can't be broken by writing just one emulator/disassembler. For example, the MIPS solution some people offered seems easy to break, because the MIPS instruction format is well documented and analysis tools like IDA can already disassemble the code.
List of instruction formats supported by IDA pro disassembler
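At its core such a VM is just a dispatch loop over a private opcode format. A toy sketch (the opcode numbering is arbitrary; a real protector would also encrypt the bytecode and vary the encoding per build):

#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };    /* private, per-build opcode values */

int32_t vm_run(const uint8_t *code)
{
    int32_t stack[64];
    int sp = 0;

    for (int pc = 0; ; ) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = (int8_t)code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        }
    }
}

So the bytecode OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_HALT evaluates to 5, but none of that logic is visible to an x86 disassembler.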
I would prefer that people write software that is solid, reliable and does what it is advertised to do. That they also sell it for a reasonable price with a reasonable license.
I know that I have wasted way too much time dealing with vendors that have complicated licensing schemes that only cause problems for the customers and the vendors. It is always my recommendation to avoid those vendors. Working at a nuclear power plant, we are forced to use certain vendors' products and thus are forced to deal with their licensing schemes. I wish there was a way to get back the time that I have personally wasted dealing with their failed attempts to give us a working licensed product. It seems like a small thing to ask, but it seems to be a difficult thing for people that get too tricky for their own good.
I second the virtual machine suggestion. I implemented a MIPS I simulator that (now) can execute binaries generated with mipsel-elf-gcc. Add to that code/data encryption capabilities (AES or with any other algorithm of your choice), the ability of self-simulation (so you can have nested simulators) and you have a pretty good code obfuscator.
The nice feature of choosing MIPS I is that 1) it's easy to implement, and 2) I can write code in C, debug it on my desktop, and just cross-compile it for MIPS when it's done. No need to debug custom opcodes or manually write code for a custom VM.
My personal favourite was on the Amiga, where there is a coprocessor (the Blitter) that does large data transfers independently of the processor; this chip would be instructed to clear all memory, with the job re-issued from a timer IRQ.
When you attached an Action Replay cartridge, stopping the CPU would mean that the Blitter would continue clearing the memory.
Calculated jumps into the middle of a legitimate-looking instruction that really hides the actual instruction are my favorite. They are pretty easy for humans to detect, but automated tools often mess them up.
Also replacing a return address on the stack makes a good time waster.
Using nop to remove assembly via the debugger is a useful trick. Of course, putting the code back is a lot harder!!!
I was thinking some more about the programming language I am designing, and I was wondering: what are ways I could minimize its compile time?
Your main problem today is I/O. Your CPU is many times faster than main memory and memory is about 1000 times faster than accessing the hard disk.
So unless you do extensive optimizations to the source code, the CPU will spend most of the time waiting for data to be read or written.
Try these rules:
Design your compiler to work in several, independent steps. The goal is to be able to run each step in a different thread so you can utilize multi-core CPUs. It will also help to parallelize the whole compile process (i.e. compile more than one file at the same time)
It will also allow you to load many source files in advance and preprocess them so the actual compile step can work faster.
Try to allow files to be compiled independently. For example, create a "missing symbol pool" for the project (see the sketch after this list). Missing symbols should not cause compile failures as such. When you later find a definition for a missing symbol somewhere, remove it from the pool. When all files have been compiled, check that the pool is empty.
Create a cache with important information. For example: File X uses symbols from file Y. This way, you can skip compiling file Z (which doesn't reference anything in Y) when Y changes. If you want to go one step further, put all symbols which are defined anywhere in a pool. If a file changes in such a way that symbols are added/removed, you will know immediately which files are affected (without even opening them).
Compile in the background. Start a compiler process which checks the project directory for changes and compiles files as soon as the user saves them. This way, you will only have to compile a few files each time instead of everything. In the long run you will compile much more, but for the user, turnaround times will be much shorter (= the time the user has to wait until she can run the compiled result after a change).
Use a "Just in time" compiler (i.e. compile a file when it is used, for example in an import statement). Projects are then distributed in source form and compiled when run for the first time. Python does this. To make this perform, you can precompile the library during the installation of your compiler.
Don't use header files. Keep all information in a single place and generate header files from the source if you have to. Maybe keep the header files just in memory and never save them to disk.
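A toy sketch of the missing-symbol pool idea from the list above (a real compiler would use a hash table; the names here are made up):

#include <string.h>

#define MAX_MISSING 1024

static const char *missing[MAX_MISSING];
static int n_missing;

/* A compilation unit referenced a symbol it could not resolve yet. */
void note_missing(const char *name)
{
    if (n_missing < MAX_MISSING)
        missing[n_missing++] = name;
}

/* A compilation unit defined a symbol: drop every pending reference to it. */
void note_defined(const char *name)
{
    for (int i = 0; i < n_missing; )
        if (strcmp(missing[i], name) == 0)
            missing[i] = missing[--n_missing];   /* swap-remove, recheck slot */
        else
            i++;
}

/* After all files have compiled, an empty pool means the project links. */
int pool_is_empty(void) { return n_missing == 0; }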
what are ways i could minimize its compile time?
No compilation (interpreted language)
Delayed (just in time) compilation
Incremental compilation
Precompiled header files
I've implemented a compiler myself, and ended up having to look at this once people started batch-feeding it hundreds of source files. I was quite surprised by what I found out.
It turns out that the most important thing you can optimize is not your grammar. It's not your lexical analyzer or your parser either. Instead, the most important thing in terms of speed is the code that reads in your source files from disk. I/O's to disk are slow. Really slow. You can pretty much measure your compiler's speed by the number of disk I/Os it performs.
So it turns out that the absolute best thing you can do to speed up a compiler is to read the entire file into memory in one big I/O, do all your lexing, parsing, etc. from RAM, and then write out the result to disk in one big I/O.
I talked with one of the lead maintainers of GNAT (GCC's Ada compiler) about this, and he told me that he actually used to put everything he could onto RAM disks so that even his file I/O was really just RAM reads and writes.
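A minimal sketch of the one-big-read approach in standard C (error handling kept short):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire source file into one heap buffer in a single I/O, so that
 * lexing and parsing never touch the disk again.  Returns a NUL-terminated
 * buffer, or NULL on failure. */
char *slurp(const char *path, long *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);
    long len = ftell(f);
    rewind(f);

    char *buf = malloc(len + 1);
    if (buf && fread(buf, 1, len, f) == (size_t)len)
        buf[len] = '\0';
    else { free(buf); buf = NULL; }

    fclose(f);
    if (buf && len_out) *len_out = len;
    return buf;
}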
In most languages (pretty well everything other than C++), compiling individual compilation units is quite fast.
Binding/linking is often what's slow - the linker has to reference the whole program rather than just a single unit.
C++ suffers because - unless you use the pImpl idiom - it requires the implementation details of every object and all inline functions in order to compile client code.
Java (source to bytecode) suffers because the grammar doesn't differentiate objects and classes - you have to load the Foo class to see if Foo.Bar.Baz is the Baz field of the object referenced by the Bar static field of the Foo class, or a static field of the Foo.Bar class. You can make that change in the source of the Foo class without changing the source of the client code, but you still have to recompile the client code, as the bytecode differentiates between the two forms even though the syntax doesn't. AFAIK Python bytecode doesn't differentiate between the two - modules are true members of their parents.
C++ and C suffer if you include more headers than are required, as the preprocessor has to process each header many times, and the compiler compile them. Minimizing header size and complexity helps, suggesting better modularity would improve compilation time. It's not always possible to cache header compilation, as what definitions are present when the header is preprocessed can alter its semantics, and even syntax.
C suffers if you use the preprocessor a lot, but the actual compilation is fast; much C code uses typedef struct _X* X_ptr to hide implementation better than C++ does - a C header can easily consist of just typedefs and function declarations, giving better encapsulation.
So I'd suggest making your language hide implementation details from client code, and if you are an OO language with both instance members and namespaces, make the syntax for accessing the two unambiguous. Allow true modules, so client code only has to be aware of the interface rather than implementation details. Don't allow preprocessor macros or other variation mechanism to alter the semantics of referenced modules.
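The C pattern mentioned above looks roughly like this (names invented for illustration): the header exposes only an opaque pointer, so client code never sees the struct layout and does not need recompiling when it changes.

/* widget.h - all a client ever sees */
typedef struct _Widget *Widget_ptr;
Widget_ptr widget_create(int size);
void       widget_destroy(Widget_ptr w);

/* widget.c - the only file that knows the layout */
#include <stdlib.h>
struct _Widget { int size; int state; };

Widget_ptr widget_create(int size)
{
    Widget_ptr w = malloc(sizeof *w);
    if (w) { w->size = size; w->state = 0; }
    return w;
}

void widget_destroy(Widget_ptr w) { free(w); }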
Here are some performance tricks that we've learned by measuring compilation speed and what affects it:
Write a two-pass compiler: characters to IR, IR to code. (It's easier to write a three-pass compiler that goes characters -> AST -> IR -> code, but it's not as fast.)
As a corollary, don't have an optimizer; it's hard to write a fast optimizer.
Consider generating bytecode instead of native machine code. The virtual machine for Lua is a good model.
Try a linear-scan register allocator or the simple register allocator that Fraser and Hanson used in lcc.
In a simple compiler, lexical analysis is often the greatest performance bottleneck. If you are writing C or C++ code, use re2c. If you're using another language (which you will find much more pleasant), read the paper about re2c and apply the lessons learned. (A hand-rolled sketch of such a scanner core follows this list.)
Generate code using maximal munch, or possibly iburg.
Surprisingly, the GNU assembler is a bottleneck in many compilers. If you can generate binary directly, do so. Or check out the New Jersey Machine-Code Toolkit.
As noted above, design your language to avoid anything like #include. Either use no interface files or precompile your interface files. This tactic dramatically reduces the burden on the lexer, which as I said is often the biggest bottleneck.
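For reference, a lexer core is just a tight loop over an in-memory buffer; a hand-rolled sketch (re2c would generate a goto-driven equivalent from a declarative spec):

#include <ctype.h>

enum tok { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

/* Scan one token from the buffer between *pp and end, advancing *pp. */
enum tok next_token(const char **pp, const char *end)
{
    const char *p = *pp;
    while (p < end && isspace((unsigned char)*p)) p++;
    if (p == end) { *pp = p; return TOK_EOF; }

    if (isalpha((unsigned char)*p) || *p == '_') {
        while (p < end && (isalnum((unsigned char)*p) || *p == '_')) p++;
        *pp = p; return TOK_IDENT;
    }
    if (isdigit((unsigned char)*p)) {
        while (p < end && isdigit((unsigned char)*p)) p++;
        *pp = p; return TOK_NUMBER;
    }
    *pp = p + 1; return TOK_PUNCT;
}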
Here's a shot:
Use incremental compilation if your toolchain supports it.
(make, visual studio, etc).
For example, in GCC/make, if you have many files to compile, but only make changes in one file, then only that one file is compiled.
Eiffel had the idea of different "frozen" states, and recompiling didn't necessarily mean that the whole class was recompiled.
How much can you break up the compilable modules, and how much do you care to keep track of them?
Make the grammar simple and unambiguous, and therefore quick and easy to parse.
Place strong restrictions on file inclusion.
Allow compilation without full information whenever possible (e.g. predeclaration in C and C++; illustrated below).
One-pass compilation, if possible.
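For instance, in C a forward declaration alone gives the compiler everything it needs to emit a call in a single pass, without ever having seen the callee's body:

/* declaration only - fixes the call's type signature */
int parse_expr(const char *src);

int parse_stmt(const char *src)
{
    return parse_expr(src);  /* compiles before (or without) the definition */
}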
One thing surprisingly missing in the answers so far: make sure you're using a context-free grammar, etc. Have a good hard look at languages designed by Wirth such as Pascal and Modula-2. You don't have to reimplement Pascal, but the grammar design is custom-made for fast compiling. Then see if you can find any old articles about the tricks Anders pulled implementing Turbo Pascal. Hint: table-driven.
It depends on what language/platform you're programming for. For .NET development, minimise the number of projects that you have in your solution.
In the old days you could get dramatic speedups by setting up a RAM drive and compiling there. Don't know if this still holds true, though.
In C++ you could use distributed compilation with tools like Incredibuild
A simple one: make sure the compiler can natively take advantage of multi-core CPUs.
Make sure that everything can be compiled the first time you try to compile it. E.g. ban forward references.
Use a context free grammar so that you can find the correct parse tree without a symbol table.
Make sure that the semantics can be deduced from the syntax so you can construct the correct AST directly rather than by mucking with a parse tree and symbol table.
How serious a compiler is this?
Unless the syntax is pretty convoluted, the parser should be able to run no more than 10-100 times slower than just indexing through the input file characters.
Similarly, code generation should be limited by output formatting.
You shouldn't be hitting any performance issues unless you're doing a big, serious compiler, capable of handling mega-line apps with lots of header files.
Then you need to worry about precompiled headers, optimization passes, and linking.
I haven't seen much work done on minimizing compile time. But some ideas do come to mind:
Keep the grammar simple. Convoluted grammar will increase your compile time.
Try making use of parallelism, either on multi-core CPUs or on the GPU.
Benchmark a modern compiler, see where the bottlenecks are, and see what you can do in your compiler/language to avoid them.
Unless you are writing a highly specialized language, compile time is not really an issue.
Make a build system that doesn't suck!
There are a huge number of programs out there with maybe 3 source files that take under a second to compile, but before you get that far you have to sit through an automake script that takes about 2 minutes checking things like the size of an int. And if you go to compile something else a minute later, it makes you sit through almost exactly the same set of tests.
So unless your compiler is doing awful things to the user like changing the size of its ints or changing basic function implementations between runs, just dump that info out to a file and let them get it in a second instead of 2 minutes.