Exe optimization - gcc

I have test.exe, without source code, but with debug information, and optimized for generic Intel.
Do you know of any tool that lets you optimize an executable?
e.g. I want to optimize the exe for a Core 2 Duo with its smaller caches, remove the debug information, etc.
I think disassembling and recompiling with gcc would do it, but has anyone done something like this? Would I gain any performance?
[Edit]
- disassembling and recompiling most likely won't work.

AFAIK, there's nothing you can do to make it run faster.
It'd be worth trying a decompile / recompile. You might get something that still works, and maybe something will be vectorizable. But I think probably not.
To optimise code, a compiler needs to know which behaviour is a requirement for the program to do what's desired, and which behaviour is just an artefact of the specific instructions chosen by the original compiler. The information to know what's important and what isn't is basically lost in the noise of the x86 machine instructions.
Maybe there'd be some peephole optimisations like replacing a sequence of SSE2 instructions with a single SSSE3 one, or something, but probably nothing significant.
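To make that concrete (my own illustrative sketch in C intrinsics, not something any existing tool does automatically): reversing the 16 bytes of a vector takes a chain of SSE2 shuffles and shifts, while SSSE3's pshufb does it in one instruction:

#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3 */

/* SSE2: reversing 16 bytes needs a chain of shuffles plus shifts. */
static __m128i reverse_bytes_sse2(__m128i v)
{
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3));   /* reverse dwords  */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(2, 3, 0, 1)); /* swap word pairs */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_or_si128(_mm_srli_epi16(v, 8),            /* swap the bytes  */
                        _mm_slli_epi16(v, 8));           /* within each word */
}

/* SSSE3: the same thing is a single pshufb. */
static __m128i reverse_bytes_ssse3(__m128i v)
{
    const __m128i m = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                   8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(v, m);
}

Spotting that kind of substitution in raw machine code is exactly the hard part, though.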
You can make the binary smaller with strip, but the debugging info doesn't even get loaded into memory when you run the exe. It's in its own section, so the debugging info isn't mixed in with code/data/needed symbols, and thus doesn't dilute cache density. It only matters when copying the executable around.

Related

ARM Cortex-M compiler differences

I'm about to develop firmware for Cortex-M cores on STM32 processors using C, and searching the web I've found a lot of different compilers:
Keil, IAR, Linaro, Yagarto and GNU Tools for ARM Embedded Processors.
I was wondering what functional differences there are between these compilers that might influence my choice. For example, as an enthusiast I don't need support or assistance from the vendor, and a limitation on code size is OK for the moment. Also, ease of use is not a main concern, since I like to learn (and for the moment I have both Keil Lite and Eclipse with GNU ARM configured and working).
Is the generated code so different in terms of size/speed between these compilers? Is there any comparison table? (I've found only stale info on the web.)
Benchmarking is an art form in and of itself; it is usually easy to manipulate the results to show whatever you want. I would not expect the compilers to generate the same results except for very small test cases, and even in those small test cases the results are sometimes identical and sometimes vastly different, because your test has exposed an optimization that one compiler knows/uses and the other doesn't.
I used to keep track of such things (compiler performance numbers) with Dhrystone, for example, but with well-known benchmarks (not that Dhrystone means much anymore) you may find that some compilers tune themselves to look good on the benchmark, perhaps at the expense of something else.
There is no right answer and no universal "best"; it is all in the eye of the beholder: you. Pick whichever tool is easier for you to use and which you like better, be it for the GUI, the pretty colors, or whatever, and go from there.
For the applications I have tested, the GNU compiler generally does not produce code as "fast" (which is my benchmark) as the others, but far more people use the free GNU tools, so support is considerably wider thanks to the number of web pages, forums, and examples. GNU won't have a code-size restriction either, but it may require more learning to get up and running...
The Cortex-M parts are split into the ARMv6-M and ARMv7-M families. The v6-M cores (Cortex-M0) have only a small number of Thumb-2 extensions, while ARMv7-M adds roughly 150 Thumb-2 instructions to Thumb, so you need to know what your tools support and must not use the wrong instructions on the wrong chip. Compilers that know all of this can and will produce different instruction mixes from the same source code, and within the same compiler or family, different command-line options can and will produce vastly different code.
Beyond that, on a Cortex-M4 with a cache (if you have one with such a thing), performance can vary widely depending on how the code falls across cache lines, so benchmarking is a research project in itself for each blob of C code you want to benchmark. The performance range within a single compiler may overshadow another compiler, or the overlap may be large enough not to matter.
If you have access to the tools, you add professional value by learning to use the competing tools: you can walk into a job (or, within your job, choose what you see as the right tool), or walk into a Keil shop and be able to work right away, or a GNU shop and work right away. You might lose a job if you are GNU-only and the job is in a Keil shop.
We have done some comparisons; IAR and Keil typically outperform GCC with default settings. But with some compiler flags you can make GCC come pretty close to the result of IAR and Keil.
Some of the compilers you mention are integrated development environments. Others are just plain compilers.
Some people prefer an integrated environment with compiler, editor, and debugger nicely packaged for them. Others prefer to set up their own environment. It is a matter of taste.
In addition to Yagarto, there is also the "Code Sourcery" distribution of GCC for ARM.
Performance should not be your first concern until it becomes one in a production environment. The reason is that, first, most ARM compilers are plenty good enough, and really you are down to GCC-based, Keil, and IAR. Second, most ARM MCUs are "blazingly fast" and have "so much memory" (compared to 8-bit MCUs like AVR/PIC, but also to older PCs). A decent Cortex-M4 MCU runs at up to 100 MHz and has 256K of flash. To put that in perspective, that's more memory and a 10x faster clock rate than the original Macintosh. We went to the Moon with much less ;-)
The performance of the tools themselves, in particular the IDEs and the debuggers, does differ greatly. For example, the popular Eclipse is written in Java and might be a bit sluggish on slower or memory-starved PCs. The best thing to do is to install GCC+Eclipse and the vendors' demos, and see for yourself.

Can Visual Studio tell me the SSE2 register spill count of compiled code?

I do not have any real compiler knowledge, and I used to hand-code SSE2 functions for selected pieces of code. I know how to read the generated machine code, but I am largely unaware of the crazy optimizations compilers can make. All of my work is done in Visual Studio.
Is there a way for Visual Studio to tell me the SSE2 register spill count of a compiled function? The reason is that we will soon be mass-producing SSE2-like (templated) code, and we would like each instance to be compiled into decent-quality machine code. We can't realistically check each one manually. What I hope to get is some sort of guarantee that the compiled code is acceptable and concise. I don't need to squeeze out the last bit of juice.
Alternatively, is there a keyword that works like __forceinline but forces the compiler not to spill any SSE2 registers, something like "__forcenospill"? (If a spill had to happen, the compilation would fail, and therefore I would be aware of the problem and could try to refactor my SSE2 code.)
Using an existing vector library or blitter is out of the question, because some of the calculations need to be highly registerized (6 or more operands in one step in a "simple operation" (Note #1); intermediate values promoted to 16-bit or 32-bit on the fly and converted back, etc.). Rephrasing it with a generic vector library would mean doubling or tripling the runtime (been there, done that).
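To give a flavour of the pattern I mean, here is a simplified sketch (not the real code; the operation is made up for illustration):

#include <emmintrin.h>  /* SSE2 */

/* Illustration only: average two rows of 16 8-bit pixels with exact
   16-bit intermediates, i.e. "promote on the fly, convert back".
   The real adaptive-thresholding step has even more live operands. */
static __m128i avg16_exact(__m128i a, __m128i b)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i one  = _mm_set1_epi16(1);

    __m128i alo = _mm_unpacklo_epi8(a, zero);  /* promote to 16-bit */
    __m128i ahi = _mm_unpackhi_epi8(a, zero);
    __m128i blo = _mm_unpacklo_epi8(b, zero);
    __m128i bhi = _mm_unpackhi_epi8(b, zero);

    __m128i lo = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(alo, blo), one), 1);
    __m128i hi = _mm_srli_epi16(_mm_add_epi16(_mm_add_epi16(ahi, bhi), one), 1);

    /* Eight or so vector values are live at this point; a few more
       operands and the compiler may start spilling to the stack. */
    return _mm_packus_epi16(lo, hi);  /* convert back, saturating */
}

As far as I know, the closest Visual Studio itself gets to a "spill count" is compiling with /FA and counting the stores to stack slots in the assembly listing for functions like this.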
Commercial tools are okay too, I can certainly afford it given the project's nature.
If there is no such tool, I will resort to profiling. You may downvote this post to let me know that such things don't exist.
Thanks!
(Note #1) It's an adaptive thresholding algorithm.

Questions about possible Java (or other memory-managed language) optimizations

From what I have read, javac (usually) seems to compile Java to not-very-optimised (if at all?) bytecode, leaving it to the JIT to optimise. Is this true? And if it is, has there been any exploration (possibly in alternative implementations) of making the compiler optimise the code so the JIT has less work to do (is this possible)?
Also, many people seem to dislike native code generation (sometimes called ahead-of-time compilation) for Java (and many other high-level memory-managed languages), for reasons such as loss of portability, etc., but also partly because (at least for languages with a just-in-time compiler) ahead-of-time compilation to machine code is thought to miss the optimisations a JIT compiler can make, and therefore may be slower in the long run.
This leads me to wonder whether anyone has ever tried to implement http://en.wikipedia.org/wiki/Profile-guided_optimization (compiling to a binary + some extras, then running the program and analysing the runtime information of the test run to generate a hopefully more optimised binary for real-world usage) for Java/other memory-managed languages, and how this would compare to JIT code? Anyone have a clue?
Personally, I think the big difference is not between JIT compiling and AOT compiling, but between class-compilation and whole-program optimization.
When you run javac, it only looks at a single .java file, compiling it into a single .class file. All the interface implementations and virtual methods and overrides are checked for validity but left unresolved (because it's impossible to know the true method invocation targets without analyzing the whole program).
The JVM uses "runtime loading and linking" to assemble all of your classes into a coherent program (and any class in your program can invoke specialized behavior to change the default loading/linking behavior).
But then, at runtime, the JVM can devirtualize the vast majority of virtual calls. It can inline all of your getters and setters, turning them into raw field accesses. And once those raw fields are inlined, it can perform constant propagation to further optimize the code. (At runtime, there's no such thing as a private field.) And if there's only one thread running, the JVM can eliminate all synchronization primitives.
To make a long story short, there are a lot of optimizations that aren't possible without analyzing the whole program, and the best time for doing whole program analysis is at runtime.
Profile-guided optimization has some caveats, one of them mentioned even in the Wiki article you linked. Its results are valid only:
- for the given samples, representing how your code is actually used by the user or other code.
- for the given platform (CPU, memory + other hardware, OS, whatever). From a performance point of view there are quite big differences even among platforms that are usually considered (more or less) the same (e.g. compare a single-core old Athlon with 512M to a 6-core Intel with 8G, both running Linux but with very different kernel versions).
- for the given JVM and its config.
If any of these change, then your profiling results (and the optimizations based on them) are not necessarily valid any more. Most likely some of the optimizations will still have a beneficial effect, but some may turn out to be suboptimal (or even degrade performance).
As was mentioned, JIT JVMs do something very similar to profiling, but they do it on the fly. This is also called 'hotspot', because the VM constantly monitors the executing code, looks for hot spots that are executed frequently, and tries to optimize only those parts. At that point it can exploit more knowledge about the code (its context, how it is used by other classes, etc.), so, as mentioned by you and in the other answers, it can do better optimizations than a static compiler. It keeps monitoring, and if needed it will do another round of optimization later, this time trying even harder (looking for more, and more expensive, optimizations).
Working on real-life data (usage statistics + platform + config), it can avoid the caveats mentioned before.
The price is the additional time spent on "profiling" + JIT-ing. Most of the time it's spent quite well.
I guess a profile-guided optimizer could still compete with it (or even beat it), but only in some special cases, if you can avoid the caveats:
- you are quite sure that your samples represent the real-life scenario well and they won't change too much during execution.
- you know your target platform quite precisely and can do the profiling on it.
- and of course you know/control the JVM and its config.
That will happen rarely, and I guess in general the JIT will give you better results, but I have no evidence for it.
Another possibility for getting value from profile-guided optimization is if you target a JVM that can't do JIT optimization (I think most small devices have such a JVM).
BTW, one disadvantage mentioned in other answers would be quite easy to avoid: if static/profile-guided optimization is slow (which is probably the case), then do it only for releases (or RCs going to testers) or during nightly builds (where the time does not matter so much).
I think the much bigger problem would be to have good sample test cases. Creating and maintaining them is usually not easy and takes a lot of time. Especially if you want to be able to execute them automatically, which would be quite essential in this case.
The official Java HotSpot compiler does "adaptive optimisation" at runtime, which is essentially the same as the profile-guided optimisation you mention. This has been a feature of at least this particular Java implementation for a long time.
The trade-off to performing more static analysis or optimisation passes up-front at compile time is essentially the (ever-diminishing) returns you get from this extra effort against the time it takes for the compiler to run. A compiler like MLton (for Standard ML) is a whole-program optimising compiler with a lot of static checks. It produces very good code, but becomes very, very slow on medium-to-large programs, even on a fast system.
So the Java approach seems to be to use JIT and adaptive optimisation as much as possible, with the initial compilation pass producing just an acceptable valid binary. At the absolute opposite end is an approach like that of MLKit, which does a lot of static inference of regions and memory behaviour.

Windows API calls from assembly while minimizing program size

I'm trying to write a program in assembly and make the resulting executable as small as possible. Some of what I'm doing requires Windows API calls to functions such as WriteProcessMemory. I've had some success calling these functions, but after assembling and linking, my program comes out in the range of 14-15 KB (from a source of less than 1 KB). I was hoping for much, much less than that.
I'm very new to doing low level things like this so I don't really know what would need to be done to make the program smaller. I understand that the exe format itself takes up quite a bit of space. Can anything be done to minimize that?
I should mention that I'm using NASM and GCC but I can easily change if that would help.
See Tiny PE for a bunch of tips and tricks you can use to reduce the final size of your executable. Be warned that some of the later techniques in that article are extremely fragile.
The default section alignment for most PE files is 4K to align with the natural system memory layout. If you have a .data, .text and .resource section - that's 12K already. Most of it will be 0's and a waste of space.
There are a few things you can do to minimize this waste. First, reduce the section alignment to 512 bytes (I don't know the options needed for nasm/gcc). Second, merge the sections so that you only have a single .text section. This can be a problem on modern machines with the NX bit turned on, though: a merged section has to be both writable and executable, which is exactly what that security feature is designed to forbid (to stop things like viruses from running modified code).
There are also a slew of PE compression tools out there that will compact your PE and decompress it when executed.
I suggest using the DumpBin utility (or GNU's objdump) to determine what takes the most space. It may be resource files, huge global variables or something like that.
FWIW, the smallest programs I can assemble using ML or ML64 are on the order of 3 KB. (That's just saying hello world and exiting.)
Give me a small C program (not C++), and I'll show you how to make a 1 KB .exe with it. 1 KB is the smallest executable size I recommend, because an exe smaller than that will fail to run on some versions of Windows.
You merely have to play with linker switches to make it happen!
A good linker to do this is polink.
And if you do everything in Assembly, it's even easier. Just go to the MASM32 forum and you'll see plenty of programs like this.
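To sketch the idea (the build line below is MSVC-style and purely illustrative; polink accepts similar switches, but check its documentation): drop the C runtime and provide your own entry point, and the linker has almost nothing left to pull in:

/* tiny.c - a CRT-less program that links into a very small PE.
   Illustrative MSVC-style build line:
     cl /O1 /GS- tiny.c /link /nodefaultlib kernel32.lib
        /entry:start /subsystem:console /align:16
   As noted above, alignments this small can produce an exe that
   some versions of Windows refuse to run. */
#include <windows.h>

void __stdcall start(void)
{
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written;
    WriteFile(out, "hello\r\n", 7, &written, NULL);
    ExitProcess(0);
}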

What is your favourite anti-debugging trick?

At my previous employer we used a third-party component which was basically just a DLL and a header file. That particular module handled printing in Win32. However, the company that made the component went bankrupt, so I couldn't report a bug I'd found.
So I decided to fix the bug myself and launched the debugger. I was surprised to find anti-debugging code almost everywhere, the usual IsDebuggerPresent, but the thing that caught my attention was this:
; some twiddling with xor
; and data, result in eax
jmp eax
mov eax, 0x310fac09
; rest of code here
At first glance I just stepped over the routine, which was called twice, and then things just went bananas. After a while I realized that the bit-twiddling result was always the same, i.e. the jmp eax always jumped right into the mov eax, 0x310fac09 instruction.
I dissected the bytes and there it was, 0f 31, the rdtsc instruction, which was used to measure the time spent between some calls in the DLL.
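In C, the same timing trick looks roughly like this (a minimal sketch; the threshold is a number I made up, and real code hides the check far better than an obvious function):

#include <intrin.h>  /* __rdtsc on MSVC; GCC has it in <x86intrin.h> */

/* If the time between the two reads is too long, someone is probably
   single-stepping through the protected code. */
static int being_traced(void)
{
    unsigned long long t0 = __rdtsc();
    /* ... the code being protected runs here ... */
    unsigned long long t1 = __rdtsc();
    return (t1 - t0) > 1000000;  /* arbitrary cycle threshold */
}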
So my question to SO is: What is your favourite anti-debugging trick?
My favorite trick is to write a simple instruction emulator for an obscure microprocessor.
The copy protection and some of the core functionality are then compiled for that microprocessor (GCC is a great help here) and linked into the program as a binary blob.
The idea behind this is that the copy protection does not exist as ordinary x86 code and as such cannot be disassembled. You cannot remove the entire emulator either, because that would remove core functionality from the program.
The only chance to hack the program is to reverse engineer what the microprocessor emulator does.
I used MIPS32 for emulation because it was so easy to emulate (it took just 500 lines of simple C code). To make things even more obscure, I didn't use the raw MIPS32 opcodes; instead, each opcode was xor'ed with its own address.
The binary of the copy protection looked like garbage data.
Highly recommended! It took more than six months before a crack came out (it was for a game project).
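A stripped-down sketch of the decode step (a toy instruction set of my own for illustration, not the MIPS32 emulator described above):

#include <stdint.h>

/* Toy VM loop: each 32-bit opcode is stored xor'ed with its own
   (emulated) byte address, so the blob looks like garbage until
   the moment it is fetched. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

static uint32_t run_vm(const uint32_t *code)
{
    uint32_t reg[4] = {0};
    uint32_t pc = 0;

    for (;;) {
        uint32_t insn = code[pc] ^ (pc * 4);  /* decode with its address */
        uint32_t op = insn >> 24;
        uint32_t rd = (insn >> 16) & 3;

        switch (op) {
        case OP_LOADI: reg[rd] = insn & 0xffff;  break;
        case OP_ADD:   reg[rd] += reg[insn & 3]; break;
        default:       return reg[0];            /* OP_HALT */
        }
        pc++;
    }
}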
I've been a member of many RCE communities and have had my fair share of hacking and cracking. From that time I've realized that such flimsy tricks are usually volatile and rather futile. Most of the generic anti-debugging tricks are OS-specific and not 'portable' at all.
In the aforementioned example, you're presumably using inline assembly and a __declspec(naked) function, neither of which MSVC supports when compiling for the x64 architecture. There are of course still ways to implement the aforementioned trick, but anybody who has been reversing for long enough will be able to spot and defeat it in a matter of minutes.
So generally I'd suggest against using anti-debugging tricks outside of utilizing the IsDebuggerPresent API for detection. Instead, I'd suggest you code a stub and/or a virtual machine. I coded my own virtual machine and have been improving on it for many years now and I can honestly say that it has been by far the best decision I've made in regards to protecting my code so far.
Spin off a child process that attaches to the parent as a debugger and modifies key variables. Bonus points for keeping the child process resident and using the debugger memory operations as a kind of IPC for certain key operations.
On my system, you can't attach two debuggers to the same process.
The nice thing about this one is that unless they try to tamper with things, nothing breaks.
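A bare-bones sketch of the attach side (Win32; error handling and the actual variable patching are omitted, and the parent PID is assumed to arrive on the command line):

#include <windows.h>
#include <stdlib.h>

/* The child: attach to the parent as its one allowed debugger, then
   sit in the debug-event loop. Key variables in the parent could be
   patched from here with WriteProcessMemory, making the debugger
   link double as an IPC channel. */
int main(int argc, char **argv)
{
    DWORD parent;
    DEBUG_EVENT ev;

    if (argc < 2)
        return 1;
    parent = (DWORD)atoi(argv[1]);  /* PID passed in by the parent */

    if (!DebugActiveProcess(parent))
        return 1;  /* someone else is already attached: suspicious */

    while (WaitForDebugEvent(&ev, INFINITE)) {
        ContinueDebugEvent(ev.dwProcessId, ev.dwThreadId, DBG_CONTINUE);
        if (ev.dwDebugEventCode == EXIT_PROCESS_DEBUG_EVENT)
            break;
    }
    return 0;
}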
Reference uninitialized memory! (And other black magic/voodoo...)
This is a very cool read:
http://spareclockcycles.org/2012/02/14/stack-necromancy-defeating-debuggers-by-raising-the-dead/
The most modern obfuscation method seems to be the virtual machine.
You basically take some part of your object code and convert it to your own bytecode format, then add a small virtual machine to run it. The only way to properly debug the protected code is to write an emulator or disassembler for your VM's instruction format. Of course you need to think about performance too: too much bytecode will make your program run slower than native code.
Most old tricks are useless now:
- IsDebuggerPresent: very lame and easy to patch (see the sketch after this answer)
- other debugger/breakpoint detections
- ring 0 stuff: users don't like to install drivers, and you might actually break something on their system, etc.
- other trivial stuff that everybody knows, or that makes your software unstable. Remember that even if a crack makes your program unstable while it still works, that instability will be blamed on you.
If you really want to code the VM solution yourself (there are good programs for sale), don't use just one instruction format. Make it polymorphic, so that different parts of the code have different formats. That way all your code can't be broken by writing just one emulator/disassembler. For example, the MIPS solution some people offered seems easy to break, because the MIPS instruction format is well documented and analysis tools like IDA can already disassemble the code.
List of instruction formats supported by the IDA Pro disassembler
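For completeness, the "very lame" first check from the list above is literally a single call, which is exactly why it is so easy to patch out:

#include <windows.h>

int main(void)
{
    /* One API call, one conditional branch; a cracker inverts or
       nops the branch once and the "protection" is gone. */
    if (IsDebuggerPresent())
        ExitProcess(1);

    /* ... the protected program ... */
    return 0;
}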
I would prefer that people write software that is solid, reliable and does what it is advertised to do. That they also sell it for a reasonable price with a reasonable license.
I know I have wasted way too much time dealing with vendors that have complicated licensing schemes that only cause problems for the customers and the vendors. It is always my recommendation to avoid those vendors. Working at a nuclear power plant, we are forced to use certain vendors' products and thus have to deal with their licensing schemes. I wish there was a way to get back the time I have personally wasted dealing with their failed attempts to give us a working licensed product. It seems like a small thing to ask, but it seems to be a difficult thing for people who get too tricky for their own good.
I second the virtual machine suggestion. I implemented a MIPS I simulator that can (now) execute binaries generated with mipsel-elf-gcc. Add to that code/data encryption capabilities (AES or any other algorithm of your choice) and the ability of self-simulation (so you can have nested simulators), and you have a pretty good code obfuscator.
The nice thing about choosing MIPS I is that 1) it's easy to implement, and 2) I can write code in C, debug it on my desktop, and just cross-compile it for MIPS when it's done. No need to debug custom opcodes or manually write code for a custom VM...
My personal favourite was on the Amiga, which has a coprocessor (the Blitter) that does large data transfers independently of the processor; this chip would be instructed to clear all memory, and restarted from a timer IRQ.
When you attached an Action Replay cartridge, stopping the CPU would mean that the Blitter would continue clearing the memory.
Calculated jumps into the middle of instructions that look legitimate but really hide the actual instruction are my favorite. They are pretty easy for humans to detect anyway, but automated tools often mess them up.
Also replacing a return address on the stack makes a good time waster.
Using nop to remove assembly via the debugger is a useful trick. Of course, putting the code back is a lot harder!!!
