For fun, I'm working on a compiler for a small language, and I'm targeting the ARM instruction set first because it is comparatively simple. Currently, I'm able to compile the code, so I have ARM machine code for the body of each method. At this point I need to start tying a few things together:
What format should I persist my machine code to so I can...
Run it in what debugger?
Currently there's no I/O support, etc., so debugging will be heavily keyed to my ability to step through the disassembly and view processor registers/memory.
I'm running Windows and my compiler only runs in Windows, so having some sort of emulator on Windows would be preferable.
Edit: It appears I can use the Visual Studio Windows Mobile 6 emulator. For now, I might be able to simply save the results in a simple binary format and load it into emulator memory via a tiny C++ console application, then jump into it with a function pointer. Later, it appears I would need to support the ELF and PE formats.
Regarding file formats... the simplest would be:
Motorola S-record
Intel hex file
Those formats can record the binary data and the target address range(s) for the data to be loaded. That's about it.
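To make the record layout concrete, here is a minimal Python sketch of an Intel HEX writer. The function names (`ihex_record`, `to_intel_hex`) are only illustrative, and the sketch covers 16-bit addresses only (record types 00 and 01), so treat it as a starting point rather than a complete implementation.

```python
def ihex_record(address: int, rectype: int, data: bytes = b"") -> str:
    """Encode one record: ':' + byte count + 16-bit address + type + data + checksum."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, rectype]) + data
    checksum = (-sum(body)) & 0xFF  # two's complement of the sum of all record bytes
    return ":" + (body + bytes([checksum])).hex().upper()


def to_intel_hex(code: bytes, load_address: int = 0, chunk: int = 16) -> str:
    """Emit data records (type 00) followed by the end-of-file record (type 01).

    Larger address spaces need the extended address record types,
    which this sketch deliberately leaves out.
    """
    if load_address + len(code) > 0x10000:
        raise ValueError("this sketch handles 16-bit addresses only")
    lines = [
        ihex_record(load_address + off, 0x00, code[off:off + chunk])
        for off in range(0, len(code), chunk)
    ]
    lines.append(ihex_record(0, 0x01))  # end-of-file marker
    return "\n".join(lines) + "\n"
```

Calling something like `to_intel_hex(machine_code, load_address=0x8000)` (the address is an arbitrary example) gives text you can write straight to a .hex file.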
A more capable format to contain more information:
ELF
for maximum information, include DWARF debug information
ELF is fairly widely supported, and not too complex. DWARF allows you to record very expressive debug information for debugging of complex language constructs. However, to achieve that expressiveness, it can be a very complex format to write.
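As a rough illustration of how little is needed for a bare-bones ELF file, the Python sketch below wraps a blob of ARM code in a 32-bit little-endian ELF header plus a single loadable program header. The function name and the 0x8000 load address are assumptions for the example, the entry point is simply set to the load address, and no section headers, symbols, or DWARF data are written, so some tools may want more than this before they will load it.

```python
import struct

EHDR = struct.Struct("<16sHHIIIIIHHHHHH")  # Elf32_Ehdr, 52 bytes
PHDR = struct.Struct("<IIIIIIII")          # Elf32_Phdr, 32 bytes


def minimal_arm_elf(code: bytes, load_address: int = 0x8000) -> bytes:
    """Wrap raw machine code in an ELF32 image with one PT_LOAD segment."""
    # e_ident: magic, ELFCLASS32 (1), ELFDATA2LSB (1), EV_CURRENT (1), padding
    ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
    code_offset = EHDR.size + PHDR.size
    ehdr = EHDR.pack(
        ident,
        2,             # e_type = ET_EXEC
        40,            # e_machine = EM_ARM
        1,             # e_version = EV_CURRENT
        load_address,  # e_entry (assumed: execution starts at the first byte)
        EHDR.size,     # e_phoff: program header follows the ELF header
        0,             # e_shoff: no section header table
        0,             # e_flags: left at 0 in this sketch
        EHDR.size,     # e_ehsize
        PHDR.size,     # e_phentsize
        1,             # e_phnum
        40,            # e_shentsize = sizeof(Elf32_Shdr), though none are emitted
        0,             # e_shnum
        0,             # e_shstrndx
    )
    phdr = PHDR.pack(
        1,             # p_type = PT_LOAD
        code_offset,   # p_offset: where the code sits in the file
        load_address,  # p_vaddr
        load_address,  # p_paddr
        len(code),     # p_filesz
        len(code),     # p_memsz
        5,             # p_flags = PF_R | PF_X
        4,             # p_align
    )
    return ehdr + phdr + code
```

Adding section headers and a symbol table is the natural next step once this loads; DWARF can come much later.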
I am developing an algorithm that uses ARM Neon instructions. I am writing the code in assembler files (.S, no inline asm).
My question is: what is the best way to debug this, i.e. to view registers, memory, etc.?
Currently, I am using the Android NDK to compile and my Android phone to run the algorithm.
Poor man's debug solutions...
You can use gdb / gdbserver to remotely control execution of applications on an Android phone. I'm not giving full details here because they change all the time, but for example you can start with this answer or do a quick search on the Internet. GDB might seem to have a steep learning curve, but the material on the web is extensive. You can easily find something to your taste.
Single-stepping an ARM core via software tools is hard; that's why the ARM ecosystem is full of expensive tools and extra hardware.
The trick I use is to insert BRK instructions manually in the assembly code. BRK is the self-hosted debug breakpoint instruction on AArch64 (the 32-bit ARM and Thumb equivalent is BKPT). When the core sees this instruction it stops and informs the OS. The OS then notifies the debugger and passes control to it. Once the debugger has control you can check the contents of the registers and probably even change them. The last part of the operation is to make your process continue. Since the PC is still at the breakpoint instruction, you must advance the PC to the instruction after the BRK.
Since you mentioned you use .S files instead of .s files, you can use gcc to do the preprocessing / macro work. That way, enabling and disabling the BRK instructions becomes less of an issue.
The big downside of this way of working is turnaround time. If there is a certain point that you want to investigate with gdb, you must make sure there is a BRK instruction there, and this will probably require another build/push/debug cycle.
I have some crucial data written decades ago by an ancient 16-bit DOS application.
There are no docs, no source, and no information about the author. Just the 16-bit exe.
I guess it's time for me to learn how to decompile stuff, since it seems to be the only way to recover the file format.
I've tried OllyDbg; it looks really great, but it can't handle 16-bit code.
So, is there a disassembler/debugger capable of working with such executables?
Thanks.
UPD: I know DOSbox, the app runs in it all right. The problem is, I don't need to run it, I need to understand the file format in which it writes data. Or maybe I don't know something about DOSbox and it can run as a debugger/decompiler as well? Or do you mean starting some old 16bit DOS debugger/decompiler in DOSbox? The latter sounds like an idea, but could you please name a decent DOS debugger, then?
disassembling tool:
use IDA Freeware https://www.hex-rays.com/products/ida/support/download_freeware.shtml
you won't find any better tool for reversing - even for old dos programs :)
most other tools can only disassemble 32-bit code and come nowhere near the analysis features of IDA - it's the gold standard tool of reverse engineering
debugger:
dosbox has its own built-in debugger (reachable through the "debug" command on the command line)
but you need to build your own version of dosbox with the debugger activated (or better, heavy-debug), see: http://www.vogons.org/viewtopic.php?t=3944
or if you have an ida licence with the sdk there is a dosbox<->ida-debugger plugin available
currently linux only https://github.com/wjp/idados
file format:
do you know what the file contains (what do you want from the file)?
very complex information or "just" some lists of values?
maybe it's better to start here with a hex editor (http://mh-nexus.de/de/hxd/) and known result values to compare
what program uses the data currently (or only the program itself)? maybe its possible to understand how the data is read in this program?
program itself:
how large is the exe?
console program or a big super gfx power app?
real 16bit or 32bit with dos-extender?
a single exe or overlays(dos-dlls)?
can you give access to the executable?
your turn
You're looking for IDA. It's the de-facto disassembler for pretty much anything.
You can get more help on this at https://reverseengineering.stackexchange.com/
You do not necessarily need to disassemble a program in order to figure out the format in which it writes data.
Perhaps you can do differential analysis on it. Change some inputs to the program, have it write the data, and watch how the file changes.
I have some vintage hardware devices here which can dump their NVRAM settings over MIDI in a binary format (in one case, a single SysEx message with a binary blob in it). If I wanted to know what the format is, I'd make small, systematic changes to the settings, and perform dumps, then see what bits in the binary data are changing.
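A differential pass like this is easy to script. The Python sketch below is only one way to do it, with a made-up function name (`changed_ranges`): it takes two dumps as bytes and lists the offset ranges where they differ, so you can line the changes up against the one setting you altered between runs.

```python
def changed_ranges(old: bytes, new: bytes) -> list:
    """Return (start, end) offset pairs, end exclusive, where the dumps differ.

    A difference in length is reported as one extra trailing range.
    """
    ranges = []
    start = None
    common = min(len(old), len(new))
    for i in range(common):
        if old[i] != new[i]:
            if start is None:
                start = i          # a run of differing bytes begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(old) != len(new):
        ranges.append((common, max(len(old), len(new))))
    return ranges
```

Read both files with `open(path, "rb").read()`, change exactly one input in the program between the two saves, and the reported ranges tell you where that value probably lives.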
You really are probably best off attacking the data, rather than the program.
Dosbox is probably a thing to try.
You might also look at http://hte.sourceforge.net/ .
I understand that each CPU/architecture has its own instruction set, and therefore a program (binary) written for a specific CPU cannot run on another. But what I don't really understand is why an executable file (a binary like an .exe, for instance) cannot run on Linux but can run on Windows, even on the very same machine.
This is a basic question, and the answer I'm expecting is that .exe and other binary formats are probably not raw machine instructions but contain some data that is operating system dependent. If this is true, then what is this OS-dependent data like? And, as an example, what is the format of an .exe file, and how does it differ from Linux executables?
Is there a source where I can get brief and detailed information about this?
In order to do something meaningful, applications will need to interface with the OS. Since system calls and user-space infrastructure look fundamentally different on Windows and Unix/Linux, having different formats for executable programs is the smallest trouble. It's the program logic that would need to be changed.
(You might argue that this is meaningless if you have a program that solely depends on standardized components, for example the C runtime library. This is theoretically true - but irrelevant for most applications since they are forced to use OS-dependent stuff).
The other differences between Windows PE (EXE, DLL, ...) files and Linux ELF binaries are related to the different image loaders and some design characteristics of both OSs. For example, on Linux a separate program is used to resolve external library imports, while this functionality is built in on Windows. Another example: Linux shared libraries function differently than DLLs on Windows. Not to mention that both formats are optimized to enable the respective OS kernels to load programs as quickly as possible.
Compatibility layers like Wine try to fill the gap (and actually prove that the biggest problem is not the binary format but rather the OS interface!).
.exe and other binary formats are [definitely] not Raw machine instructions but they contain some data that is operating system dependent.
what this OS dependent data is like? and as an example what is the format of an .exe file and the difference between it and Linux executables?
Well, I guess Google failed you utterly. .EXE formats are very well-defined by Windows documentation.
http://support.microsoft.com/kb/65122
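On Linux, the kernel maps an executable into memory on "exec", and the dynamic loader (ld.so) then resolves its shared libraries; ld itself is the link editor that produces the file. You could read up on ld and the formats it emits, or even the famous a.out format.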
http://linux.die.net/man/1/ld
http://en.wikipedia.org/wiki/A.out
http://en.wikipedia.org/wiki/Executable
Apart from the executable format that must be recognized by the system loader (i.e. that part of an OS that brings the executable into memory) the real problem is the interface to the OS. You can think of an OS as a kind of API that provides entry points one must call for doing specific things, like for example, writing a character to the console.
These details are usually more or less hidden from the end user, so that you can write a character to the screen with the same source code in higher-level languages. But often the differences run deeper, as with the windowing environment, for example. Not all high-level languages provide a windowing layer that abstracts over those differences too.
I can't comment too much on *nix, but yes, the code part of the binary is typically happy to run in either environment; it is the OS that places certain demands on the binary. On Windows you should read up on PE headers.
The second part is simply up to the developer: often the code will reference libraries that are OS-specific, which is why C++ source code can be either portable or non-portable before it is even compiled into a binary.
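If you want to see those headers for yourself, the Python sketch below peeks at the first bytes of a file and reports whether it looks like ELF or PE, and which CPU it was built for. The function name and the two lookup tables are illustrative and far from complete; the offsets used (e_machine at byte 18 of an ELF file, e_lfanew at 0x3C of the MZ header, the Machine field right after the "PE\0\0" signature) come from the respective format specifications.

```python
import struct

# Partial tables, just enough for the example.
ELF_MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}
PE_MACHINES = {0x014C: "x86", 0x01C0: "ARM", 0x01C4: "ARM Thumb-2",
               0x8664: "x86-64", 0xAA64: "ARM64"}


def identify_executable(image: bytes) -> str:
    """Classify a file image by its magic bytes and machine field."""
    if image[:4] == b"\x7fELF":
        if len(image) < 20:
            return "ELF (truncated)"
        endian = "<" if image[5] == 1 else ">"  # EI_DATA: 1 = little, 2 = big
        (machine,) = struct.unpack_from(endian + "H", image, 18)
        return "ELF for " + ELF_MACHINES.get(machine, hex(machine))
    if image[:2] == b"MZ" and len(image) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
        if (image[pe_offset:pe_offset + 4] == b"PE\0\0"
                and len(image) >= pe_offset + 6):
            (machine,) = struct.unpack_from("<H", image, pe_offset + 4)
            return "PE for " + PE_MACHINES.get(machine, hex(machine))
        return "MZ (plain DOS executable or another MZ-based format)"
    return "unknown"
```

Running it over an .exe and a Linux binary built for the same CPU shows the same machine type wrapped in two entirely different containers.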
A very naive answer:
Their structures differ because of the different process loaders;
They use OS-dependent features like syscalls, which vary from OS to OS.
Programs need to know how to invoke operating system services. How this is done depends on the operating system: some use interrupts, some use the x86 lcall instruction, some (notably Windows) have distinguished shared libraries and don't document how to directly invoke services. Old 680x0 Macs and some other 680x0 operating systems used a reserved instruction set area and trapped the resulting "invalid CPU opcode" exception. Moreover, even when the mechanism is the same, the order and argument format of system calls differs between operating systems (and sometimes different versions of the same operating system; see stat() in the Linux kernel for an example of an interface that has changed several times).
There is some ability to deal with other operating systems' conventions: FreeBSD has the "linuxulator", which handles the Linux-specific kernel interface; NetBSD similarly has emulators for the system call formats of other operating systems using the same hardware (say, Ultrix on MIPS or OSF/1 on Alpha); Linux used to have iBCS2 to handle the UnixWare/SCO Unix kernel interface; Wine provides replacement shared libraries and a binary loader for PE-style Windows executables. (I don't recall if Wine also supports OS/2-style LX .exes; it probably does handle original format .exe; and then there's .com, which is a raw memory image with no header at all.) Even so, there is always some format that uses different conventions, and sometimes the conventions are similar enough to require hints to the OS as to how to deal with it. (See brandelf on FreeBSD, for example.)
I'm looking to write a tool that converts debug symbols from one format to another format that's compatible with GDB. This seems like a tedious and potentially complex project, so I'm not exactly sure how to tackle it.
Initially I'm aiming to convert the Turbo Debug Symbol table (TDS) emitted by Borland compilers into something like the stabs or dwarf format (dwarf seems to be preferred, from my research). But ideally I want to design my tool to be easy to extend so it could convert other formats later on, e.g. codeview4 or maybe even pdb.
My primary motivations for creating this are:
Interoperability. If I can convert a foreign debug format into a form gdb can work with then source-level debugging would be possible on binaries compiled from another compiler other than gcc. This means any frontend debugging interface that uses gdb as a backend will work as well.
No other tools exist. I searched around on Google for similar tools and the closest I've found is tds2dbg. But it doesn't quite do what I'm looking for.
What I have to work with at the moment:
I already have a debug hook API that can understand the TDS debug format. I can use that to help me get at the needed information from the source format I'm converting from.
For the scope of this project, I'm mainly interested in getting this to work under the win32 environment. Other platforms and tools I'm not really concerned about.
The target dwarf debug format I'm converting to. This one I'm really not familiar with at all. I have used gcc ported compilers like MinGW before and debugged them with gdb with the dwarf format. But I don't have any idea how this format is implemented on windows.
The last point is the one I'm concerned about. I'm reading through the dwarf spec documentation but I find I'm having trouble really understanding and comprehending how it works. There's so much detail in there but at the same time it doesn't have any details about how dwarf gets implemented on object files and image files on a platform that doesn't use ELF natively -- namely the PE-COFF format that windows uses. The documentation is also a very dry read, long sentences make it hard to understand and diagrams and illustrations are sparse. I came across an API called libDwarf that should take most of the parsing work out of interpreting dwarf. The problem is I'm still trying to get it to build and I don't know yet how it will work out.
I haven't written any code yet since I don't fully understand what it is I need to build. I have a feeling the biggest hurdle will be figuring out how to work with dwarf due to its complexity. Googling for information on how dwarf works under windows hasn't turned up anything helpful either. For example, there's no information about the 'glue' code that's needed to contain dwarf within a PE executable image file. How exactly are the dwarf sections laid out? Is there any header information for each section? GDB clearly doesn't just take a 'raw' dwarf debug file and use it as is. So what kind of format does gdb expect the debug file to be in for it to be able to work with it?
My question is, how can I start on such a project? More importantly, where can I turn to for help when I inevitably get stuck on a problem?
Affinic Assembler for Windows
Affinic Assembler is an x86/x86-64 assembler for Windows that takes GAS-syntax assembly source with DWARF debug information and generates the corresponding CodeView format sections in the object file, so that the linked program is debuggable in Visual Studio. This program is good for Cygwin and MinGW users porting Linux code to Windows.
http://www.affinic.com/?page_id=48
You are asking several questions here :-)
I think you are heading in the right direction, using libdwarf.
BUT, have you taken a look at objcopy to see if this tool can do some of the work for you? It probably doesn't support Borland, pdb or codeview4, but it might be worth looking into. (Another approach may be to extend objcopy to support the formats you are trying to convert between.)
I have used the dwarf-discuss mailing list sometimes when I have become stuck.
http://lists.dwarfstd.org/listinfo.cgi/dwarf-discuss-dwarfstd.org
As for the questions on dwarf, split them into separate questions and I will do my best to answer them. :-)
I'll admit upfront that I don't know a whole lot about ARM development, so I probably have my information wrong here.
Visual Studio comes with an ARM assembler (armasm.exe), which is extremely convenient because I use the tools included with VS for basically everything and I'm not too wild about paying for an ARM assembler that comes bundled with a C compiler that I'll never use from other companies.
Now, my understanding is that ARM binaries that are run on-the-metal need to be in a pure binary format instead of something like ELF or PE. Is ARMASM capable of outputting binaries that can run without an operating system? The MSDN documentation for ARMASM appears to be lacking in regards to that type of information.
If not, can you recommend a free ARM assembler that provides macro support and doesn't come bundled with a bunch of extra fluff?
The assembler just produces object files. It's up to the linker to produce the final, executable, file. I'm pretty sure Microsoft uses pretty much their usual linker, which produces PE format executables (which is a COFF variant, in case you care). Offhand, I don't know of a linker/locator that will take MS-COFF format object files and produce a pure binary output file (though that hardly means one doesn't exist -- I've never really looked for one).
Also note that running on the bare metal mostly means burning your file to some variant of ROM. That means you really don't need a pure binary output file -- what you really need is a file suitable for a ROM burner. That usually means Motorola S-records or Intel hex format (quite a few ROM burners accept both).
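If you do end up with a pure binary and need S-records for a burner, the conversion is small enough to script yourself. The Python sketch below is one possible version with made-up function names; it only handles 16-bit addresses (S1 data records, closed by an S9 record) and writes a fixed "HDR" header record, so anything larger would need the S2/S3 record types.

```python
def srec_record(rectype: str, address: int, data: bytes = b"") -> str:
    """Encode one S-record with a 16-bit address field (S0, S1 or S9)."""
    addr = address.to_bytes(2, "big")  # raises OverflowError above 0xFFFF
    # The count byte covers everything after it: address, data and checksum.
    body = bytes([len(addr) + len(data) + 1]) + addr + data
    checksum = ~sum(body) & 0xFF  # ones' complement of the low byte of the sum
    return rectype + (body + bytes([checksum])).hex().upper()


def to_srec(code: bytes, load_address: int = 0, entry: int = 0,
            chunk: int = 16) -> str:
    """Emit a header record, S1 data records, and an S9 termination record."""
    lines = [srec_record("S0", 0, b"HDR")]
    for off in range(0, len(code), chunk):
        lines.append(srec_record("S1", load_address + off, code[off:off + chunk]))
    lines.append(srec_record("S9", entry))  # S9 carries the start address
    return "\n".join(lines) + "\n"
```

Feed it the bytes of the binary (`open(path, "rb").read()`) and write the returned text to a file for the burner.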
I know that doesn't give you a "final answer", but it should at least give you a few terms suitable for Googling to get more relevant information...