Clarification on Binary file (PE/COFF & ELF) formats & terminology

Clarification on Binary file (PE/COFF & ELF) formats & terminology - windows

I'm confusing little in terminology.
A file that is given as input to the linker is called Object File.
The linker produces an Image file, which in turn is used as input by the loader.
I got this from "MS PE & COFF Specification"
Q1. Image file is also referred to as Binary Image, Binary File or just Binary. Right?
Q2. So, according to the above stated terminology, the PE/ELF/COFF are the formats of Image File & not the Object File. right? But http://www.sco.com/developers/gabi/latest/ch4.intro.html says
This chapter describes the object file format, called ELF (Executable and Linking Format). There are three main types of object files.
A relocatable file holds code and data suitable for linking with other
object files to create an executable
or a shared object file.
An executable file holds a program suitable for execution; the
file specifies how exec(BA_OS) creates
a program's process image.
A shared object file holds code and data suitable for linking in two
contexts. First, the link editor [see
ld(BA_OS)] processes the shared object
file with other relocatable and shared
object files to create another object
file. Second, the dynamic linker
combines it with an executable file
and other shared objects to create a
process image.
contradictorily he is saying that both Object File & Image File are ELF formats & He is not at all differentiating between object & image files but referring them commonly as Object files. Isn't it wrong?
Q3. I know that PE is derived from COFF. But why does the Microsoft specifications of PE format is named Microsoft Portable Executable "and Common Object File Format Specification". Do they still support COFF? If they, in which OS? I thought PE completely replaced COFF long ago.

I'm the OP. Every one's answer is a partial answer. So, I'm combining all the other answers with what I learnt to complete the answer.
This is the "Generally" used terminology.
A file that is given as input to the linker (output of assembler) is called Object File or Relocatable File.
The linker produces an Image file, which in turn is used as input by the loader. Now, an Image file can either be an Executable file or Library file. These 'Library files' are of two kinds:
Static Library (*.lib files for windows. *.a for linux)
Shared/Dynamic libraries : DLL ( *.dll on windows) & Shared Object file( *.so in Linux)
The term Binary File / Binary can be used to refer to either an ObjectFile or an ImageFile. Undestand depending up on the context. It is a very general term.
Loader when loads the image file into memory. Then it is called Module (I'm not sure about Linux guys, but windows guys call it Module
http://www.gliffy.com/pubdoc/1978433/L.jpg
alt text http://www.gliffy.com/pubdoc/1978433/L.jpg
As I said, these are "Generally" used terminology. There are no strict definitions for the terms 'binary file', 'image file', or 'object file'.
Particularly the term 'object file' might sometimes be used to mean an intermediate file output by the compiler for use by the linker, but in another context might mean an executable file.
Especially on different platforms they might be used for refer to different or similar things. Even when discussing issues on a single platform, one writer might use the terms somewhat differently than another.
Both ObjectFile & ImageFile are in PE Format in windows & ELF format in linux.
ELF is not only the format of the image file but is also the format of the object file.
Every ELF file starts with an ELF Header. The second field of an ELF Header is e_type; this fields lets us know whether the file is an object file (aka a relocatable in ELF parlance), or an image (which can be either an executable or a shared object) or something else (core file's are also ELF files).
I don't know if there is any bit in header that differentiates an Object file from Image file. It needs to be checked.
I know that PE is derived from COFF.
But why does the Microsoft
specifications of PE format is named
Microsoft Portable Executable "and
Common Object File Format
Specification". Do they still support
COFF? If they, in which OS? I thought
PE completely replaced COFF long ago.
As far as "PE" vs "COFF", my recollection is that Microsoft use the "COFF" specification as the starting point for the "PE" specification but extended it for their needs. So strictly speaking a "PE" file isn't a "COFF" file, but it's very similar in many ways.

There are no strict definitions for the terms 'binary file', 'image file', or 'object file'.
Particularly the term 'object file' might sometimes be used to mean an intermediate file output by the compiler for use by the linker, but in another context might mean an executable file.
Especially on different platforms they might be used for refer to different or similar things. Even when discussing issues on a single platform, one writer might use the terms somewhat differently than another.
As far as "PE" vs "COFF", my recollection is that Microsoft use the "COFF" specification as the starting point for the "PE" specification but extended it for their needs. So strictly speaking a "PE" file isn't a "COFF" file, but it's very similar in many ways.

gcc -c will produce a .o file, which is an elf format object file, on a Linux system. "ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV)" is how a .o file is described by the file command on my machine.

In regards to Q2 for ELF, ELF is not only the format of the image file but is also the format of the object file.
Every ELF file starts with an ELF Header. The second field of an ELF Header is e_type; this fields lets us know whether the file is an object file (aka a relocatable in ELF parlance), an image (which can be either an executable or a shared object) or something else (core file's are also ELF files).

As an aside, I know core dumps on Solaris (and I am guessing other Unix flavors) can be in ELF format.

Related

Extract Structure definitions from executable

I need to extract structure definitions from an executable. How can I do that?
I read we can do it using ELF, but not sure how to do this. Any help here?

I read we can do it using ELF, but not sure how to do this.
What you probably read is that if a binary contains debug info, then the types of variables, structures, and great many other kinds of info can be extracted from that binary.
This isn't specific to ELF: many other executable formats (such as COFF) allow for embedding of debugging info as well.
Further, the format of that debugging info is different between different platforms. Some of the common UNIX ones are DWARF and STABS (with DWARF being more recent and much more powerful).
If you have an ELF binary, and you suspect that it may contain DWARF debug info, you can decode it using readelf -wi a.out (be prepared for there to be a lot of info, if any is present at all). objdump -g can be used to decode STABS (recent objdump versions can decode DWARF as well).
Or, as suggested by tristan, you can load the executable into GDB and use info types and ptype commands.
If the binary doesn't contain debug info, then DrPrItay's answer is correct: you can't easily recover structure definitions from it. However, you still can recover them by using reverse-engineering techniques. For example, many struct definitions used by the Wine project (example) were obtained by such techniques.

As much as I know, you can't. c / c++ programs are not like java, structs dont gain a symbol. Their just definitions for your compiler about how to align and pack variables within stack frames or some other memory (struct data members). For example unlike java you dont have what resembles class loading when loading shared objects's (no header file included within your c program ) you can only load global variables and functions. Defining a struct is much as creating some data type, it's definition should be only present for compilation, you dont get a symbol within the symtable for int or char then why should you for some struct? It simply makes no sense. Symbols aee soley meant for objects that your compiler doesn't recognize during compilation - link time/load time/run time

GCC proper visibility for shared object written in C++

I have a huge project written in C++. It's all split into multiple static libraries that are eventually linked into one final shared library which has to export only a few simple functions.
If I do objdump of that final .so I see all my internal names etc. Because it uses long class names and namespaces these strings become excessively long and as a result final binary is big.
So, my question is how do I do it properly with GCC to make sure that all these internal functions do not show up in the final binary?
I'm aware about all these GCC-specific visibility modifiers, I use -fvisibility=hidden -fvisibility-inlines-hidden, I use -Wl,--no-whole-archive. I disable c++ exceptions and rtti (-fno-exceptions -fno-rtti) but i still can't get GCC to generate my final .so that doesn't contain names of my namespaces and classes that aren't supposed to be there at all!
I tried to use -Wl,--version-script= to control which functions should be visible, but still I see lot's of internal names in final stripped shared object. I read multiple similar entries on SO, but don't see anything that does the job.
Note: I compile for multiple platforms (Linux, Windows, iPhone etc) and only on windows in VS I don't have any problems.
thanks

You might want to try the --retain-symbols-file linker option when linking the final .so file (-Wl,--retain-symbols-file=filename) to specify JUST the symbols you want to keep (export) and delete everything else. The file is just a text file with symbols (one per line) to keep.

How can I add sections to an existing (OS X) executable?

Is there any way of adding sections to an already-linked executable?
I'm trying to code-sign an OS X executable, based on the Apple instructions. These include the instruction to create a suitable section in the binary to be signed, by adding arguments to the linker options:
-sectcreate __TEXT __info_plist Info.plist_path
But: The executable I'm trying to sign is produced using Racket (a Scheme implementation), which assembles a standalone executable from Racket/scheme code by cloning the 'real' racket executable and editing the Mach-O file directly.
So the question is: is there a way I can further edit this executable, to add the section which is required for the code-signing?
Using ld doesn't work when used in the obvious way:
% ld -arch i386 -sectcreate __TEXT __info_plist ./hello.txt racket-executable
ld: in racket-executable, can't link with a main executable
%
That seems fair enough, I suppose. Libtool doesn't have any likely-looking options, and neither does the redo_prebinding command (which is at least a command for editing executables).
The two possibilities suggested by the relevant Racket list were (i) to extend the the racket compilation tool to adjust the surgery which is done on the executable (feasible, but scary), or (ii) to create a custom racket executable which has the desired section already in place. Both seem like sledgehammer-and-nut solutions. The macosx-dev list didn't come up with any suggestions.

I think this is infeasible.
There appear to be no stock commands which edit a Mach-O object file (which includes executables). The otool command allows you to view the structure of such a file (use otool -l), but that's about it.
The structure of a Mach-O object file is documented on the Apple reference site. In summary, a Mach-O object file has the following structure:
a header, followed by
a sequence of 'load commands', followed by
a sequence of 'segments' (some of the load commands are responsible for pointing to the offsets of the segments within the file)
The segments contain zero or more 'sections'. The header and load commands are deemed to be in the first segment of the file, before any of that segment's sections. There are a couple of dozen load commands documented, and other commands defined in the relevant header file, therefore clearly private.
Adding a section would imply changing the length of a segment. Unless the section were very small, this would require pushing the following segment further into the file. That's a no-no, because a lot of the load commands refer to data within the file with absolute offsets from the beginning of the file (as opposed to, say, the beginning of the segment or section which contains them), so that relocating the segments in a putative Mach-O editor would involve patching up a large number of offsets. That doesn't look easy.
There are one or two Mach-O file viewers on the web, but none which do much that's different from otool, as far as I can see (it's not particularly hard: I wrote most of one this afternoon whilst trying to understand the format). There's at least one Mach-O editor, but it didn't seem to do anything that lipo didn't do (called 'moatool', but the source appears to have disappeared from google code).
Oh well. Time to find a Plan B, I suppose.

The gimmedebugah tool is able to modify the __info_plist TEXT section of an existing binary. See https://reverse.put.as/2013/05/28/gimmedebugah-how-to-embedded-a-info-plist-into-arbitrary-binaries/
It is available here: https://github.com/gdbinit/gimmedebugah

Does exist any utility to know the size of a compiled function in an executable?

I want a report showing me the size of diferent symbols(compiled) in the executable. Something like .map files in Delphi, but generic if possible. nm from binutils, shows start address(?), maybe could i use that information?
(I'm using object pascal + freepascal compiler)

FPC/LD can generate mapfiles too
various ways to analyze .o files. (nm, objdump and parse the address increments between sections)
maybe the information is stored in the .ppu, have a look in the ppu unit (compiler dir) which contains .ppu loaders

Size of a library and the executable

I have a static library *.lib created using MSVC on windows. The size of library is say 70KB. Then I have an application which links this library. But now the size of the final executable (*.exe) is 29KB, less than the library. What i want to know is :
Since the library is statically linked, I was thinking it should add directly to the executable size and the final exe size should be more than that? Does windows exe format also do some compression of the binary data?
How is it for linux systems, that is how do sizes of library on linux (*.a/*.la file) relate with size of linux executable (*.out) ?
-AD

A static library on both Windows and Unix is a collection of .obj/.o files. The linker looks at each of these object files and determines if it is needed for the program to link. If it isn't needed, then the object file won't get included in the final executable. This can lead to executables that are smaller then the library.
EDIT: As MSalters points out, on Windows the VC++ compiler now supports generating object files that enable function-level linking, e.g., see here. In fact, edit-and-continue requires this, since the edit-and-continue needs to be able to replace the smallest possible part of the executable.

There is additional bookkeeping information in the .lib file that is not needed for the final executable. This information helps the linker find the code to actually link. Also, debug information may be stored in the .lib file but not in the .exe file (I don't recall where debug info is stored for objs in a lib file, it might be somewhere else).

The static library probably contains several functions which are never used. When the linker links the library with the main executable, it sees that certain functions are never used (and that their addresses are never taken and stored in function pointers), it just throws away the code. It can also do this recursively: if function A() is never called, and A() calls B(), but B() is never otherwise called, it can remove the code for both A() and B(). On Linux, the same thing happens.

A static library has to contain every symbol defined in its source code, because it might get linked into an executable which needs just that specific symbol. But once it is linked into an executable, we know exactly which symbols end up being used, and which ones don't. So the linker can trivially remove unused code, trimming the file size by a lot. Similarly, any duplicate symbols (anything that's defined in both the static library and the executable it's linked into gets merged into a single instance.

Disclaimer: It's been a long time since I dealt with static linking, so take my answer with a grain of salt.
You wrote: I was thinking it should add directly to the executable size and final exe size should be more than that?
Naive linkers work exactly this way - back when I was doing hobby development for CP/M systems (a LONG time ago), this was a real problem.
Modern linkers are smarter, however - they only link in the functions referenced by the original code, or as required.

Additionally to the current answers, the linker is allowed to remove function definitions if they have identical object code - this is intended to help reduce the bloating effects of templated code.

#All: Thanks for the pointers.
#Greg Hewgill - Your answer was a good pointer. Thanks.
The answer i found out was as follows:
1.)During Library building what happens is if the option "Keep Program debug databse" in MSVC (or something alike ) is ON, then library will have this debug info bloating its size.
but when i statically include that library and create a executable, the linker strips all that debug info from the library before geenrating the exe and hence the exe size is less than that of the library.
2.) When i disabled the option "Keep Program debug databse", i got an library whose size was smaller than the final executable, which was what i thought is nromal in most situations.
-AD

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio