JVM - bytecode content difference after compiling - java-8

I recently saw some behaviour that prompted me to ask this on SO. I was hoping that people would be able to share their findings too.
Would a class file (bytecode) be different if the same file is compiled (unchanged) using JDK 1.8 u66 and JDK 1.8 u121? What I mean is the following:
1) I compile an application using JDK 1.8 u66
2) I make changes to 1 or 2 files and recompile using JDK 1.8 u66.
Could I expect some of the unchanged class files to have different binary content even though they haven't changed?
My reason for asking is that I took a hash of a file that wasn't changed as part of the steps above: the two versions had the same size on disk, but the hashes were completely different. I then used WinMerge to compare the two versions; the size was reported as identical, but the binary contents differed. The following is what I compared using WinMerge (the blue marked item was related to my source name, so I had to mask it out); please observe the differences at 208 and 248.
Is this expected? If so, could someone please point me to the literature that explains this?

There are countless reasons why the same Java source file may be compiled to different bytes by different compilers, and different versions of the same compiler should indeed be regarded as different compilers. Even for the exact same compiler there is no guarantee that the bytes are identical.
One such reason is that all references in the code (other than opcodes and bytecode offsets) are indirected through the constant pool. The order of entries in the constant pool is not specified, so it may change, which leads to every reference using a different index.
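As a rough illustration of the constant pool point, the following Python sketch builds two synthetic class-file prefixes that contain the same Utf8 constants in a different order. The helper names (utf8_constants, fake_header) are made up for this example, and the buffers are only the header plus constant pool, not loadable classes. Comparing the sorted constants is a heuristic only: it shows that the hash changes while the contents stay the same, and it does not prove that two real class files are equivalent.

```python
import hashlib
import struct

# Payload size (bytes after the tag) of each fixed-size constant pool entry.
_FIXED = {3: 4, 4: 4, 5: 8, 6: 8, 7: 2, 8: 2, 9: 4, 10: 4, 11: 4,
          12: 4, 15: 3, 16: 2, 17: 4, 18: 4, 19: 2, 20: 2}


def utf8_constants(class_bytes):
    """Return the Utf8 constant pool entries of a class file, in file order."""
    magic, _minor, _major, count = struct.unpack_from(">IHHH", class_bytes, 0)
    if magic != 0xCAFEBABE:
        raise ValueError("not a class file")
    pos, index, found = 10, 1, []
    while index < count:
        tag = class_bytes[pos]
        pos += 1
        if tag == 1:  # CONSTANT_Utf8: u2 length followed by the bytes
            (length,) = struct.unpack_from(">H", class_bytes, pos)
            found.append(class_bytes[pos + 2:pos + 2 + length])
            pos += 2 + length
        elif tag in _FIXED:
            pos += _FIXED[tag]
        else:
            raise ValueError(f"unknown constant pool tag {tag}")
        # Long and Double entries occupy two constant pool slots.
        index += 2 if tag in (5, 6) else 1
    return found


def fake_header(strings):
    """Hypothetical helper: header plus a Utf8-only constant pool."""
    out = struct.pack(">IHHH", 0xCAFEBABE, 0, 52, len(strings) + 1)
    for s in strings:
        out += b"\x01" + struct.pack(">H", len(s)) + s
    return out


a = fake_header([b"Foo", b"java/lang/Object"])
b = fake_header([b"java/lang/Object", b"Foo"])
same_size = len(a) == len(b)
same_hash = hashlib.sha256(a).digest() == hashlib.sha256(b).digest()
same_constants = sorted(utf8_constants(a)) == sorted(utf8_constants(b))
```

Here the two buffers have the same size and the same constants but different hashes, which matches the symptom described in the question.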
Note also that the JVMS has a chapter titled Compiling for the Java Virtual Machine, which, however, starts by saying:
The numbered sections in this chapter are not normative
As a result, reasoning works in only one direction: identical bytes indicate that nothing relevant in the source changed, but different bytes don't necessarily imply different source code.
JDK-8067422, as linked from one comment, gives an example where even the same compiler can produce different bytes for the same source file (perhaps due to a different set of source files being compiled in the same compiler invocation). As per the JLS and JVMS this is legal, just inconvenient.

Related

Using binary breakpoints in GDB - how exact is the location?

I have some memory dumps from programs compiled with GCC on Red Hat Linux, like:
/apps/suns/runtime/bin/mardb82[0x40853b]
When I open mardb82 and put the breakpoint with break *0x40853b it will give me C filename/lineno which seems quite correct, but not completely.
Can I trust it, and what does it depend on? Is it sufficient if the source file in question is the same, or do the files making up the executable have to be the same?
Can I find the locations in sources in some other way?
(Max debug info and sources are present, I haven't tried not having the sources present or passing them in)
When I open mardb82 and put the breakpoint with break *0x40853b it will give me C filename/lineno which seems quite correct, but not completely.
A faster way to get the filename/line:
addr2line -fe /path/to/mardb82 0x40853b
You didn't say where the ...bin/mardb82[0x40853b] line came from. Assuming it is part of a crash stack trace, note that the recorded address is usually that of the instruction after a CALL, so you may be interested in 0x40853b-5 (on x86 architectures) for all but the innermost level in the stack.
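A small Python sketch of that adjustment, assuming frames are printed as path[0xaddress] like the line in the question. The function name addr2line_command and the default call_size of 5 (the length of a direct near CALL on x86) are assumptions made for this example; the function only builds the addr2line argument list and does not run it.

```python
import re

# Matches frames such as "/path/to/binary[0x40853b]".
FRAME = re.compile(r"^(?P<path>.+)\[(?P<addr>0x[0-9a-fA-F]+)\]$")


def addr2line_command(frame, innermost=False, call_size=5):
    """Build an addr2line invocation for one stack frame line.

    For every frame except the innermost one, the recorded address is a
    return address, so step back by call_size to land on the CALL itself.
    """
    match = FRAME.match(frame.strip())
    if match is None:
        raise ValueError(f"unrecognised frame: {frame!r}")
    address = int(match.group("addr"), 16)
    if not innermost:
        address -= call_size
    return ["addr2line", "-f", "-e", match.group("path"), hex(address)]


outer = addr2line_command("/apps/suns/runtime/bin/mardb82[0x40853b]")
inner = addr2line_command("/apps/suns/runtime/bin/mardb82[0x40853b]",
                          innermost=True)
```

The resulting list can be passed to subprocess.run on a machine that has the matching executable and binutils installed.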
what does it depend on? Is it sufficient if the source file in question is the same, or do the files making up the executable have to be the same?
The instruction address depends on the particular executable. Any change to source code comprising that executable, to compilation or linking flags, etc. etc. may cause the instructions to shift to a different address.

Are the names of COFF Data Directories fixed?

I have a PE file (notepad); the NumberOfRvaAndSizes value in the optional header is 0x10, and there are 16 DataDirectory entries as expected.
The documentation says that this value can change (though I've never seen it), which would mean there were more or fewer than 16 entries.
Immediately after that, there's a list of 16 data directories complete with names.
Are these names just always the same, in that exact order?
If there are fewer, will it always be whatever directories are at the end that will be missing?
If there are greater than 16, what names are they assigned?
It's always a matter of specification vs. implementation.
Are these names just always the same, in that exact order?
As for the names (I guess you are referring to the section names?), no, they can change. You can name sections whatever you want, although most implementations (i.e. linkers) keep the names used in the specification (e.g. .reloc for the relocations).
The order of the data directories is fixed; you can refer to them by their index.
If there are fewer, will it always be whatever directories are at the end that will be missing?
I'm not sure a valid PE (which can be loaded by an actual supported system) can have fewer than 16 data directories. It might be possible though as the location of the section headers is probably calculated using the FILE_HEADER.SizeOfOptionalHeader.
The reference implementation for loading a PE file (the Windows Loader) is not open source so it's not easy to answer this question.
My guess is that it could work: it's like trying to load a Win2K PE on a Windows 10 system (given that it imports functions that are still present on Windows 10). It would be as if the CLR data directory were just not there.
If there are greater than 16, what names are they assigned?
You can't have more than 16 data directories because the maximum number is 16. I'm pretty sure the Windows loader would not load a PE file with more than 16 data directories.
The documentation says that this value can change (though I've never seen it), which would mean there were more or fewer than 16 entries.
The number is fixed at 16 right now. For example, the last addition was the CLR data directory, which was added to load the CLR with the introduction of .NET. Before that, the number was 15, so yes, the value can change and will not always be 16, but this doesn't mean it changes between PEs. What I mean is that, at a given time, for a supported system, all PEs will have the same number of data directories.
My guess is that, at the time of the introduction of .NET (with the CLR data directory), there were PEs with 15 data directories and others with 16. The Windows loader was probably patched to account for the two different numbers. Right now it is probable that the number is fixed at 16.
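To make the "refer to them by index" point concrete, here is a minimal Python sketch that reads NumberOfRvaAndSizes and labels each data directory by its position. The labels follow the order documented for the optional header; the function names and the synthetic header-only buffer are assumptions for illustration, and nothing here says how the Windows loader itself treats counts other than 16.

```python
import struct

# Labels by index, in the order documented for the optional header.
DIRECTORY_NAMES = [
    "Export Table", "Import Table", "Resource Table", "Exception Table",
    "Certificate Table", "Base Relocation Table", "Debug", "Architecture",
    "Global Ptr", "TLS Table", "Load Config Table", "Bound Import",
    "IAT", "Delay Import Descriptor", "CLR Runtime Header", "Reserved",
]


def data_directories(image):
    """Return (label, rva, size) for each data directory entry present."""
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[:2] != b"MZ" or image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    opt = pe_offset + 4 + 20  # skip the signature and the 20-byte file header
    (magic,) = struct.unpack_from("<H", image, opt)
    if magic == 0x10B:        # PE32
        count_at = opt + 92
    elif magic == 0x20B:      # PE32+
        count_at = opt + 108
    else:
        raise ValueError(f"unexpected optional header magic {magic:#x}")
    (count,) = struct.unpack_from("<I", image, count_at)
    entries = []
    for i in range(count):
        rva, size = struct.unpack_from("<II", image, count_at + 4 + 8 * i)
        label = DIRECTORY_NAMES[i] if i < len(DIRECTORY_NAMES) else f"(unnamed #{i})"
        entries.append((label, rva, size))
    return entries


def tiny_image(directories):
    """Hypothetical header-only PE32 buffer for the demo (not loadable)."""
    pe_offset = 0x40
    image = bytearray(pe_offset + 4 + 20 + 96 + 8 * len(directories))
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, pe_offset)
    image[pe_offset:pe_offset + 4] = b"PE\0\0"
    opt = pe_offset + 24
    struct.pack_into("<H", image, opt, 0x10B)
    struct.pack_into("<I", image, opt + 92, len(directories))
    for i, (rva, size) in enumerate(directories):
        struct.pack_into("<II", image, opt + 96 + 8 * i, rva, size)
    return bytes(image)


listing = data_directories(tiny_image([(0, 0), (0x2000, 0x28)]))
```

With only two entries declared, the listing contains the first two labels in order, which is the sense in which any missing directories are the ones at the end.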

How can an executable be this small in file size?

I've been generating payloads with Metasploit and experimenting with the different templates. One of the templates you can use for your payload is exe-small. The payload I've been generating is windows/meterpreter/reverse_tcp: with the normal exe template the file is around 72 KB, whereas exe-small outputs a payload of about 2.4 KB. Why is this? And how could I apply this to my own programming?
The smallest possible PE file is just 97 bytes - and it does nothing (just return).
The smallest runnable executable today is 133 bytes, because Windows requires kernel32 to be loaded. Executing a PE file with no imports is not possible.
At that size it can already download a payload from the Internet by specifying a UNC path in the import table.
To achieve such a small executable, you have to:
implement it in assembler, mainly to get rid of the C runtime
decrease the file alignment, which is 512 by default
remove the DOS stub that prints the message "This program cannot be run in DOS mode"
merge some of the PE parts into the MZ header
remove the data directories
The full description is available in a larger research blog post called TinyPE.
For EXEs this small, most of the space is typically used by the icon. The icon usually contains several sizes and color depths, which you could get rid of if you do not mind having an "old, rusty" icon, or no icon at all.
Signing the EXE also takes up some 4k of space.
As an example for a small EXE, see never10 by grc. There is a details page which highlights the above points:
https://www.grc.com/never10/details.htm
in the last paragraph:
A final note: I'm a bit annoyed that “Never10” is as large as it is at
85 kbyte. The digital signature increases the application's size by
4k, but the high-resolution and high-color icons Microsoft now
requires takes up 56k! So without all that annoying overhead, the app
would be a respectable 25k. And, yes, of course I wrote it in
assembly language.
Disclaimer: I am not affiliated with grc in any way.
There is little need for an executable to be big, except when it contains what I call code spam: code not actually critical to the functionality of the program/exe. This holds for other files too. Compare a manually written HTML page with one generated by FrontPage. That's spam code.
I remember my good old DOS files that were all KB in size and performed practically any needed task in the OS. One of my .exes (actually .com) was only 20 bytes in size.
Think of it this way: just as a large majority of the files in a Windows installation can in some situations be removed while the OS still functions perfectly, the same goes for .exe files: large parts of the code are either useless, serve a purpose unrelated to the program's objective, or are intentionally added (see below).
The peak of this aberration is the code added nowadays to the .exe files of some games that use advanced copy protection, which can make the files as large as dozens of MB. The code actually needed to run the game is practically under 10% of the full code.
A file size of 72 KB, as in your example, can be quite sufficient to do practically anything to a Windows OS.
To apply this to your programming, that is, to make very small .exes, keep things simple. Don't add unnecessary code just for the looks of it or because you think you might use that part of the program at some point.

Is dual mode executable possible?

A bit of history... I have 3 systems that I spend time on, a DOS 6.22 system, a Windows 95 system, and a modern Windows 7 (64-bit) system. When I upgraded to Win7-64, some of my favorite command line utilities stopped working, so I decided to re-write them myself. The only 2 compilers I have are Borland Turbo C++ 3.0 and Visual Studio 2008, and they worked fine for building 2 versions, a DOS 16-bit, and a Windows 7 32-bit (could have built 64-bit too, I guess.) The problem came with my Win95 system. The DOS version works fine there, but since I spent the time to support LFNs in the Win7 build, I wanted it with my Win95 system. So, after a lot of research, I found and purchased Visual Studio 6 (last one with Win95 support according to what I researched,) copied the code over (had to rewrite sections, of course,) and it compiled just fine, and works :)
The problem occurred the next time I had to boot my Win95 system in DOS mode. The program stopped working (of course,) because Win95 wasn't loaded. I don't really want to have 2 copies of the program installed (needing 2 different file names,) so I was hoping there was a way to link the 2 versions together into one file. If I execute it in DOS, instead of it saying it requires windows, it would just jump to the DOS section of the program. That way, it would be a single program, with LFN support if Win95 is loaded, and without if Win95 isn't loaded. Since the Win95 version also works fine in Win7-64, it would probably also produce a single version that works on all 3 systems (which would be an added bonus.)
I did some web searches, and couldn't find anything germane to what I'm looking for. So I have no idea if it is even possible. I may have to get yet another compiler, but considering how old it would have to be, I could probably afford it. My web searches did result in information that leads me to believe that it "should" be possible, though. It would just require a different exe header than the one Windows compilers put in. It may require that I re-write the DOS version for 32-bit and use a DOS extender (for protected mode, assuming I can't find a way to include it in the file itself.) That would be acceptable (though not ideal.) I would much rather have 16-bit code in the DOS section, and 32-bit code in the Windows section (for the most compatibility.)
Does anyone have any information about something like this? If you could just point me in the right direction it would be greatly appreciated.
I don't know if it has been continued in Windows 7 executables, but back in Win95 the executable (EXE) actually had two entry points -- one "normal" one that DOS would find, and a second one that Windows would use. The DOS entry point was usually a very simple default that would just print "This is a Windows program" and exit. You can actually override this default, and have the linker use your own code, however it is very limited.
What I'd recommend doing is add logic to your DOS 6.22 version (e.g. "sed") that would check the OS level & if it meets the right criteria, pass the parameters along to a second executable (e.g. "sedx") that uses features from the "newer" OS.
The documentation for Visual Studio 6 describes the /STUB linker option; simply point it at the DOS version of your program.
I don't have VS6 handy, so I can't be too specific, but in the project settings GUI, there should be an "additional options" setting in the linker section.
Well the answer is the /stub option in the Linker you are using for your Windows code. Some additional information for anyone who finds the question later.... I had to do several days of web searches to find that there doesn't appear to be another answer to my particular problem.
/STUB requires that the DOS-mode executable have a header of at least 40 hex (64 decimal) bytes, which is large enough to hold the PE pointer at offset 3C hex. After fighting with multiple compilers that "DO" give you a header of the right size (Borland Turbo C++ won't), and not being able to convert my code, I had to get sneaky/fancy. BTW, Visual C 1.52c (the last Visual C that supports DOS) will make a correct header, as will Open WatCom.
If you are faced with the same issue I was (the compiler you used won't make a header of the correct size, and your code is too compiler specific to convert easily), you can do what I ended up doing. I used Open WatCom to write a tiny ("Hello World") Windows program using my exe with the short (Borland created) header as the stub. Open WatCom will adjust the header automatically. I then used a hex editor to read the header information to get the ending address of the stub, and a partial file copier to copy only that part of the program to a file I named "stub.exe" (stripping off the Windows code). Using the same hex editor I zeroed out the PE pointer in the header. I now had a working DOS exe that would also work as a stub. I took my stub to my Windows compiler and linked it in. It works great, all features fully realized :)
FYI - Information needed to strip the Windows portion and zero the PE pointer.
The first byte is offset 0 (of course, but some people may not realize that, and think it's byte 1). Also remember that most hex editors (by their very name) give you numbers in hexadecimal format.
Offsets 2 and 3: the number of bytes in the last block of the DOS portion of the file, in low byte, high byte format. That is, offset 2 is low and 3 is high. So take them, reverse them, and you will get a number from 0 to 511 (0 to 1FF in hex). 0 means the entire block of 512 (200 in hex) bytes is used.
Offsets 4 and 5 (again in low/high format): the number of 512-byte (200 in hex) blocks in the DOS portion. Remember to reverse the bytes, and that the last block may be only a partial block. So, subtract one, multiply by 512 (200 hex), add the number from offsets 2-3 (or 512 if that number is 0), and you have how many bytes are in the DOS portion. Since you are starting from 0, subtract 1, and you now know to copy only bytes 0 through that total to your stub exe.
Offsets 60-63 (hex 3C-3F): the pointer to the start of the PE (Portable Executable) portion of the file (the part that Windows uses). It should be just past the end of the DOS portion (mine was padded with a few zeroes). This isn't important at this time, as we are just turning those bytes into 0's anyway (the PE portion has been stripped). You can use it as confirmation that you have the correct "end of DOS" offset selected, though.
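The arithmetic above can be sketched in Python as follows. This is only an illustration of the header fields described here, not the tool chain I actually used; the names dos_portion_size and make_stub are made up, the demo buffer is synthetic rather than a real executable, and the PE pointer is treated as a 4-byte little-endian field at offset 3C hex.

```python
import struct

BLOCK = 512  # size of one block in the MZ header's accounting


def dos_portion_size(exe):
    """Number of bytes in the DOS portion, from MZ header offsets 2-5."""
    last_block_bytes, block_count = struct.unpack_from("<HH", exe, 2)
    if last_block_bytes == 0:
        # 0 means the last block is completely used.
        return block_count * BLOCK
    return (block_count - 1) * BLOCK + last_block_bytes


def make_stub(exe):
    """Keep only the DOS portion and zero the PE pointer at offset 0x3C."""
    if exe[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    stub = bytearray(exe[:dos_portion_size(exe)])
    struct.pack_into("<I", stub, 0x3C, 0)
    return bytes(stub)


# Synthetic demo: 3 blocks, 0x50 bytes used in the last one,
# with a PE pointer just past the DOS portion.
fake = bytearray(2048)
fake[0:2] = b"MZ"
struct.pack_into("<HH", fake, 2, 0x50, 3)
struct.pack_into("<I", fake, 0x3C, 0x460)

size = dos_portion_size(bytes(fake))
stub = make_stub(bytes(fake))
```

In the demo the DOS portion is 2 * 512 + 0x50 = 1104 bytes, so the bytes to copy are offsets 0 through 1103.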
The tools I used are:
Open WatCom at http://www.openwatcom.org/index.php/Main_Page
and
Part Copy at http://www.virtualobjectives.com.au/utilitiesprogs/partcopy.htm
I have no idea where to find the Hex Editor I used. I used CEdit, a DOS program I really like, but have been unable to find on the net. Have to use DOSBox with it as Win7 won't run it, though. There are probably other compilers that do the same thing, and probably tons of partial file copiers available. These are the tools I used.

What are gcc linker map files used for?

What are the ".map" files generated by gcc/g++ linker option "-Map" used for?
And how do I read them?
I recommend generating a map file and keeping a copy for any software you put into production.
It can be useful for deciphering crash reports. Depending on the system, you can likely get a stack dump from the crash. The stack dump will include memory addresses, and one of the registers reported will be the instruction pointer. That tells you the memory address at which code was executing. On some systems, code addresses can be moved around (when loading dynamic libraries, hence "dynamic"), but the lower order bytes should remain the same.
The map file is a MAP from memory location -> code location. It gives you the name of the function at a given memory address. Due to optimizations, it may not be extremely accurate, but it gives you a place to start in terms of looking for bugs that cause the crash.
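As a sketch of that lookup, assume the function start addresses have already been extracted from the map file into a sorted list (parsing the map text itself is not shown, and its layout depends on the linker). The symbol names and addresses below are hypothetical.

```python
import bisect


def symbol_for(address, symbols):
    """Find the function containing address.

    symbols is a list of (start_address, name) sorted by start_address.
    Returns (name, offset_into_function), or None if the address lies
    before the first known symbol.
    """
    starts = [start for start, _name in symbols]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, address - start


# Hypothetical entries, as they might be read from a map file.
SYMBOLS = [
    (0x408000, "main"),
    (0x408400, "load_config"),
    (0x408800, "shutdown"),
]

hit = symbol_for(0x40853B, SYMBOLS)
miss = symbol_for(0x1000, SYMBOLS)
```

The lookup simply picks the last symbol that starts at or before the crash address, which is why the answer is only a starting point when optimizations have rearranged or inlined code.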
Now, in 30 years of writing commercial software, this is the only thing I've used the map files for. Twice successfully.
What are the ".map" files generated by gcc/g++ linker option "-Map" used for?
There is no such thing as 'gcc linker' -- GCC and linker are independent and separate projects.
Usually the map is used for understanding decisions that ld made while linking the binary. From man ld:
-M
--print-map
Print a link map to the standard output.
A link map provides information about the link, including the following:
· Where object files are mapped into memory.
· How common symbols are allocated.
· All archive members included in the link, with a mention of the symbol which caused the archive member to be brought in.
· The values assigned to symbols.
...
If you don't understand what that means, you likely don't (yet) have the questions that this output answers, and hence have no need to read it.
The compiler gcc is one program that generates object code files; the linker ld is a second program that combines the object code files into an executable. The two steps can be combined into a single command line.
If you are generating a program to run on an ARM processor, you need to use a cross toolchain such as arm-none-eabi-gcc and arm-none-eabi-ld so that the code will be correct for the ARM architecture. Plain gcc and ld generate code for your host computer.
