How to syntax highlight a hex dump of JVM classfile? - syntax-highlighting

I'm studying binary file structure of JVM classfile.
My current toolbox consists of $ xxd <classfile> and $ javap -v <classfile>.
Sample outputs of these two tools are as follows:
$ xxd com/example/mycode/MyTest.class
00000000: cafe babe 0000 003d 001d 0a00 0200 0307 .......=........
00000010: 0004 0c00 0500 0601 0010 6a61 7661 2f6c ..........java/l
00000020: 616e 672f 4f62 6a65 6374 0100 063c 696e ang/Object...<in
00000030: 6974 3e01 0003 2829 5609 0008 0009 0700 it>...()V.......
...
000001a0: 0000 000a 0002 0000 0005 0008 0006 0001 ................
000001b0: 001b 0000 0002 001c ........
and
$ javap.exe -v com/example/mycode/MyTest.class
Classfile /<PathTo>/MyTest.class
Last modified 2022/11/01; size 440 bytes
...
interfaces: 0, fields: 0, methods: 2, attributes: 1
Constant pool:
#1 = Methodref #2.#3 // java/lang/Object."<init>":()V
#2 = Class #4 // java/lang/Object
#3 = NameAndType #5:#6 // "<init>":()V
#4 = Utf8 java/lang/Object
#5 = Utf8 <init>
#6 = Utf8 ()V
...
#27 = Utf8 SourceFile
#28 = Utf8 MyTest.java
{
public com.example.mycode.MyTest();
descriptor: ()V
flags: (0x0001) ACC_PUBLIC
Code:
stack=1, locals=1, args_size=1
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 3: 0
public static void main(java.lang.String[]);
...
}
SourceFile: "MyTest.java"
But, from these two outputs, it is difficult to comprehend which part of one output
correspond which part of the other.
It is hard to analyze the hex dumped binary by comparing with the disassembled output.
In this particular case I could manually assign tags by referring
specification,
but it was hard work even if the sample file is a trivial hello world.
In general large files such method is hard to be done.
Edit: made the question prorer
So what I want to do is the following:
Syntax highlight the xxd dump output along classfile structure
so that easily view which part is, for example, the constant pool part,
or the method info and attributes, in order to easily compare
with javap output.
More aggressively, it is useful to view javap and xxd outputs
side by side, and selecting a text on one side results in
highlighting corresponding text on the other side.
So, my question:
Is there any way or any other tools to understand
xxd hex dump output in terms of javap decompiled output?
Especially I'd like to comprehend that each hex corresponds to
each decompiled entry one-to-one.
My current idea is to highlight colors on hex dump,
possibly like the following image.
Is there any software to do like this?
Maybe I need to do some coding,
something like writing parser of .class-file.
Then, which is the efficient way to do it in less effort
to obtain highlighted hex dump with format tag annotations
according to the .class-file specs, like shown in the image below?
Thank you for reading.

Related

Why BER-TLV "DF9A" tag is recognized as "invalid"?

I have problem with understanding why all the BER-TLV parsers I found:
https://paymentcardtools.com/emv-tlv-parser
https://emvlab.org/tlvutils/
https://chrome.google.com/webstore/detail/tlv-parser/iemijfhdlipdpcjfnphcdalpccnkfedb
Recognize this tag: DF9A03001736 as "invalid", while: DF5603001736 and DF0903001736 work just fine.
What's the difference?
just follow the description provided in EMV Book 3, Annex B1
"invalid" case: DF9A03001736
DF - 1101 1111
9A - 1001 1010 - here, in the second byte of the Tag, the b8 is set (1), which means that 'Another byte follows', so the following byte (value 03) is also part of the Tag
03 - 0000 0011 - the last byte of the Tag, i.e. the actual Tag is DF9A03
so, in our sequence we have:
DF9A03 - Tag
00 - Length (no value)
17 - is already a new Tag
36 - length of the Tag 17 ...
So the parser (https://paymentcardtools.com/emv-tlv-parser) fails because no data is available (Error during parsing Tag 17: Not enough data)
correct example: DF5603001736
DF - 1101 1111
56 - 0101 0110 - there are no more bytes that constitute Tag, so we just have Tag DF56
the sequence is:
DF56 - Tag
03 - Length
001736 - Value

M68k-elf-gcc Floating Point Issues

I am trying to build a temperature control application for a 68000 processor. I am currently using GCC 8.2.0. I am compiling with the -msoft-float flag. However, the floating point library routines appear to be broken. Example:
'000174f4 <__ltdf2>:'
'174f4: 4e56 0000 linkw %fp,#0'
'174f8: 4878 0001 pea 1 <ADD>'
'174fc: 2f2e 0014 movel %fp#(20),%sp#-'
'17500: 2f2e 0010 movel %fp#(16),%sp#-'
'17504: 2f2e 000c movel %fp#(12),%sp#-'
'17508: 2f2e 0008 movel %fp#(8),%sp#-'
'1750c: 61ff bsrs 1750d <__ltdf2+0x19>'
'1750e: ffff .short 0xffff'
'17510: fd94 .short 0xfd94'
'17512: 4e5e unlk %fp'
'17514: 4e75 rts'
'17516: 4e71 nop'
Can someone explain why this code is generated or what is happening here? No way will a 68000 branch to an odd address.
UPDATE
I've been digging into this, and the problem appears to be injected during linking. Dumping the code for this function from libgcc.a shows the following:
`00000000 <__ltdf2>:`
` 0: 4e56 0000 linkw %fp,#0`
`4: 4878 0001 pea 1 <__ltdf2+0x1>`
` 8: 2f2e 0014 movel %fp#(20),%sp#-`
` c: 2f2e 0010 movel %fp#(16),%sp#-`
`10: 2f2e 000c movel %fp#(12),%sp#-`
`14: 2f2e 0008 movel %fp#(8),%sp#-`
`18: 61ff 0000 0000 bsrl 1a <__ltdf2+0x1a>`
`1e: 4e5e unlk %fp`
` 20: 4e75 rts`
So the linker must be trying to fill in the branch offset and messing up. Since the source for this function is a string of macros, I'm not sure where it really wanted to branch to.
On MC68020 and later, bra, bsr and bcc are followed by a 32bit displacement, if the 8bit displacement is 0xff. In your case this does not make sense, since 0xfffffd94 would fit into 16bit.
Make sure to compile (and that your softfloat-library is compiled) for 68000, if you don't have a 68020 or later.

Understanding disassembled binary from Objdump - What are the fields from the output

I get the following output when I disassembled a simple ARM binary file using the command "arm-linux-gnueabihf-objdump -d a.out"
00008480 <_start>:
8480: f04f 0b00 mov.w fp, #0
8484: f04f 0e00 mov.w lr, #0
8488: bc02 pop {r1}
848a: 466a mov r2, sp
What do different columns represent here? For example, 8480 and f04f 0b00 (from the 2nd line of code)
The first column is the address of the code in memory. 0x8480 means the memory address of this piece of code is 0x8480.
The second column is the hex representation of the code. f04f 0b00 means from memory address 0x8480(included) to 0x8484(excluded), there are 4 bytes of code, f0, 4f, 0b, 00.
The remaining is the disassembled code. They are disassembled from the code in second column.

How to get variable names out of a dump and a symbol file?

I'm debugging dump files, while I have access to the symbol files.
I'm using a script, which combines the results of following windbg commands:
x /2 *!* // which types are present in the symbol files?
!heap -h 0 // which memory address ranges are used in the dump?
(At least, that's how I understand it. In case I'm wrong don't hesitate to correct me)
The result of the script (called heap_stat.py, found under this list of Windbg extensions) is a list of memory addresses, followed by their type. By taking statistics of those I can derive if there is a memory leak.
In top of this, using the WinDbg command dt CStringArray m_nSize of the mentioned memory address (in the specific case of a CStringArray, of course), I can see the total amount of entries of the used CStringArray objects, and see if there are CStringArray objects with lots of entries.
However there is a drawback to this system:
When I find such a CStringArray object with a lot of entries, I'm stuck there because out of all the CStringArray objects in my application, I have no idea which one I'm dealing with.
One thing which might help is the name of the local variable the memory address is about, but there's the catch: I don't know if this information is present and in case yes, can I find it in the symbol file or in the dump, and (obviously) which command do I need to run in order to get this information (I didn't find a flag that makes dt <flag> <memory address> return the local variable name, occupied by <memory address>)?
Can anybody help me?
For clarification purposes, this is how the result of my script currently looks like:
0x0065a4d0 mfc110u!CStringArray Size:[1]
0x0065a4e4 mfc110u!CStringArray Size:[0]
0x0065a4f8 mfc110u!CStringArray Size:[295926]
0x0065a520 mfc110u!CStringArray Size:[0]
As you can see, I can see the memory address where my variable is stored, I can see the type (retrieved from the symbols files), and I can see the amount of entries (retrieved from the dt Windbg command), but I'd like to have an output like the following:
0x0065a4d0 mfc110u!CStringArray Size:[1] var1
0x0065a4e4 mfc110u!CStringArray Size:[0] var2
0x0065a4f8 mfc110u!CStringArray Size:[295926] var3
0x0065a520 mfc110u!CStringArray Size:[0] var3
or:
0x0065a4d0 mfc110u!CStringArray Size:[1] obj1.prop1
0x0065a4e4 mfc110u!CStringArray Size:[0] obj2.prop1
0x0065a4f8 mfc110u!CStringArray Size:[295926] obj1.prop2
0x0065a520 mfc110u!CStringArray Size:[0] obj1.prop2
Such an output would indicate me that I need to verify what is happening to var3 or obj1.prop2 in the source code.
I wrote the following MFC application (partial source):
CStringArray stringarray;
void* anotherarray = new CStringArray();
void CLocalVariableNameApp::AnotherMethod(void* a)
{
CStringArray* temp = static_cast<CStringArray*>(a);
temp->Add(L"Something else");
}
CLocalVariableNameApp::CLocalVariableNameApp()
{
stringarray.Add(L"Something");
AnotherMethod(anotherarray);
CStringArray* holycow = static_cast<CStringArray*>(anotherarray);
holycow->Add(L"Yet something else");
DebugBreak();
}
As you can see, the second CStringArray has multiple names: anotherarray, a, temp and holycow.
Observation 1: there will not be an easy 1:1 mapping on memory addresses and variable names.
Local variable names are available from the PDB files:
0:000> k
# ChildEBP RetAddr
00 0022fa40 00fa2a30 KERNELBASE!DebugBreak+0x2
01 0022fb3c 00f97958 LocalVariableName!CLocalVariableNameApp::CLocalVariableNameApp+0x90 [c:\...\localvariablename.cpp # 38]
02 0022fc10 014b628a LocalVariableName!`dynamic initializer for 'theApp''+0x28 [c:\...\localvariablename.cpp # 43]
03 0022fc18 014b5e8c LocalVariableName!_initterm+0x1a [f:\...\crt0dat.c # 955]
04 0022fc2c 014a9263 LocalVariableName!_cinit+0x6c [f:\...\crt0dat.c # 308]
05 0022fc78 014a947d LocalVariableName!__tmainCRTStartup+0xf3 [f:\...\crt0.c # 237]
06 0022fc80 7558336a LocalVariableName!wWinMainCRTStartup+0xd [f:\...\crt0.c # 165]
07 0022fc8c 771198f2 kernel32!BaseThreadInitThunk+0xe
08 0022fccc 771198c5 ntdll!__RtlUserThreadStart+0x70
09 0022fce4 00000000 ntdll!_RtlUserThreadStart+0x1b
0:000> .frame 1
01 0022fb3c 00f97958 LocalVariableName!CLocalVariableNameApp::CLocalVariableNameApp+0x90 [c:\...\localvariablename.cpp # 38]
0:000> dv
this = 0x01637118
holycow = 0x00445820
Note that the names a and temp are not visible.
Observation 2: variable names will be limited to their scope. If you want to remember all variable names, you would need to track all functions.
Above was a debug build. The release build is different:
0:000> k
# ChildEBP RetAddr
00 005efe04 002a1c18 KERNELBASE!DebugBreak+0x2
01 005efe20 002a1095 LocalVariableName!CLocalVariableNameApp::CLocalVariableNameApp+0x88 [c:\...\localvariablename.cpp # 39]
02 005efe24 003c0289 LocalVariableName!`dynamic initializer for 'theApp''+0x5 [c:\...\localvariablename.cpp # 43]
03 005efe38 003c01ea LocalVariableName!_initterm+0x29 [f:\...\crt0dat.c # 954]
04 005efe48 003bd280 LocalVariableName!_cinit+0x5a [f:\...\crt0dat.c # 321]
05 005efe88 7558336a LocalVariableName!__tmainCRTStartup+0xde [f:\...\crt0.c # 237]
06 005efe94 771198f2 kernel32!BaseThreadInitThunk+0xe
07 005efed4 771198c5 ntdll!__RtlUserThreadStart+0x70
08 005efeec 00000000 ntdll!_RtlUserThreadStart+0x1b
0:000> .frame 1
01 005efe20 002a1095 LocalVariableName!CLocalVariableNameApp::CLocalVariableNameApp+0x88 [c:\...\localvariablename.cpp # 39]
0:000> dv
this = 0x0043f420
Note that holycow is missing.
Observation 3: in release builds, variables might not be needed (optimized) and therefore not present.
Overall conclusion: it's not possible to map memory addresses to variable names.

Reliable Linux kernel timestamps (or adjustment thereof) with both usbmon and ftrace?

I'm trying to inspect a kernel module that utilizes usb, and so from the module itself I'm writing a message to ftrace using trace_printk; and then I wanted to inspect when does a USB Bulk Out URB Submit appear in the system after that.
The problem is that on my Ubuntu Lucid 11.04 (kernel 2.6.38-16), there are only local and global clocks in ftrace - and although their resolution is the same (microseconds) as the timestamps by usbmon, their values differ significantly.
So not knowing any better (as I couldn't find anywhere else talking about this), what I did was attempt to redirect usbmon to trace_marker, using cat:
# ... activate ftrace here ...
usbpid=$(sudo bash -c 'cat /sys/kernel/debug/usb/usbmon/2u > /sys/kernel/debug/tracing/trace_marker & echo $!')
sleep 3 # do test, etc.
sudo kill $usbpid
# ... deactivate ftrace here...
... and then, when I read from /sys/kernel/debug/tracing/trace, I get a log with problematic timestamps (see below). So what I'd like to know is:
Is there a way to make usbmon have it's messages appear directly in /debug/tracing/trace, instead of in /debug/usb/usbmon/2u ? (not that I can see, but I'd like to have this confirmed)
If not, is there a better way to "directly" redirect output of /sys/kernel/debug/usb/usbmon/2u without any possible overhead/buffering issues of cat and/or shell redirection?
If not, is there some sort of an algorithm, where I could use the extra usbmon timestamp, to somehow "correct" the position of these events in the kernel timestamp domain? (see example below)
Here is a brief example snippet of a /sys/kernel/debug/tracing/trace log I got:
<idle>-0 [000] 44989.403572: my_kernel_function: 1 00 00 64 1 64 5
<...>-29765 [000] 44989.403918: my_kernel_function: 1 00 00 64 2 128 2
<...>-29787 [000] 44989.404202: 0: f1f47280 3237249791 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<...>-29787 [000] 44989.404234: 0: f1f47080 3237250139 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<idle>-0 [000] 44989.404358: my_kernel_function: 1 00 00 64 3 192 4
<...>-29787 [000] 44989.404402: 0: f1f47c00 3237250515 S Bo:2:002:2 -115 64 = 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
So when the kernel timestamp is 44989.404202, the usbmon timestamp is 3237.249791 (= 3237249791/1e6); neither the seconds nor the microseconds part match. To make it a bit easier on the eyes, here's the same snippet with only time information remaining:
(1) 44989.403572 MYF 0
(2) 44989.403918 MYF 0.000346
(3) 44989.404202 USB | 0 3237.249791 0
(4) 44989.404234 USB | 0.000032 3237.250139 0.000348
(5) 44989.404358 MYF 0.000440 | |
(6) 44989.404402 USB 0.000168 3237.250515 0.000376
So judging by the kernel timestamps, 32 μs expired between event (3) and event (4) - but judging by the usbmon timestamps, 348 μs expired between the same events! Whom to trust now?!
Now, if we assume that the usbmon timestamps are more correct for those messages, given they got "printed" before they ended up in the ftrace buffer to begin with - we could assume that the first usb message (3) may have been scheduled right after (1) executed, but something preempted it - and so the second USB message (4) triggered the "printout" (or rather, "entry") of both (3) and (4) in the ftrace buffer (which is why their kernel timestamps are so clcse together?)
So, if I assume (4) is the more correct one, I can try push (3) back for 348 μs:
(1) 44989.403572 MYF 0
(3) 44989.403854 USB | 0 3237.249791 0
(2) 44989.403918 MYF 0.000346 | |
(4) 44989.404234 USB | 0.000380 3237.250139 0.000348
(5) 44989.404358 MYF 0.000440 | |
(6) 44989.404402 USB 0.000168 3237.250515 0.000376
... and that sort of looks better (also USB now fires 282 μs, 316 μs, and 44 μs after MYF) - for first and second MYF/USB pairs (if that is indeed how they behave); but then the third step doesn't really match, and so on... Cannot really think of an algorithm to help me adjust the USB events position according to the data in the usbmon timestamp...
While the best approach for redirecting usbmon output to ftrace is still an open question, I got an answer about correlating their timestamps from this thread:
Using both usbmon and ftrace? [linux-usb mailing list]
You can call the following
subroutine to get a usbmon-style timestamp value, which can then be
added to an ftrace message or simply printed in the kernel log:
#include <linux/time.h>
static unsigned usbmon_timestamp(void)
{
struct timeval tval;
unsigned stamp;
do_gettimeofday(&tval);
stamp = tval.tv_sec & 0xFFF;
stamp = stamp * 1000000 + tval.tv_usec;
return stamp;
}
For example,
pr_info("The usbmon time is: %u\n", usbmon_timestamp());

Resources