Parsing an ASCII text file using GAWK - bash

I have been trying to parse an ASCII text file of the following format --
0 0 0x2de0 [0x98]: PERF_RECORD_MMAP -1/0: [0xffffffffc06ae000(0x5000) # 0]: x /lib/modules/4.4.0-83-generic/kernel/net/ipv4/netfilter/nf_reject_ipv4.ko
0x2e78 [0x90]: event: 1
.
. ... raw event: size 144 bytes
. 0000: 01 00 00 00 01 00 90 00 ff ff ff ff 00 00 00 00 ................
. 0010: 00 30 6b c0 ff ff ff ff 00 50 00 00 00 00 00 00 .0k......P......
. 0020: 00 00 00 00 00 00 00 00 2f 6c 69 62 2f 6d 6f 64 ......../lib/mod
. 0030: 75 6c 65 73 2f 34 2e 34 2e 30 2d 38 33 2d 67 65 ules/4.4.0-83-ge
. 0040: 6e 65 72 69 63 2f 6b 65 72 6e 65 6c 2f 6e 65 74 neric/kernel/net
. 0050: 2f 69 70 76 34 2f 6e 65 74 66 69 6c 74 65 72 2f /ipv4/netfilter/
. 0060: 69 70 74 5f 52 45 4a 45 43 54 2e 6b 6f 00 2e 6b ipt_REJECT.ko..k
. 0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0 0 0x2e78 [0x90]: PERF_RECORD_MMAP -1/0: [0xffffffffc06b3000(0x5000) # 0]: x /lib/modules/4.4.0-83-generic/kernel/net/ipv4/netfilter/ipt_REJECT.ko
0x2f08 [0x88]: event: 1
.
. ... raw event: size 136 bytes
. 0000: 01 00 00 00 01 00 88 00 ff ff ff ff 00 00 00 00 ................
. 0010: 00 80 6b c0 ff ff ff ff 00 50 00 00 00 00 00 00 ..k......P......
. 0020: 00 00 00 00 00 00 00 00 2f 6c 69 62 2f 6d 6f 64 ......../lib/mod
. 0030: 75 6c 65 73 2f 34 2e 34 2e 30 2d 38 33 2d 67 65 ules/4.4.0-83-ge
. 0040: 6e 65 72 69 63 2f 6b 65 72 6e 65 6c 2f 6e 65 74 neric/kernel/net
. 0050: 2f 6e 65 74 66 69 6c 74 65 72 2f 78 74 5f 74 63 /netfilter/xt_tc
. 0060: 70 75 64 70 2e 6b 6f 00 00 00 00 00 00 00 00 00 pudp.ko.........
. 0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0080: 00 00 00 00 00 00 00 00
........[some other data]........
0x11590 [0x30]: PERF_RECORD_AUXTRACE size: 0x2002a0 offset: 0 ref: 0x2d44e6441a3c2 idx: 0 tid: -1 cpu: 0
.
. ... Intel Processor Trace data: size 2097824 bytes
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
--- continued ---
The file contains several kinds of headers. My snippet shows two of them:
PERF_RECORD_MMAP and PERF_RECORD_AUXTRACE
There will be other headers in the file as well.
I only want to consider the records whose header contains PERF_RECORD_AUXTRACE, and to collect only the data that follows each such header (i.e. everything starting with "Intel Processor Trace data"). The PERF_RECORD_AUXTRACE header also has a size field, which I can use to determine how much data belongs to that record.
Edit #1 :
So basically, given the above input file snippet, I want the output to be of the following form (all the lines after record containing PERF_RECORD_AUXTRACE)...
.
. ... Intel Processor Trace data: size 2097824 bytes
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
--- continued ---
EDIT #2 : This is another requirement that I have --
If I have an input snippet like below --
0 0 0x230 [0x60]: PERF_RECORD_MMAP -1/0: [0xffffffff81000000(0x3f000000) # 0xffffffff81000000]: x [kernel.kallsyms]_text
0x290 [0x88]: event: 1
.
. ... raw event: size 136 bytes
. 0000: 01 00 00 00 01 00 88 00 ff ff ff ff 00 00 00 00 ................
. 0010: 00 00 00 c0 ff ff ff ff 00 90 00 00 00 00 00 00 ................
. 0020: 00 00 00 00 00 00 00 00 2f 6c 69 62 2f 6d 6f 64 ......../lib/mod
. 0030: 75 6c 65 73 2f 34 2e 34 2e 30 2d 38 33 2d 67 65 ules/4.4.0-83-ge
. 0040: 6e 65 72 69 63 2f 6b 65 72 6e 65 6c 2f 64 72 69 neric/kernel/dri
. 0050: 76 65 72 73 2f 61 74 61 2f 6c 69 62 61 68 63 69 vers/ata/libahci
. 0060: 2e 6b 6f 00 00 00 00 00 00 00 00 00 00 00 00 00 .ko.............
. 0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0080: 00 00 00 00 00 00 00 00 ........
0x11590 [0x30]: PERF_RECORD_AUXTRACE size: 0x2002a0 offset: 0 ref: 0x2d44e6441a3c2 idx: 0 tid: -1 cpu: 0
.
. ... Intel Processor Trace data: size 2097824 bytes
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
. 00000036: 02 c8 c2 3a 7c 00 00 00 VMCS 0x7c3ac2
0 0 0x290 [0x88]: PERF_RECORD_MMAP -1/0: [0xffffffffc0000000(0x9000) # 0]: x /lib/modules/4.4.0-83-generic/kernel/drivers/ata/libahci.ko
0x318 [0x98]: event: 1
.
. ... raw event: size 152 bytes
. 0000: 01 00 00 00 01 00 98 00 ff ff ff ff 00 00 00 00 ................
. 0010: 00 90 00 c0 ff ff ff ff 00 50 00 00 00 00 00 00 .........P......
. 0020: 00 00 00 00 00 00 00 00 2f 6c 69 62 2f 6d 6f 64 ......../lib/mod
. 0030: 75 6c 65 73 2f 34 2e 34 2e 30 2d 38 33 2d 67 65 ules/4.4.0-83-ge
. 0040: 6e 65 72 69 63 2f 6b 65 72 6e 65 6c 2f 64 72 69 neric/kernel/dri
. 0050: 76 65 72 73 2f 76 69 64 65 6f 2f 66 62 64 65 76 vers/video/fbdev
. 0060: 2f 63 6f 72 65 2f 66 62 5f 73 79 73 5f 66 6f 70 /core/fb_sys_fop
. 0070: 73 2e 6b 6f 00 00 00 00 00 00 00 00 00 00 00 00 s.ko............
. 0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
. 0090: 00 00 00 00 00 00 00 00 ........
0x11590 [0x30]: PERF_RECORD_AUXTRACE size: 0x2002a0 offset: 0 ref: 0x2d44e6441a3c2 idx: 0 tid: -1 cpu: 0
.
. ... Intel Processor Trace data: size 2097824 bytes
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
. 00000036: 02 c8 c2 3a 7c 00 00 00 VMCS 0x7c3ac2
I would only need the data under the records containing PERF_RECORD_AUXTRACE, like the output below. It would be great if the first line of each such record, the one containing
Intel Processor Trace data: size 2097824 bytes
could also be left out of my output.
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
. 00000000: 02 82 02 82 02 82 02 82 02 82 02 82 02 82 02 82 PSB
. 00000010: 00 00 00 PAD
. 00000013: 99 20 MODE.TSX TXAbort:0 InTX:0
. 00000015: 99 01 MODE.Exec 64
. 00000017: 7d 08 45 06 81 ff ff 00 FUP 0xffff81064508
. 0000001f: 00 00 00 00 00 00 00 PAD
. 00000026: 02 43 00 76 49 1f 00 00 PIP 0xfa4bb00 (NR=0)
. 0000002e: 00 00 00 00 00 00 00 00 PAD
Edit #3 : This is what I initially tried, which obviously does not work:
cat "$file" | gawk -F' ' -- '
/PERF_RECORD_AUXTRACE / {
offset = strtonum($1)
hsize = strtonum(substr($2, 2))
size = strtonum($5)
idx = strtonum($11)
ext = ""
ofile = sprintf("raw-pt.txt")
begin = offset + hsize
cmd = sprintf("dd if=%s of=%s conv=notrunc oflag=append ibs=1 " \
"count=%d status=none", file, ofile, size)
#!cmd = sprintf("sed p")
if (dry_run != 0) {
print cmd
}
else {
system(cmd)
}
}
I am not quite sure how I can properly parse this file to get exactly what I want. I am also not sure whether using Python would help.
How can I resolve this?

Getting the output you say you want from the input you posted is just:
awk 'f; /PERF_RECORD_AUXTRACE/{f=1}' file
If that's not actually all you want then edit your question to clarify your requirements and provide different sample input/output that more truly demonstrates your problem if necessary.
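Since the question also asks whether Python would help, here is a minimal sketch for the Edit #2 requirement (only the dump lines under each PERF_RECORD_AUXTRACE record, without the "Intel Processor Trace data" line). It rests on an assumption drawn from the posted sample only: every dump line starts with a dot followed by a hex offset and a colon, and any line that does not start with a dot begins a new record. The function name auxtrace_lines is made up for illustration. Unlike the sample output in the question, it keeps every dump line of the record, including the trailing VMCS line.

```python
import re

# Assumption: dump lines look like ". 00000010: 00 00 00 PAD"
# (a dot, whitespace, a hex offset, a colon). Lines such as "." or
# ". ... Intel Processor Trace data: ..." do not match and are skipped.
DUMP_LINE = re.compile(r"^\.\s+[0-9a-fA-F]+:")


def auxtrace_lines(lines):
    """Yield the dump lines that belong to PERF_RECORD_AUXTRACE records."""
    in_auxtrace = False
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("."):
            # A line without a leading dot is treated as a record header;
            # remember whether it is the kind of record we want.
            in_auxtrace = "PERF_RECORD_AUXTRACE" in line
        elif in_auxtrace and DUMP_LINE.match(line):
            yield line


def print_auxtrace(path):
    """Print the selected lines of the text file at `path`."""
    with open(path) as handle:
        for selected in auxtrace_lines(handle):
            print(selected)
```

Calling print_auxtrace("perf-dump.txt") (a placeholder file name) would stream the file once, so memory use should stay small even for a multi-megabyte trace.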

Related

how do i read this file user data

How do I read this file?
14 00 08 00 00 00 0D 30 D5 4E 00 00 00 00 2A 01 00 00 2A 01 00 00 37 00 00 00 50 69 63 74 75 72 65 73 2F 52 65 73 74 6F 72 65 64 2F 50 69 63 74 75 72 65 73 2F 6E 6F 6D 6F 72 2F 33 44 20 4F 62 6A 65 63 74 73 2F 64 65 73 6B 74 6F 70 2E 69 6E 69 FF FE 0D 00 0A 00 5B 00 2E 00 53 00 68 00 65 00 6C 00 6C 00 43 00 6C 00 61 00 73 00 73 00 49 00 6E 00 66 00 6F 00 5D 00 0D 00 0A 00 4C 00 6F 00 63 00 61 00 6C 00 69 00 7A 00 65 00 64 00 52 00 65 00 73 00 6F 00 75 00 72 00 63 00 65 00 4E 00 61 00 6D 00 65 00 3D 00 40 00 25 00 53 00 79 00 73 00 74 00 65 00 6D 00 52 00 6F 00 6F 00 74 00 25 00 5C 00 73 00 79 00 73 00 74 00 65 00 6D 00 33 00 32 00 5C 00 77 00 69 00 6E 00 64 00 6F 00 77 00 73 00 2E 00 73 00 74 00 6F 00 72 00 61 00 67 00 65 00 2E 00 64 00 6C 00 6C 00 2C 00 2D 00 32 00 31 00 38 00 32 00 35 00 0D 00 0A 00 49 00 63 00 6F 00 6E 00 52 00 65 00 73 00 6F 00 75 00 72 00 63 00 65 00 3D 00 25 00 53 00 79 00 73 00 74 00 65 00 6D 00 52 00 6F 00 6F 00 74 00 25 00 5C 00 73 00 79 00 73 00 74 00 65 00 6D 00 33 00 32 00 5C 00 69 00 6D 00 61 00 67 00 65 00 72 00 65 00 73 00 2E 00 64 00 6C 00 6C 00 2C 00 2D 00 31 00 39 00 38 00 0D 00 0A 00 50 4B 07 08 90 4D F9 6E 2A 01 00 00 2A 01 00 00 50 4B 03 04 14 00 08 00 00 00 0D 30 D5 4E 00 00 00 00 153 01 00 00 153 01 00 00 35 00 00 00 50 69 63 74 75 72 65 73 2F 52 65 73 74 6F 72 65 64 2F 50 69 63 74 75 72 65 73 2F 6E 6F 6D 6F 72 2F 43 6F 6E 74 61 63 74 73 2F 64 65 73 6B 74 6F 70 2E 69 6E 69 FF FE 0D 00 0A 00 5B 00 2E 00 53 00 68 00 65 00 6C 00 6C 00 43 00 6C 00 61 00 73 00 73 00 49 00 6E 00 66 00 6F 00 5D 00 0D 00 0A 00 4C 00 6F 00 63 00 61 00 6C 00 69 00 7A 00 65 00 64 00 52 00 65 00 73 00 6F 00 75 00 72 00 63 00 65 00 4E 00 61 00 6D 00 65 00 3D 00 40 00 25 00 43 00 6F 00 6D 00 6D 00 6F 00 6E 00 50 00 72 00 6F 00 67 00 72 00 61 00 6D 00 46 00 69 00 6C##
It is encoded in hexadecimal.
You could, for example, use an online converter to display the text.
Convert hexadecimal to text
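The same conversion can be done locally. The sketch below is only an illustration: the helper name hex_to_bytes is made up, and the two excerpts are copied from the dump in the question. Part of that dump is plain ASCII (a path) and part is UTF-16 text introduced by the byte-order mark FF FE, so the two pieces are decoded differently. The dump as pasted also contains tokens such as 153 that are not valid byte values, and bytes.fromhex would reject those with a ValueError until they are corrected.

```python
def hex_to_bytes(dump: str) -> bytes:
    """Turn a whitespace-separated hex dump such as '50 69 63' into bytes."""
    return bytes.fromhex(dump)


# Excerpt 1: plain ASCII bytes from the dump.
path_bytes = hex_to_bytes("50 69 63 74 75 72 65 73 2F 52 65 73 74 6F 72 65 64")
path_text = path_bytes.decode("ascii")

# Excerpt 2: starts with the UTF-16 byte-order mark FF FE, so the
# "utf-16" codec can pick the byte order from the BOM itself.
ini_bytes = hex_to_bytes("FF FE 0D 00 0A 00 5B 00 2E 00 53 00 68 00 65 00 6C 00 6C 00")
ini_text = ini_bytes.decode("utf-16")
```

Here path_text holds a path fragment and ini_text the beginning of what looks like a desktop.ini file.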

Outlook calendar item clipboard format documentation?

Short Version
Is there any documentation on the Outlook RenPrivateAppointment clipboard format used to transfer appointments?
Long version
As a reminder, for anything on the clipboard, the source application can present you the data in a number of different formats. The receiver can go through the list, in order, and decide which format it understands the best.
In the case of my Outlook appointment, the formats are:
0: "RenPrivateSourceFolder" (IStream)
1: "RenPrivateMessages" (IStream)
2: "RenPrivateItem" (HGlobal)
3: "FileGroupDescriptor" (HGlobal)
4: CFSTR_FILEDESCRIPTOR (HGlobal)
5: CFSTR_FILENAME (File)
6: CFSTR_FILECONTENTS (IStream, IStorage)
7: "Object Descriptor" (HGlobal)
8: "RenPrivateAppointment" (IStream)
9: CF_TEXT (HGlobal)
10: CF_UNICODETEXT (HGlobal)
Looking at the content of the various formats, the most promising looks like the RenPrivateAppointment format:
01 00 00 00 C0 C8 1E 0D 60 CE 1E 0D 01 00 00 00 ....ÀÈ.`Î......
6A CB 1E 0D 79 CB 1E 0D 41 00 00 00 41 73 6B 20 jË..yË..A...Ask
71 75 65 73 74 69 6F 6E 20 61 62 6F 75 74 20 61 question about a
70 70 6F 69 6E 74 6D 65 6E 74 20 63 6C 69 70 62 ppointment clipb
6F 61 72 64 20 66 6F 72 6D 61 74 20 6F 6E 20 53 oard format on S
74 61 63 6B 6F 76 65 72 66 6C 6F 77 00 02 00 00 tackoverflow...
00 02 00 00 00 18 00 00 00 00 00 00 00 BC B9 6E ............¼¹n
9C 12 F8 D3 43 AC B7 74 81 5E F0 3D FC 04 D2 97 œ.øÓC¬·t.^ð=ü.Ò—
00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 ...............
00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 FF 92 81 02 41 00 73 00 6B 00 20 00 71 00 75 .ÿ’.A.s.k. .q.u
00 65 00 73 00 74 00 69 00 6F 00 6E 00 20 00 61 .e.s.t.i.o.n. .a
00 62 00 6F 00 75 00 74 00 20 00 61 00 70 00 70 .b.o.u.t. .a.p.p
00 6F 00 69 00 6E 00 74 00 6D 00 65 00 6E 00 74 .o.i.n.t.m.e.n.t
00 20 00 63 00 6C 00 69 00 70 00 62 00 6F 00 61 . .c.l.i.p.b.o.a
00 72 00 64 00 20 00 66 00 6F 00 72 00 6D 00 61 .r.d. .f.o.r.m.a
00 74 00 20 00 6F 00 6E 00 20 00 53 00 74 00 61 .t. .o.n. .S.t.a
00 63 00 6B 00 6F 00 76 00 65 00 72 00 66 00 6C .c.k.o.v.e.r.f.l
00 6F 00 77 00 00 00 01 00 00 00 00 00 FF FF FF .o.w.........ÿÿÿ
FF ÿ
Some of this can be interpreted:
Clipboard format "RenPrivateAppointment"
01 00 00 00 ; always 0x00000001 (Version 1?)
C0 C8 1E 0D ; Start day of appt. minutes from 1/1/1601 0x0D1EC8C0 = 220,121,280 minutes = 7/11/2019 12:00 am
60 CE 1E 0D ; End day of appt. minutes from 1/1/1601 0x0D1ECE60 = 220,122,720 minutes = 7/12/2019 12:00 am
01 00 00 00 ; 0x00000001 (fixed)
6A CB 1E 0D ; Start of appt. minutes from 1/1/1601 0x0D1ECB6A = 220,121,962 minutes = 7/11/2019 11:22 am
79 CB 1E 0D ; End of appt. minutes from 1/1/1601 0x0D1ECB79 = 220,121,977 minutes = 7/11/2019 11:37 am
; "Ask question about appointment clipboard format on Stackoverflow\0"
41 00 00 00 ; String length prefix, including null terminator (0x00000041 = 65 characters)
41 73 6B 20 71 75 65 73 Ask ques
74 69 6F 6E 20 61 62 6F tion abo
75 74 20 61 70 70 6F 69 ut appoi
6E 74 6D 65 6E 74 20 63 ntment c
6C 69 70 62 6F 61 72 64 lipboard
20 66 6F 72 6D 61 74 20 format
6F 6E 20 53 74 61 63 6B on Stack
6F 76 65 72 66 6C 6F 77 overflow
00 .
02 00 00 00 ; 0x00000002 = 2
02 00 00 00 ; 0x00000002 = 2
18 00 00 00 ; 0x00000018 = 24
00 00 00 00 ; 0x00000000 = 0
BC B9 6E 9C 12 F8 D3 43 ; always
AC B7 74 81 5E F0 3D FC ; always
04 D2 97 00 ; varies (~32 ticks per day) 0x0097D204 = 9,949,700
00 00 00 00
00 00 00 00
02 00 00 00 ; 0x00000002 = 2
00 00 00 00
01 00 00 00 ; 0x00000001 = 1
00 00 00 00
00 00 00 00
00 00 00 00
FF 92 81 02 ; always 0x028192FF
; N"Ask question about appointment clipboard format on Stackoverflow\0"
41 00 73 00 6B 00 20 00 71 00 75 00 65 00 73 00 A.s.k. .q.u.e.s.
74 00 69 00 6F 00 6E 00 20 00 61 00 62 00 6F 00 t.i.o.n. .a.b.o.
75 00 74 00 20 00 61 00 70 00 70 00 6F 00 69 00 u.t. .a.p.p.o.i.
6E 00 74 00 6D 00 65 00 6E 00 74 00 20 00 63 00 n.t.m.e.n.t. .c.
6C 00 69 00 70 00 62 00 6F 00 61 00 72 00 64 00 l.i.p.b.o.a.r.d.
20 00 66 00 6F 00 72 00 6D 00 61 00 74 00 20 00 .f.o.r.m.a.t. .
6F 00 6E 00 20 00 53 00 74 00 61 00 63 00 6B 00 o.n. .S.t.a.c.k.
6F 00 76 00 65 00 72 00 66 00 6C 00 6F 00 77 00 o.v.e.r.f.l.o.w.
00 00 ..
01 00 ; padding to DWORD
00 00 00 00
FF FF FF FF ; footer
Is there any documentation on RenPrivateAppointment, or any of the other formats that would allow rich interactions by the user?
Note: This is not automating Outlook. This is handling the IDataObject placed on the clipboard by Outlook. I want to retrieve:
start time
end time
description
See also
C# parse outlook calendar item (I'm not using C#)
microsoft.public.win32.programmer.ole: Identify correctly outlook items in Drag and Drop.
There is a project on GitHub that parses the RenPrivateAppointment clipboard format: https://github.com/yasoonOfficial/outlook-dndprotocol
The RenPrivateAppointment format isn't documented. You may read about that on the DragDrop Event in Outlook Calendar thread which has an official comment from a VSTO team member. Also, you may take a look at the Drag and Drop with Outlook page.
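Because the format is undocumented, any parser can only follow the layout reverse-engineered in the question. The sketch below does that for the leading fields: seven little-endian 32-bit values (version, day start, day end, a fixed 1, start, end, string length) followed by a null-terminated single-byte string. The function names, the dictionary keys, and the choice of latin-1 for the single-byte string are assumptions made for illustration, not anything confirmed by Microsoft.

```python
import struct
from datetime import datetime, timedelta

# The question interprets the timestamps as minutes counted from 1/1/1601.
EPOCH_1601 = datetime(1601, 1, 1)


def minutes_since_1601(minutes: int) -> datetime:
    """Convert a minute count from 1601-01-01 into a naive datetime."""
    return EPOCH_1601 + timedelta(minutes=minutes)


def parse_appointment_prefix(data: bytes) -> dict:
    """Read the leading fields of a RenPrivateAppointment blob (assumed layout)."""
    # Seven little-endian unsigned 32-bit integers = 28 bytes.
    (version, day_start, day_end, _fixed,
     start, end, length) = struct.unpack_from("<7I", data, 0)
    # `length` counts the bytes of the string including its null terminator.
    raw_subject = data[28:28 + length]
    subject = raw_subject.split(b"\x00", 1)[0].decode("latin-1")
    return {
        "version": version,
        "day_start": minutes_since_1601(day_start),
        "day_end": minutes_since_1601(day_end),
        "start": minutes_since_1601(start),
        "end": minutes_since_1601(end),
        "subject": subject,
    }
```

Fed the bytes shown in the question, this should give back the start time, end time and description the asker wants. The fields after the first string are left alone because their meaning is not known.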

How to extract bitmap images from .slide file generated by a CytoVision Platform

I am working with a neural network to classify images.
I have some files generated by a CytoVision Platform. I would like to use the images in those files, but I need to extract them somehow.
These .slide files contain several images, apparently 16 KB each.
I have developed a C program, which I am currently running on Linux, that splits the file into 16 KB chunks and writes each one to its own file. I would still need to build a header in order to use those images.
I don't know which format they have.
If I look at the entire file as a bitmap with FileAlyzer I can see this:
File as a bitmap
This link should allow anyone to download an example file:
https://ufile.io/2ibdq
This is what seems to be one image header:
42 4D 31 00 00 00 00 00 40 8F 40 05 00 9E 5F 98 D7 47 60 A1 40 01 04 4D 65 74 31 00 00 00 00 00 40 8F 40 05 00 64 31 2E 29 B5 46 DC 40 01 04 4D 65 74 32 00 00 00 00 00 40 8F 40 05 00 87 7D 26 70 88 C0 C5 40 01 04 4D 65 74 33 00 00 00 00 00 40 8F 40 05 00 C8 97 53 05 BB 0D 0F 41 01 04 54 65 78 31 00 00 00 00 00 00 D0 40 05 00 00 00 00 00 00 40 5C 40 07 04 54 65 78 32 00 00 00 00 00 00 D0 40 05 00 00 00 00 00 00 00 44 40 07 04 54 65 78 33 00 00 00 00 00 00 D0 40 05 00 00 00 00 00 00 90 76 40 07 04 54 65 78 34 00 00 00 00 00 00 D0 40 05 00 00 00 00 00 00 F4 CD 40 07 0A 43 68 72 6F 6D 73 41 72 65 61 00 00 00 00 00 4C BD 40 05 00 F3 76 84 D3 82 85 74 40 07 08 42 6F 75 6E 64 61 72 79 00 00 00 00 00 88 B3 40 05 00 D9 CE F7 53 E3 AD 7E 40 07 04 41 72 65 61 00 00 00 00 00 88 B3 40 05 00 20 EF 55 2B 13 0B 85 40 07 07 4F 62 6A 65 63 74 73 00 00 00 00 00 00 69 40 05 00 00 00 00 00 00 00 18 40 03 04 43 69 72 63 00 00 00 00 00 40 8F 40 05 00 9D E5 51 0E 5C 34 65 40 03 03 42 47 52 00 00 00 00 00 40 8F 40 05 00 7D 0C CE C7 E0 AC 86 40 03 04 54 65 78 35 00 00 00 00 00 00 D0 40 05 00 00 00 00 00 00 00 53 40 07 04 41 52 41 54 00 00 00 00 00 40 8F 40 05 00 86 89 F7 23 A7 79 7E 40 07 05 43 6C 61 73 73 00 00 00 00 00 00 F0 3F 05 00 00 00 00 00 00 00 F0 BF 00 01 00 00 00 01 00 00 00
With notepad++ I can see the previous hex like this:
BM1 #? ??G`?Met1 #? d1.)??Met2 #? ?&p?bMet3 #? ?S?ATex1 ? #\#Tex2 ? D#Tex3 ? ?#Tex4 ? ??
ChromsArea L? ???t#Boundary ?# ???~#Area ?# ?bObjects i# #Circ #? ?Q\4e#BGR #? }??#Tex5 ? S#ARAT #? Ð??~#Class ?? ?? #
Hope someone can give me an idea about the format of the images and what info I can extract from the header.

Conflicting answers when I try to find out endian-ness of my Macbook Pro

I have a Macbook Pro and am getting contradictory answers when I try to determine its endian-ness.
Method 1
python -c "import sys;print sys.byteorder" tells me I am on a little endian system
Method 2
I have a text file. I used iconv to convert it into UTF-16. It's supposed to detect the endianness of the computer and convert the file into that format. So here I go:
iconv -f utf-8 -t utf-16 file.txt > utf16.txt
file utf16.txt
utf16.txt: Big-endian UTF-16 Unicode English text
vi utf16.txt works and hexdump -C utf16.txt shows:
00000000 fe ff 00 33 00 39 00 38 00 31 00 36 00 30 00 38 |...3.9.8.1.6.0.8|
00000010 00 09 00 54 00 69 00 61 00 20 00 4a 00 75 00 61 |...T.i.a. .J.u.a|
00000020 00 6e 00 61 00 20 00 52 00 69 00 76 00 65 00 72 |.n.a. .R.i.v.e.r|
00000030 00 09 00 54 00 69 00 61 00 20 00 4a 00 75 00 61 |...T.i.a. .J.u.a|
00000040 00 6e 00 61 00 20 00 52 00 69 00 76 00 65 00 72 |.n.a. .R.i.v.e.r|
00000050 00 09 00 52 00 69 00 6f 00 20 00 54 00 69 00 61 |...R.i.o. .T.i.a|
00000060 00 6a 00 75 00 61 00 6e 00 61 00 2c 00 52 00 69 |.j.u.a.n.a.,.R.i|
00000070 00 6f 00 20 00 54 00 69 00 6a 00 75 00 61 00 6e |.o. .T.i.j.u.a.n|
00000080 00 61 00 2c 00 52 00 ed 00 6f 00 20 00 54 00 69 |.a.,.R...o. .T.i|
00000090 00 6a 00 75 00 61 00 6e 00 61 00 2c 00 54 00 69 |.j.u.a.n.a.,.T.i|
If I convert it to little-endian and manually insert a BOM like this:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le file.txt ) > UTF16LEBOM.txt
file UTF16LEBOM.txt
UTF16LEBOM.txt: Little-endian UTF-16 Unicode English text
vi UTF16LEBOM.txt works
and hexdump -C UTF16LEBOM.txt shows
00000000 ff fe 33 00 39 00 38 00 31 00 36 00 30 00 38 00 |..3.9.8.1.6.0.8.|
00000010 09 00 54 00 69 00 61 00 20 00 4a 00 75 00 61 00 |..T.i.a. .J.u.a.|
00000020 6e 00 61 00 20 00 52 00 69 00 76 00 65 00 72 00 |n.a. .R.i.v.e.r.|
00000030 09 00 54 00 69 00 61 00 20 00 4a 00 75 00 61 00 |..T.i.a. .J.u.a.|
00000040 6e 00 61 00 20 00 52 00 69 00 76 00 65 00 72 00 |n.a. .R.i.v.e.r.|
00000050 09 00 52 00 69 00 6f 00 20 00 54 00 69 00 61 00 |..R.i.o. .T.i.a.|
00000060 6a 00 75 00 61 00 6e 00 61 00 2c 00 52 00 69 00 |j.u.a.n.a.,.R.i.|
00000070 6f 00 20 00 54 00 69 00 6a 00 75 00 61 00 6e 00 |o. .T.i.j.u.a.n.|
00000080 61 00 2c 00 52 00 ed 00 6f 00 20 00 54 00 69 00 |a.,.R...o. .T.i.|
00000090 6a 00 75 00 61 00 6e 00 61 00 2c 00 54 00 69 00 |j.u.a.n.a.,.T.i.|
From this link:
The other approach is to include a magic number, such as 0xFEFF,
before every piece of data. If you read the magic number and it is
0xFEFF, it means the data is in the same format as your machine, and
all is well.
If you read the magic number and it is 0xFFFE (it is backwards), it
means the data was written in a format different from your own. You'll
have to translate it.
Who is right and why am I getting contradictory answers?
"Endian-ness of my Macbook Pro" means nothing. You need to be more specific; different applications will have different impressions. As you've just seen, you can arbitrarily encode bytes in a file. In the end, a series of bytes is just that, and files are ultimately just a series of bytes that can be read in either fashion. In the context of programming (Stack Overflow) what's important is knowing a) Whether the input you are getting is in Big Endian or Little Endian, and b) Whether the output you send should be in Big Endian or Little Endian.
If your question is about how files are conventionally read, the answer is usually Little Endian. But, for example, network data tends to be Big Endian.
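To make the distinction concrete, here is a small sketch that keeps the two questions apart: the byte order of the machine (what sys.byteorder reports) and the byte order of one particular UTF-16 file (what its BOM says). The helper name utf16_bom_order is made up for illustration, and it only looks at the two-byte UTF-16 BOMs seen in the hexdumps above.

```python
import sys


def utf16_bom_order(data: bytes):
    """Return 'big' or 'little' based on a leading UTF-16 BOM, else None."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None


# Byte order of the CPU the interpreter runs on.
machine_order = sys.byteorder

# First bytes of the two files from the question (BOM followed by "3").
big_endian_file = bytes.fromhex("fe ff 00 33")
little_endian_file = bytes.fromhex("ff fe 33 00")

# Both decode to the same text on any machine, because the "utf-16"
# codec takes the byte order from the BOM, not from the CPU.
same_text = big_endian_file.decode("utf-16") == little_endian_file.decode("utf-16")
```

So both results in the question can be true at once: the machine is little-endian, and iconv wrote that particular file in big-endian order.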

Foreign Characters Appearing In Git-Managed Files

I am using git 1.7.2.3 via cygwin on Windows 7 and seeing strange artifacts appearing in some of my source files when switching branches. git status reports everything as unchanged, yet the crazy characters are present. I've confirmed on GitHub that the files are as they should be in the repo.
My Copy:
਍        ⼀⼀⼀ 㰀猀甀洀洀愀爀礀㸀ഀഀ
/// Set up method.
਍        ⼀⼀⼀ 㰀⼀猀甀洀洀愀爀礀㸀ഀഀ
[SetUp]
਍        瀀甀戀氀椀挀 漀瘀攀爀爀椀搀攀 瘀漀椀搀 匀攀琀甀瀀⠀⤀ഀഀ
{
਍            琀栀椀猀⸀匀挀漀瀀攀 㴀 渀攀眀 吀爀愀渀猀愀挀琀椀漀渀匀挀漀瀀攀⠀⤀㬀ഀഀ
਍            琀栀椀猀⸀琀攀猀琀䤀琀攀洀 㴀 渀攀眀 嘀椀攀眀䐀漀挀甀洀攀渀琀䠀椀猀琀漀爀礀⠀ ഀഀ
625016,
਍                㔀㜀㤀㤀㘀Ⰰ ഀഀ
'T',
਍                ㌀㐀㠀㌀㔀㈀㤀Ⰰ ഀഀ
DateTime.Parse("2003-01-08 09:57:04.957"),
਍                ㌀Ⰰ ഀഀ
"Invoice (PG-PS) - SUPP(11/16/2008)",
਍                ∀䘀䤀一䄀一䌀䔀∀Ⰰ ഀഀ
DateTime.Parse("2008-04-11 11:15:07.770"),
਍                䀀∀尀尀䐀伀匀䬀尀䌀䜀䐀伀䌀匀尀㌀㜀㐀㤀㄀㐀尀㐀㘀 㐀㘀尀戀椀氀猀氀椀瀀开㄀ 㠀㄀㘀㐀㠀⸀搀漀挀∀⤀㬀ഀഀ
}
Repo Copy:
/// <summary>
/// Set up method.
/// </summary>
[SetUp]
public override void Setup()
{
this.Scope = new TransactionScope();
this.testItem = new ViewDocumentHistory(
625016,
57996,
'T',
3483529,
DateTime.Parse("2003-01-08 09:57:04.957"),
3,
"Invoice (PG-PS) - SUPP(11/16/2008)",
"FINANCE",
DateTime.Parse("2008-04-11 11:15:07.770"),
@"\\DOSK\CGDOCS\374914\46046\bilslip_1081648.doc");
}
I'm also using a .gitattributes file to ensure line endings are correct since we are developing on Windows.
*.cs eol=crlf text
*.csproj eol=crlf text
*.sln eol=crlf text
*.xml eol=crlf text
The text attribute is an addition by me to attempt to fix the problem, as git diff was interpreting the file as binary when I modified it. It didn't have any effect.
This also occurs on fresh checkouts in 1.7.2.3 but not in 1.6.5.1 (msysgit) as far as I can tell. The caveat is that 1.6 doesn't support .gitattributes, which I need for working on Windows. This seems to be a fairly new bug and I haven't changed any configuration.
Does anyone have any idea what could be causing this?
edit:
hexdump -C ViewDocumentHistoryTests.cs | sed -n "130,212p"
000008d0 00 20 00 20 00 2f 00 2f 00 2f 00 20 00 3c 00 73 |. . ./././. .<.s|
000008e0 00 75 00 6d 00 6d 00 61 00 72 00 79 00 3e 00 0d |.u.m.m.a.r.y.>..|
000008f0 00 0d 0a 00 20 00 20 00 20 00 20 00 20 00 20 00 |.... . . . . . .|
00000900 20 00 20 00 2f 00 2f 00 2f 00 20 00 53 00 65 00 | . ./././. .S.e.|
00000910 74 00 20 00 75 00 70 00 20 00 6d 00 65 00 74 00 |t. .u.p. .m.e.t.|
00000920 68 00 6f 00 64 00 2e 00 0d 00 0d 0a 00 20 00 20 |h.o.d........ . |
00000930 00 20 00 20 00 20 00 20 00 20 00 20 00 2f 00 2f |. . . . . . ././|
00000940 00 2f 00 20 00 3c 00 2f 00 73 00 75 00 6d 00 6d |./. .<./.s.u.m.m|
00000950 00 61 00 72 00 79 00 3e 00 0d 00 0d 0a 00 20 00 |.a.r.y.>...... .|
00000960 20 00 20 00 20 00 20 00 20 00 20 00 20 00 5b 00 | . . . . . . .[.|
00000970 53 00 65 00 74 00 55 00 70 00 5d 00 0d 00 0d 0a |S.e.t.U.p.].....|
00000980 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000990 00 70 00 75 00 62 00 6c 00 69 00 63 00 20 00 6f |.p.u.b.l.i.c. .o|
000009a0 00 76 00 65 00 72 00 72 00 69 00 64 00 65 00 20 |.v.e.r.r.i.d.e. |
000009b0 00 76 00 6f 00 69 00 64 00 20 00 53 00 65 00 74 |.v.o.i.d. .S.e.t|
000009c0 00 75 00 70 00 28 00 29 00 0d 00 0d 0a 00 20 00 |.u.p.(.)...... .|
000009d0 20 00 20 00 20 00 20 00 20 00 20 00 20 00 7b 00 | . . . . . . .{.|
000009e0 0d 00 0d 0a 00 20 00 20 00 20 00 20 00 20 00 20 |..... . . . . . |
000009f0 00 20 00 20 00 20 00 20 00 20 00 20 00 74 00 68 |. . . . . . .t.h|
00000a00 00 69 00 73 00 2e 00 53 00 63 00 6f 00 70 00 65 |.i.s...S.c.o.p.e|
00000a10 00 20 00 3d 00 20 00 6e 00 65 00 77 00 20 00 54 |. .=. .n.e.w. .T|
00000a20 00 72 00 61 00 6e 00 73 00 61 00 63 00 74 00 69 |.r.a.n.s.a.c.t.i|
00000a30 00 6f 00 6e 00 53 00 63 00 6f 00 70 00 65 00 28 |.o.n.S.c.o.p.e.(|
00000a40 00 29 00 3b 00 0d 00 0d 0a 00 0d 00 0d 0a 00 20 |.).;........... |
00000a50 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000a60 00 20 00 20 00 20 00 74 00 68 00 69 00 73 00 2e |. . . .t.h.i.s..|
00000a70 00 74 00 65 00 73 00 74 00 49 00 74 00 65 00 6d |.t.e.s.t.I.t.e.m|
00000a80 00 20 00 3d 00 20 00 6e 00 65 00 77 00 20 00 56 |. .=. .n.e.w. .V|
00000a90 00 69 00 65 00 77 00 44 00 6f 00 63 00 75 00 6d |.i.e.w.D.o.c.u.m|
00000aa0 00 65 00 6e 00 74 00 48 00 69 00 73 00 74 00 6f |.e.n.t.H.i.s.t.o|
00000ab0 00 72 00 79 00 28 00 20 00 0d 00 0d 0a 00 20 00 |.r.y.(. ...... .|
00000ac0 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 | . . . . . . . .|
00000ad0 20 00 20 00 20 00 20 00 20 00 20 00 20 00 36 00 | . . . . . . .6.|
00000ae0 32 00 35 00 30 00 31 00 36 00 2c 00 20 00 0d 00 |2.5.0.1.6.,. ...|
00000af0 0d 0a 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |... . . . . . . |
00000b00 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000b10 00 20 00 35 00 37 00 39 00 39 00 36 00 2c 00 20 |. .5.7.9.9.6.,. |
00000b20 00 0d 00 0d 0a 00 20 00 20 00 20 00 20 00 20 00 |...... . . . . .|
00000b30 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 | . . . . . . . .|
00000b40 20 00 20 00 20 00 27 00 54 00 27 00 2c 00 20 00 | . . .'.T.'.,. .|
00000b50 0d 00 0d 0a 00 20 00 20 00 20 00 20 00 20 00 20 |..... . . . . . |
00000b60 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000b70 00 20 00 20 00 33 00 34 00 38 00 33 00 35 00 32 |. . .3.4.8.3.5.2|
00000b80 00 39 00 2c 00 20 00 0d 00 0d 0a 00 20 00 20 00 |.9.,. ...... . .|
00000b90 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 | . . . . . . . .|
00000ba0 20 00 20 00 20 00 20 00 20 00 20 00 44 00 61 00 | . . . . . .D.a.|
00000bb0 74 00 65 00 54 00 69 00 6d 00 65 00 2e 00 50 00 |t.e.T.i.m.e...P.|
00000bc0 61 00 72 00 73 00 65 00 28 00 22 00 32 00 30 00 |a.r.s.e.(.".2.0.|
00000bd0 30 00 33 00 2d 00 30 00 31 00 2d 00 30 00 38 00 |0.3.-.0.1.-.0.8.|
00000be0 20 00 30 00 39 00 3a 00 35 00 37 00 3a 00 30 00 | .0.9.:.5.7.:.0.|
00000bf0 34 00 2e 00 39 00 35 00 37 00 22 00 29 00 2c 00 |4...9.5.7.".).,.|
00000c00 0d 00 0d 0a 00 20 00 20 00 20 00 20 00 20 00 20 |..... . . . . . |
00000c10 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000c20 00 20 00 20 00 33 00 2c 00 20 00 0d 00 0d 0a 00 |. . .3.,. ......|
00000c30 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 | . . . . . . . .|
*
00000c50 22 00 49 00 6e 00 76 00 6f 00 69 00 63 00 65 00 |".I.n.v.o.i.c.e.|
00000c60 20 00 28 00 50 00 47 00 2d 00 50 00 53 00 29 00 | .(.P.G.-.P.S.).|
00000c70 20 00 2d 00 20 00 53 00 55 00 50 00 50 00 28 00 | .-. .S.U.P.P.(.|
00000c80 31 00 31 00 2f 00 31 00 36 00 2f 00 32 00 30 00 |1.1./.1.6./.2.0.|
00000c90 30 00 38 00 29 00 22 00 2c 00 20 00 0d 00 0d 0a |0.8.).".,. .....|
00000ca0 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
*
00000cc0 00 22 00 46 00 49 00 4e 00 41 00 4e 00 43 00 45 |.".F.I.N.A.N.C.E|
00000cd0 00 22 00 2c 00 20 00 0d 00 0d 0a 00 20 00 20 00 |.".,. ...... . .|
00000ce0 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 | . . . . . . . .|
00000cf0 20 00 20 00 20 00 20 00 20 00 20 00 44 00 61 00 | . . . . . .D.a.|
00000d00 74 00 65 00 54 00 69 00 6d 00 65 00 2e 00 50 00 |t.e.T.i.m.e...P.|
00000d10 61 00 72 00 73 00 65 00 28 00 22 00 32 00 30 00 |a.r.s.e.(.".2.0.|
00000d20 30 00 38 00 2d 00 30 00 34 00 2d 00 31 00 31 00 |0.8.-.0.4.-.1.1.|
00000d30 20 00 31 00 31 00 3a 00 31 00 35 00 3a 00 30 00 | .1.1.:.1.5.:.0.|
00000d40 37 00 2e 00 37 00 37 00 30 00 22 00 29 00 2c 00 |7...7.7.0.".).,.|
00000d50 20 00 0d 00 0d 0a 00 20 00 20 00 20 00 20 00 20 | ...... . . . . |
00000d60 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 |. . . . . . . . |
00000d70 00 20 00 20 00 20 00 40 00 22 00 5c 00 5c 00 44 |. . . .@.".\.\.D|
00000d80 00 4f 00 53 00 4b 00 5c 00 43 00 47 00 44 00 4f |.O.S.K.\.C.G.D.O|
00000d90 00 43 00 53 00 5c 00 33 00 37 00 34 00 39 00 31 |.C.S.\.3.7.4.9.1|
00000da0 00 34 00 5c 00 34 00 36 00 30 00 34 00 36 00 5c |.4.\.4.6.0.4.6.\|
00000db0 00 62 00 69 00 6c 00 73 00 6c 00 69 00 70 00 5f |.b.i.l.s.l.i.p._|
00000dc0 00 31 00 30 00 38 00 31 00 36 00 34 00 38 00 2e |.1.0.8.1.6.4.8..|
00000dd0 00 64 00 6f 00 63 00 22 00 29 00 3b 00 0d 00 0d |.d.o.c.".).;....|
00000de0 0a 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 |.. . . . . . . .|
00000df0 20 00 7d 00 0d 00 0d 0a 00 0d 00 0d 0a 00 20 00 | .}........... .|
It appears this is some sort of encoding problem.
You're saving your files as UTF-16, the encoding that Windows text editors misleadingly call “Unicode”.
UTF-16 is not ASCII-compatible and so won't work properly with the diff tool used by git. What you're getting is a single byte change to the input on every newline (presumably due to conversion between LF and Windows CRLF line endings) causing the two-byte alignment of UTF-16 code units to be out by one, causing the low byte and high byte to be swapped:
original text: < s u m m a r y >
representation in UTF-16LE: 3C 00 73 00 75 00 6D 00 6D 00 61 00 72 00 79 00 3E 00
accidentally misaligned: 00 3C 00 73 00 75 00 6D 00 6D 00 61 00 72 00 79 00 3E
decoded from misaligned: 㰀 猀 甀 洀 洀 愀 爀 礀 㸀
Save your files in an ASCII-compatible encoding and you'll not have this trouble. Preferably: UTF-8-without-BOM.
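The misalignment can be reproduced in a few lines. This is a sketch under one assumption, the same one the answer makes with "presumably": the line-ending conversion works on single bytes and inserts a 0x0D before every 0x0A. The function names are made up for illustration.

```python
def add_cr_bytewise(data: bytes) -> bytes:
    """Insert 0x0D before every 0x0A byte, as a byte-oriented LF -> CRLF
    conversion would (an assumption about what happened to the file)."""
    return data.replace(b"\n", b"\r\n")


def decode_utf16le_lenient(data: bytes) -> str:
    """Decode as UTF-16LE, dropping a trailing odd byte if there is one."""
    usable = len(data) - len(data) % 2
    return data[:usable].decode("utf-16-le")


def garble(text: str) -> str:
    """Show what `text` looks like after the bytewise conversion."""
    original = text.encode("utf-16-le")
    converted = add_cr_bytewise(original)  # one extra byte per newline
    return decode_utf16le_lenient(converted)


garbled = garble("\n<summary>")
```

The single inserted byte shifts every following code unit by one, so each ASCII character c comes out as the character with code ord(c) << 8, which matches the CJK-looking text in the question. A second newline would shift the stream back into alignment, which fits the alternating readable and unreadable lines in the asker's copy.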
