Here's the output dumped from od -cx (on Linux you can reproduce it with echo -ne "\r\n\n" | od -cx):
0000000 \r \n \n \0
0a0d 000a
0000003
The first 2 bytes should be 0d0a, but it outputs 0a0d. Why?
Because you're on a little-endian system? A 16-bit integer is printed high byte first, and on a little-endian machine the high byte is the second of the two bytes in memory; so you see the 2nd byte followed by the 1st.
Because your computer is using the so-called "little-endian" method to represent words in memory (the x86 processor architecture is a common example of a little-endian system).
Because it's reading the data as shorts, not as bytes. A short is 2 bytes, and on a little-endian machine those 2 bytes are stored in the reverse of the order in which the value is printed.
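If you just want to see the bytes in the order they arrive, ask od for one-byte units instead of 16-bit words (a quick check, assuming GNU od):
$ echo -ne "\r\n\n" | od -tx1
0000000 0d 0a 0a
0000003
The endianness question disappears because nothing larger than a single byte is ever assembled.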
Why does hexdump print 00ff here? I expected it to print ff00, like it got on stdin, but
$ printf "\xFF\x00" | hexdump
0000000 00ff
0000002
hexdump decided to reverse it? Why?
This is because hexdump is dumping 16-bit words (the "two-bytes-hex" format) and x86 processors store words in little-endian format (you're probably using such a processor).
From Wikipedia, Endianness:
A big-endian system stores the most significant byte of a word at the
smallest memory address and the least significant byte at the largest.
A little-endian system, in contrast, stores the least-significant byte
at the smallest address.
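To make that concrete for this input (a quick check, not part of the original answer): read as a little-endian 16-bit word, the bytes ff 00 form the value 0x00ff, i.e. decimal 255, which od confirms when asked for two-byte decimal units:
$ printf "\xFF\x00" | od -An -td2
   255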
Notice that, when you use hexdump without specifying a format, the output is similar to -x.
From the hexdump man page:
If no format strings are specified, the default
display is very similar to the -x output format (the
-x option causes more space to be used between format
units than in the default output).
...
-x, --two-bytes-hex
Two-byte hexadecimal display. Display the input offset in
hexadecimal, followed by eight space-separated, four-column,
zero-filled, two-byte quantities of input data, in
hexadecimal, per line.
If you want to dump single bytes in order, use the -C option or specify your own formatting with -e.
$ printf "\xFF\x00" | hexdump -C
00000000 ff 00 |?.|
00000002
$ printf "\xFF\x00" | hexdump -e '"%07.7_ax " 8/1 "%02x " "\n"'
0000000 ff 00
From the hexdump(1) man page:
If no format strings are specified, the default display is very similar to the -x output format
-x, --two-bytes-hex
Two-byte hexadecimal display. Display the input offset in hexadecimal, followed by eight space-separated, four-column, zero-filled, two-byte quantities of input data, in hexadecimal, per line.
On a little-endian host the most significant byte (here 0xFF) is listed last.
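For comparison (not part of the original answer), plain od with one-byte units also keeps the input order:
$ printf "\xFF\x00" | od -An -tx1
 ff 00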
I'd like to run a command similar to:
# echo 00: 0123456789abcdef | xxd -r | od -tx1
0000000 01 23 45 67 89 ab cd ef
0000010
That is, I'd like to input a hex string and have it converted to bytes on stdout. However, I'd like it to respect the byte order of the machine I'm on, which is little-endian. Here's the proof:
# lscpu | grep Byte.Order
Byte Order: Little Endian
So, I'd like it to work as above if my machine were big-endian. But since it isn't, I'd like to see:
# <something different here> | od -tx1
0000000 ef cd ab 89 67 45 23 01
0000010
Now, xxd has a "-e" option for little endianness. But 1) I want machine endianness, because I'd like something that works on both big- and little-endian machines, and 2) "-e" isn't supported with "-r" anyway.
Thanks!
What about this —
$ echo 00: 0123456789abcdef | xxd -r | xxd -g 8 -e | xxd -r | od -tx1
0000000 ef cd ab 89 67 45 23 01
0000010
According to man xxd:
-e
Switch to little-endian hexdump. This option treats byte groups as words in little-endian byte order. The default grouping of 4 bytes may be changed using -g. This option only applies to hexdump, leaving the ASCII (or EBCDIC) representation unchanged. The command line switches -r, -p, -i do not work with this mode.
-g bytes | -groupsize bytes
Separate the output of every bytes bytes (two hex characters or eight bit-digits each) by a whitespace. Specify -g 0 to suppress grouping. Bytes defaults to 2 in normal mode, 4 in little-endian mode and 1 in bits mode. Grouping does not apply to postscript or include style.
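If Perl is available, a native-endianness pack is another way to get machine byte order that works on either kind of host (a sketch, not part of the original answer; it assumes a Perl built with 64-bit integer support). On a little-endian machine it prints:
$ perl -e 'print pack("Q", 0x0123456789abcdef)' | od -tx1
0000000 ef cd ab 89 67 45 23 01
0000010
On a big-endian host the same command would emit the bytes in their original order, because "Q" packs a 64-bit unsigned integer in native byte order.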
I have a csv file which is just a simple comma-separated list of numbers. I want to convert this csv file into a binary file (just a sequence of bytes, with each interpreted number being a number from the csv file).
The reason I am doing this is to be able to import audio data from a spreadsheet of values. For the import (I am using Audacity), I have a few formats to choose from for the binary file:
Encoding:
Signed 8, 24, 16, or 32 bit PCM
Unsigned 8 bit PCM
32 bit or 64 bit float
U-Law
A-Law
GSM 6.10
12, 16, or 24 bit DWVW
VOX ADPCM
Byte Order:
No endianness
Big endian
Little endian
I was moving along the lines of big endian 32-bit float to keep things simple. I wanted to keep things as simple as possible, so I was thinking bash would be the optimal tool.
I have a csv file which is just a simple comma-separated list of numbers. I want to convert this csv file into a binary file [...]
I was moving along the lines of big endian 32-bit float to keep things simple.
Not sure how to do it in pure bash (I actually doubt it is doable, since converting a float to its binary representation is not something the shell provides).
But here it is with a simple Perl one-liner:
$ cat example1.csv
1.0
2.1
3.2
4.3
$ cat example1.csv | perl -ne 'print pack("f>*", split(/\s*,\s*/))' > example1.bin
$ hexdump -C < example1.bin
00000000 3f 80 00 00 40 06 66 66 40 4c cc cd 40 89 99 9a |?...@.ff@L..@...|
00000010
It uses Perl's pack function with f to convert floats to binary, and > to make them big-endian. (I have also added the split in case of multiple numbers per CSV line.)
P.S. The command to convert integers to 16-bit shorts with native endianness:
perl -ne 'print pack("s*", split(/\s*,\s*/))'
Use "s>*" for BE, or "s<*" for LE, instead of the "s*".
P.P.S. If it is audio data, you can also check the sox tool. Haven't used it in ages, but IIRC it could convert anything PCM-like from literally any format to any format, while also applying effects.
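For example, something along these lines should wrap the raw big-endian floats from above in a WAV header (a hedged sketch: the option names are taken from sox(1), and the 8000 Hz rate and single channel are made-up values to replace with your own):
$ sox -t raw -e floating-point -b 32 -B -r 8000 -c 1 example1.bin example1.wav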
I would recommend Python over bash. For this particular task, it's simpler/saner IMO.
#!/usr/bin/env python
import array
with open('input.csv', 'rt') as f:
    text = f.read()
entries = text.split(',')
values = [int(x) for x in entries]
# apply a scale factor here if needed: if your input goes from [-100, 100]
# then you may need to translate/scale it into [-32768, 32767] for
# signed 16-bit PCM
# e.g.:
# values = [(val * scale) for val in values]
with open('output.pcm', 'wb') as out:
    pcm_vals = array.array('h', values)  # 16-bit signed
    pcm_vals.tofile(out)
You could also use Python's wave module instead of just writing raw PCM.
Here's how the example above works:
$ echo 1,2,3,4,5,6,7 > input.csv
$ ./so_pcm.py
$ xxd output.pcm
0000000: 0100 0200 0300 0400 0500 0600 0700 ..............
xxd shows the binary values. It used my machine's native endianness (little).
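If the importing tool insists on big-endian 16-bit PCM instead, one way to swap each byte pair of this little-endian output is dd's conv=swab (a sketch, not part of the original answer; output_be.pcm is a made-up file name):
$ dd if=output.pcm of=output_be.pcm conv=swab
conv=swab exchanges every pair of input bytes, which is exactly the transformation a stream of 16-bit samples needs.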
I came across an unusual space character the other day:
[user@server] ~ $ echo AB583 923 | od -c
0000000 A B 5 8 3 342 200 211 9 2 3 \n
0000014
[user@server] ~ $ echo AB583 923 | od -c
0000000 A B 5 8 3 9 2 3 \n
0000012
I tried to decipher it with the hexadecimal representation command, but I don't understand enough about base-level data to understand what this character really is. Can anyone help me find out?
Well according to this, 342\200\211 is a thin space in Unicode.
What do you mean by "create a space character without using space"?
The value shown by od -c is in octal. To identify the character represented by those three numbers, we have to search for it. Converting the numbers from octal to hex:
342 200 211 = 0xE2 0x80 0x89
Searching for "utf8 0xE2 0x80 0x89" turns up a page which shows that the UTF-8 byte sequence 0xE2 0x80 0x89 corresponds to the Unicode code point U+2009.
That code point is named thin space, which, yes, is a character similar to U+0020, the plain space.
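You can reproduce the same od output directly from those bytes (a quick check using printf hex escapes):
$ printf '\xe2\x80\x89' | od -c
0000000 342 200 211
0000003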
So yes, there are several space-like characters in Unicode, all of them valid and all of them easy to mistake for a simple space.
I just wonder: why are you asking?
I have created a file with UTF-8 encoding, but I don't understand the rules for the size it takes up on disk. Here is my complete research:
First I created the file with a single Hindi letter 'क', and the file size on Windows 7 was 8 bytes.
Now with two letters 'कक' the file size was 11 bytes.
Now with three letters 'ककक' the file size was 14 bytes.
Can someone please explain why it is showing such sizes?
The first three bytes are used for the BOM (Byte Order Mark) EF BB BF.
Then, the bytes E0 A4 95 encode the letter क.
Then the bytes 0D 0A encode a carriage return plus line feed, the Windows end-of-line pair.
Total: 3 + 3 + 2 = 8 bytes. For each letter क you add, you need three more bytes, which matches the 11 and 14 bytes you measured for two and three letters.
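You can rebuild the same byte sequence on any system and check the count (a quick sketch using printf hex escapes: the BOM, the letter क, then CR LF):
$ printf '\xef\xbb\xbf\xe0\xa4\x95\x0d\x0a' | wc -c
8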
On Linux-based systems, you can use hexdump to get the hexadecimal dump (used by Tim in his answer) and see how many bytes a character occupies.
echo -n a | hexdump -C
echo -n क | hexdump -C
Here's the output of the above two commands.